It's using similarity search[0]. You transform audio files into vectors using some embedding model like VGGish[1], then index those vectors. Then you turn the query audio file into a vector the same way and search the index for the vectors most similar to your query.
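A rough sketch of that workflow (not a definitive implementation): `embed_audio` below is a hypothetical stand-in for whatever embedding model you use (VGGish or otherwise), and I'm using FAISS for the index, but any vector index works the same way.

```python
# Sketch: embed audio files, index the vectors, then query by similarity.
import numpy as np
import faiss  # exact/approximate nearest-neighbor search over vectors

def embed_audio(path: str) -> np.ndarray:
    """Hypothetical placeholder: in practice, load the file, run it through
    an embedding model (e.g. VGGish), and pool per-frame embeddings into
    one fixed-size vector. Here we just return a deterministic fake vector."""
    rng = np.random.default_rng(abs(hash(path)) % 2**32)
    return rng.random(128).astype("float32")

corpus = ["track_01.wav", "track_02.wav", "track_03.wav"]  # hypothetical files
vectors = np.stack([embed_audio(p) for p in corpus])

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over embeddings
index.add(vectors)

query = embed_audio("query_clip.wav")[None, :]
distances, ids = index.search(query, k=3)    # 3 nearest neighbors
print([corpus[i] for i in ids[0]])
```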
Full disclosure: I work for Pinecone, which provides similarity search as a service... Including for audio data.
Can you comment (either generally or wrt your product) on specificity/generality/constraints when searching for audio? The image-based examples are intuitive, but what features can one search for in audio? e.g. Can it detect a specific singer's voice, or an instrument or effect? Or only more general features like genre, tempo, etc.? I assume it's some combination, but it's not clear to me what this type of neural search is "good at".
Looking at the VGGish paper itself, I see they use spectrograms as inputs, and they show results where they can identify instrument types. I'm not too sure how specific embeddings from these models can be. Do we know if spectrograms can differentiate between two people's voices?
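For reference, the input isn't a raw spectrogram image but log-mel spectrogram patches. A minimal sketch of that preprocessing with librosa, with parameters approximating my reading of the paper (16 kHz mono, 25 ms window, 10 ms hop, 64 mel bands, ~0.96 s patches), so treat the exact numbers as assumptions:

```python
# Sketch of the log-mel spectrogram patches a VGGish-style model consumes.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000, mono=True)      # hypothetical file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)       # 25 ms window, 10 ms hop
log_mel = np.log(mel + 0.01)                                 # stabilized log

# Frame into non-overlapping ~0.96 s patches (64 bands x 96 frames);
# each patch is one "image" fed to the convolutional network.
frames_per_patch = 96
n_patches = log_mel.shape[1] // frames_per_patch
patches = np.stack([
    log_mel[:, i * frames_per_patch:(i + 1) * frames_per_patch]
    for i in range(n_patches)
])
print(patches.shape)  # (n_patches, 64, 96)
```

Whether those patches preserve enough speaker-specific detail (pitch contour, formant structure) to separate two voices probably depends more on what the network was trained to discriminate than on the spectrogram itself.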