It's using similarity search[0]. You transform audio files into vectors using some embedding model like VGGish[1], then index those vectors. Then you turn the query audio file into a vector the same way and search the index for the vectors most similar to your query.
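A rough sketch of that workflow (not a definitive implementation): `embed_audio` below is a hypothetical stand-in for whatever embedding model you use (VGGish or otherwise), and I'm using FAISS for the index, but any vector index works the same way.

```python
# Sketch: embed audio files, index the vectors, then query by similarity.
import numpy as np
import faiss  # exact/approximate nearest-neighbor search over vectors

def embed_audio(path: str) -> np.ndarray:
    """Hypothetical placeholder: in practice, load the file, run it through
    an embedding model (e.g. VGGish), and pool per-frame embeddings into
    one fixed-size vector. Here we just return a deterministic fake vector."""
    rng = np.random.default_rng(abs(hash(path)) % 2**32)
    return rng.random(128).astype("float32")

corpus = ["track_01.wav", "track_02.wav", "track_03.wav"]  # hypothetical files
vectors = np.stack([embed_audio(p) for p in corpus])

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over embeddings
index.add(vectors)

query = embed_audio("query_clip.wav")[None, :]
distances, ids = index.search(query, k=3)    # 3 nearest neighbors
print([corpus[i] for i in ids[0]])
```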
Full disclosure: I work for Pinecone, which provides similarity search as a service... Including for audio data.
Can you comment (either generally or wrt your product) on specificity/generality/constraints when searching for audio? The image-based examples are intuitive, but what features can one search for in audio? e.g. Can it detect a specific singer's voice, or an instrument or effect? Or only more general features like genre, tempo, etc.? I assume it's some combination, but it's not clear to me what this type of neural search is "good at".
Looking at the VGGish paper itself, I see they use spectrograms as inputs, and they show results where they can identify instrument types. I'm not too sure how specific embeddings from these models can be. Do we know if spectrograms can differentiate between two people's voices?
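For reference, the input isn't a raw spectrogram image but log-mel spectrogram patches. A minimal sketch of that preprocessing with librosa, with parameters approximating my reading of the paper (16 kHz mono, 25 ms window, 10 ms hop, 64 mel bands, ~0.96 s patches), so treat the exact numbers as assumptions:

```python
# Sketch of the log-mel spectrogram patches a VGGish-style model consumes.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000, mono=True)      # hypothetical file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)       # 25 ms window, 10 ms hop
log_mel = np.log(mel + 0.01)                                 # stabilized log

# Frame into non-overlapping ~0.96 s patches (64 bands x 96 frames);
# each patch is one "image" fed to the convolutional network.
frames_per_patch = 96
n_patches = log_mel.shape[1] // frames_per_patch
patches = np.stack([
    log_mel[:, i * frames_per_patch:(i + 1) * frames_per_patch]
    for i in range(n_patches)
])
print(patches.shape)  # (n_patches, 64, 96)
```

Whether those patches preserve enough speaker-specific detail (pitch contour, formant structure) to separate two voices probably depends more on what the network was trained to discriminate than on the spectrogram itself.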