Hacker News

Is there an explanation of what exactly it does? It seems like this is searching audio, but based on what, and what audio?


It's using similarity search[0]. You transform audio files into vectors using an embedding model like VGGish[1], then index those vectors. Then you turn the query audio file into a vector the same way and search the index for vectors similar to your query.

Full disclosure: I work for Pinecone, which provides similarity search as a service... including for audio data.

[0] https://www.pinecone.io/learn/what-is-similarity-search/

[1] https://tfhub.dev/google/vggish/1
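To make the embed-then-index pipeline concrete, here's a minimal sketch in NumPy. The `embed` function is a hypothetical stand-in for a real model like VGGish (which you'd load from TF Hub); it just averages log spectral magnitudes into a fixed-size unit vector. The "index" is a plain matrix, and search is a brute-force cosine-similarity ranking, which is what services like Pinecone replace with approximate nearest-neighbor structures at scale.

```python
import numpy as np

# Hypothetical stand-in for an audio embedding model such as VGGish:
# maps raw samples to a fixed-size vector. A real model would be loaded
# from TF Hub; here we just pool log-magnitude spectrum bands.
def embed(audio: np.ndarray, dim: int = 128) -> np.ndarray:
    spectrum = np.abs(np.fft.rfft(audio))
    bands = np.array_split(np.log1p(spectrum), dim)
    vec = np.array([b.mean() for b in bands])
    return vec / (np.linalg.norm(vec) + 1e-9)  # unit-normalize

# "Index": one embedding row per audio file in the library.
rng = np.random.default_rng(0)
library = [rng.standard_normal(16000) for _ in range(100)]  # fake 1 s clips
index = np.stack([embed(clip) for clip in library])

# Query: embed the same way, rank by cosine similarity (dot product of
# unit vectors), and take the top-k nearest neighbors.
query = library[42] + 0.01 * rng.standard_normal(16000)  # noisy copy of clip 42
scores = index @ embed(query)
top_k = np.argsort(scores)[::-1][:5]
print(top_k[0])  # clip 42 ranks first
```

The brute-force `index @ query` scan is O(n·dim) per query; the point of a dedicated vector index (HNSW, IVF, etc.) is to get sublinear search over millions of embeddings.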


Can you comment (either generally or wrt your product) on specificity/generality/constraints when searching for audio? The image-based examples are intuitive, but what features can one search for in audio? i.e. Can it detect a specific singer's voice, or an instrument or effect? Only more general features like genre, tempo, etc? I assume it's some combination, but it's not clear to me what this type of neural search is "good at".


Looking at the VGGish paper itself, I see they use spectrograms as inputs and show results where they identify instrument types. I'm not sure how specific the embeddings from these models can be. Do we know if spectrograms can differentiate between two people's voices?


A spectrogram seems too coarse-grained to make that differentiation, but I would have thought the same thing about instrument types.
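For reference, a spectrogram is just a short-time Fourier transform: windowed frames of audio, each mapped to log spectral magnitudes. VGGish's frontend is a log-mel spectrogram (25 ms windows, 10 ms hops, mel filterbank); the sketch below skips the mel filterbank for brevity and uses assumed window/hop sizes.

```python
import numpy as np

# Minimal log-magnitude spectrogram via a short-time Fourier transform,
# assuming mono audio at 16 kHz. Window/hop sizes here (400/160 samples,
# i.e. 25 ms / 10 ms) mirror common speech-model frontends.
def spectrogram(audio, win=400, hop=160):
    window = np.hanning(win)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)      # 1 s of A4
spec = spectrogram(tone)
print(spec.shape)                        # (n_frames, win // 2 + 1)
```

Each row is one time frame, each column one frequency bin, so a pure 440 Hz tone shows up as a single bright horizontal band. Whether that 2-D representation preserves enough speaker-specific detail (formant structure, pitch contour) for voice discrimination is exactly the open question above.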


Seems to be based on features generated using a model similar to the one proposed in [1].

[1] https://research.google/pubs/pub45611/

