Hacker News

It seems kinda silly to use a separate service to generate embeddings for t-SNE when you have the embeddings in the model already.


Is it generating embeddings or just coordinates? What would be a better way?


What are embeddings if not "just coordinates"?


Well ... we have to reduce them to a 2D plane to visualize them ...
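A minimal sketch of that reduction step, assuming the embeddings are already in hand as a NumPy array (the 768-dim size and random data are stand-ins for real document embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for real document embeddings: 100 docs x 768 dims.
embeddings = rng.normal(size=(100, 768))

# t-SNE projects the high-dimensional vectors onto a 2D plane
# suitable for scatter-plotting. perplexity must be < n_samples.
coords_2d = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(embeddings)
print(coords_2d.shape)  # (100, 2)
```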


That just makes them higher order coordinates, no?


Higher order, yes, but since these 2D coordinates necessarily contain less information than the original vectors, it's possible they contain only noise.


Something needs to generate the document embeddings since the LLM itself won't


No, this is completely wrong. You can get embeddings from the LLM itself, e.g. from the last layer.
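A sketch of what "using the last layer" typically looks like: mean-pool the per-token vectors into one fixed-size document vector. Here `hidden_states` is a random stand-in for what a real model would return (e.g. `model(**inputs).last_hidden_state` in Hugging Face transformers):

```python
import numpy as np

def pool_last_layer(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token vectors into one embedding, ignoring padding positions."""
    mask = attention_mask[:, None].astype(hidden_states.dtype)  # (seq_len, 1)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

seq_len, hidden_dim = 7, 16  # illustrative sizes
hs = np.random.default_rng(1).normal(size=(seq_len, hidden_dim))
mask = np.array([1, 1, 1, 1, 1, 0, 0])  # last two positions are padding
vec = pool_last_layer(hs, mask)
print(vec.shape)  # (16,) -- fixed-size regardless of sequence length
```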


Doesn't the last layer output a variable-size vector based on seq length? It'd take a bit of hacking to get it to be a semantic vector.

Additionally, that vector is trained to predict the next token rather than to encode semantic similarity. I'd assume models trained specifically for semantic similarity would outperform it (I haven't compared the two myself, but the MMTEB benchmark seems to imply so).

At that point - it seems quite reasonable to just pass the sentence into an embedding model.



