Many of the top-performing models on the MTEB retrieval leaderboards for English and Chinese tend to overfit to the benchmark nowadays. voyage-3 and voyage-3-lite are also fairly small compared to many of the 7B models that take the top spots, and we don't want to hurt performance on other real-world tasks just to do well on MTEB.
We provide retrieval metrics for a variety of datasets and languages: https://blog.voyageai.com/2024/09/18/voyage-3/. I also personally encourage folks to either test on their own data or to find an open-source dataset that closely resembles the documents they are trying to search (we provide a ton of free tokens for evaluating our models).
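To make the "test on your own data" suggestion concrete, here's a minimal sketch of a recall@k retrieval evaluation. The toy vectors below stand in for whatever embeddings you get back from an embedding model; the dataset, function names, and numbers are all illustrative, not part of any particular API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_at_k(query_vecs, doc_vecs, relevant, k=3):
    # relevant[i] is the set of doc indices relevant to query i.
    hits = 0
    for i, q in enumerate(query_vecs):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda j: cosine(q, doc_vecs[j]),
                        reverse=True)
        if any(j in relevant[i] for j in ranked[:k]):
            hits += 1
    return hits / len(query_vecs)

# Toy 2-d vectors standing in for real document/query embeddings:
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [{0}, {1}]
print(recall_at_k(queries, docs, relevant, k=1))  # 1.0
```

Swap the toy vectors for real embeddings of your own queries and documents, and you get a quick contamination-free signal of how a model ranks on data that actually matters to you.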
It is unclear whether this model should be on that leaderboard, because we don't know whether it has been trained on MTEB test data.
It is worth noting that their own published material [0] does not include any scores from any dataset in the MTEB benchmark.
This may sound nitpicky, but given transformers' capacity for parroting, having seen the test data during training should be expected to completely invalidate those scores.
But I don't see them when I filter the list for 'voyage'.