Many of the top-performing models on the MTEB retrieval leaderboards for English and Chinese tend to overfit to the benchmark nowadays. voyage-3 and voyage-3-lite are also fairly small compared to many of the 7B models that take the top spots, and we don't want to hurt performance on other real-world tasks just to do well on MTEB.
We provide retrieval metrics for a variety of datasets and languages: https://blog.voyageai.com/2024/09/18/voyage-3/. I also personally encourage folks to either test on their own data or to find an open-source dataset that closely resembles the documents they are trying to search (we provide a ton of free tokens for evaluating our models).
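To make the "test on your own data" suggestion concrete, here's a minimal sketch of a recall@k retrieval evaluation. The toy vectors below stand in for whatever embeddings you get back from an embedding model; the dataset, function names, and numbers are all illustrative, not part of any particular API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_at_k(query_vecs, doc_vecs, relevant, k=3):
    # relevant[i] is the set of doc indices relevant to query i.
    hits = 0
    for i, q in enumerate(query_vecs):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda j: cosine(q, doc_vecs[j]),
                        reverse=True)
        if any(j in relevant[i] for j in ranked[:k]):
            hits += 1
    return hits / len(query_vecs)

# Toy 2-d vectors standing in for real document/query embeddings:
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [{0}, {1}]
print(recall_at_k(queries, docs, relevant, k=1))  # 1.0
```

Swap the toy vectors for real embeddings of your own queries and documents, and you get a quick contamination-free signal of how a model ranks on data that actually matters to you.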
It is unclear whether this model should be on that leaderboard, because we don't know whether it has been trained on MTEB test data.
It is worth noting that their own published material [0] does not include any scores from any dataset in the MTEB benchmark.
This may sound nitpicky, but given transformers' capacity for parroting, having seen the test data during training should be expected to completely invalidate those scores.
But I don't see them when I filter the list for 'voyage'.