
It would be nice to have comparisons to Claude 3.5 for the coder model. Comparing only to open-source models isn't super helpful, because I'd want to compare against the model I'm currently using for development work.


Aider will probably have some numbers at https://aider.chat/docs/leaderboards/


They've posted their own run of the Aider benchmark [1] if you want to compare, it achieved 57.1%.

[1]: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5/Qwen...


Oof. I'm really not sure why companies keep releasing these mini coding models; 57.1% is worse than gpt-3.5-turbo, and running it locally will be slower than OpenAI's API. I guess you could use it if you took your laptop into the woods, but with such poor coding ability, would you even want to?

The Qwen2.5-72B model seems to do pretty well on coding benchmarks, though there are no Aider numbers for it yet.


Here is a comparison of the prompt "I want to create a basic Flight simulator in Bevy and Rust. Help me figure out the core properties I need for take off, in air flight and landing" between Claude Sonnet 3.5 and Qwen2.5-14B-Instruct-Q4_K_M.gguf:

https://gist.github.com/victorb/7749e76f7c27674f3ae36d791e20...

AFAIK, there aren't any (micro)benchmark comparisons out yet.


14B with Q4_K_M quantization is about 9 GB.

Remarkable that it is at all comparable to Sonnet 3.5.
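The 9 GB figure checks out as back-of-the-envelope arithmetic. This is just a sketch; the ~14.8B parameter count and the ~4.85 bits/weight average for Q4_K_M are my assumptions, not numbers from this thread:

```python
# Rough sanity check of the ~9 GB figure for the 14B model at Q4_K_M.
# Assumptions: Qwen2.5-14B has roughly 14.8e9 parameters, and llama.cpp's
# Q4_K_M quantization averages about 4.85 bits per weight across tensors.

def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate the quantized model file size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

size = gguf_size_gb(14.8e9, 4.85)
print(f"~{size:.1f} GB")  # comes out just under 9 GB, matching the figure above
```

(The real file is slightly larger than the raw weight estimate because some tensors, like embeddings, are kept at higher precision.)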


Comparable, I guess. But the result is a lot worse than Sonnet's, for sure. Parts of the example code don't make much sense. Sonnet, meanwhile, seems to be aware of the latest Bevy API, and its output mostly makes sense.


This might be what you are asking for... https://qwenlm.github.io/blog/qwen2.5-coder/

Ctrl-F "Code Reasoning":