
> run great

How many tokens/second is that approx?

For reference, Qwen 2.5 32B on CPU (5950X) with GPU offloading (to an RTX 3090 Ti) gets about 8.5 tokens/s, while 14B (fully on GPU) gets about 64 tokens/s.



For 70B models, I usually get 15-25 t/s on my laptop. Obviously that depends heavily on the quant, context length, etc. I usually roll with q5 quants, since the quality loss is so minuscule.
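
To put rough numbers on the "depends on which quant" point, here is a back-of-the-envelope sketch of how much memory the weights alone of a 70B model would take at a few bit widths. The helper name `weight_gib` and the nominal bits-per-weight values are my own assumptions for illustration; real quant formats store extra scale metadata, so actual files come out somewhat larger, and the KV cache for the context is not counted at all.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Ignores KV cache, activations, and per-block quantization metadata.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal bit widths (assumed for illustration, not taken from the thread).
NOMINAL_BITS = {"fp16": 16, "q8": 8, "q5": 5, "q4": 4}

if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"70B @ {name}: ~{weight_gib(70, bits):.1f} GiB of weights")
```

Since generation on most hardware tends to be limited by memory bandwidth, fewer bytes of weights to read per token is a big part of why a smaller quant usually runs faster.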



