For 70B models, I usually get 15-25 t/s on my laptop. Obviously that heavily depends on which quant, context length, etc. I usually roll with q5s, since the loss is so minuscule.
How many tokens/second is that approx?
For reference, Qwen 2.5 32B on CPU (5950X) with GPU offloading (to an RTX 3090 Ti) gets about 8.5 tokens/s, while 14B (fully on GPU) gets about 64 tokens/s.
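For anyone wanting to compare numbers themselves: throughput is just tokens generated divided by wall-clock time. A minimal sketch, where `generate` is a hypothetical stand-in for whatever local-LLM binding you actually use (llama.cpp, Ollama, etc. all report this for you anyway):

```python
import time

def tokens_per_second(generate, prompt, max_tokens):
    """Time a generation call and return throughput in tokens/s."""
    start = time.perf_counter()
    n_generated = generate(prompt, max_tokens)  # assumed to return the token count
    elapsed = time.perf_counter() - start
    return n_generated / elapsed

# Toy stand-in generator: pretends to emit 10 tokens in ~0.1 s (~100 t/s).
def fake_generate(prompt, max_tokens):
    time.sleep(0.1)
    return 10

rate = tokens_per_second(fake_generate, "hello", 10)
print(f"{rate:.1f} tokens/s")
```

Note this measures decode throughput only; prompt-processing (prefill) speed is a separate number and usually much higher.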