For 70B models, I usually get 15-25 t/s on my laptop. Obviously that heavily depends on which quant, context length, etc. I usually roll with q5s, since the loss is so minuscule.
How many tokens/second is that approx?
For reference, Qwen 2.5 32B on CPU (5950X) with GPU offloading (to an RTX 3090 Ti) gets about 8.5 tokens/s, while 14B (fully on GPU) gets about 64 tokens/s.
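For anyone wanting to compare numbers themselves: throughput is just tokens generated divided by wall-clock time. A minimal sketch, where `generate` is a hypothetical stand-in for whatever local-LLM binding you actually use (llama.cpp, Ollama, etc. all report this for you anyway):

```python
import time

def tokens_per_second(generate, prompt, max_tokens):
    """Time a generation call and return throughput in tokens/s."""
    start = time.perf_counter()
    n_generated = generate(prompt, max_tokens)  # assumed to return the token count
    elapsed = time.perf_counter() - start
    return n_generated / elapsed

# Toy stand-in generator: pretends to emit 10 tokens in ~0.1 s (~100 t/s).
def fake_generate(prompt, max_tokens):
    time.sleep(0.1)
    return 10

rate = tokens_per_second(fake_generate, "hello", 10)
print(f"{rate:.1f} tokens/s")
```

Note this measures decode throughput only; prompt-processing (prefill) speed is a separate number and usually much higher.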