Hacker News

Yes, 16GB of RAM is needed for 13B, 32GB for 34B (both at 4-bit). The first time it loads a new model there's some warm-up time, I wanna say 30s? After that, context reading and token generation are usually upward of 8 tk/s. Also, the newer and bigger the die, the faster the token generation. Like, a Mac Studio would probably generate 30% or so faster than a MBP.
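For reference, a back-of-the-envelope sketch of why those RAM figures line up with 4-bit quantization. The bits-per-weight (~4.5, roughly what mixed 4-bit schemes like Q4_K_M land at) and the overhead factor for KV cache and runtime buffers are ballpark assumptions, not measurements:

```python
# Rough memory estimate for a 4-bit-quantized model.
# bits_per_weight ~4.5 approximates common 4-bit quant schemes;
# overhead covers KV cache, activations, and runtime buffers.
# Both factors are assumptions, not measured values.
def est_gib(n_params, bits_per_weight=4.5, overhead=1.2):
    return n_params * bits_per_weight / 8 * overhead / 2**30

print(f"13B: ~{est_gib(13e9):.1f} GiB")  # under 16GB
print(f"34B: ~{est_gib(34e9):.1f} GiB")  # over 16GB, under 32GB
```

That's consistent with 13B fitting on a 16GB machine while 34B needs the 32GB tier.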


8 tk/s on 34B?

I've managed to run CodeLlama-Instruct 13B on my laptop's RTX 3070 (8GB VRAM) at 6 tk/s by offloading 27 layers onto the GPU with llama.cpp
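The layer-count math behind that kind of partial offload looks roughly like this. The 40-layer count for a 13B LLaMA-family model, the GGUF file size, and the VRAM reserve are all assumptions for illustration; this is a sketch, not llama.cpp's actual allocator:

```python
import math

# Sketch: how many layers fit in VRAM when partially offloading with
# llama.cpp's -ngl / --n-gpu-layers. All sizes are rough assumptions.
model_gb = 7.9      # approx. size of a 13B 4-bit GGUF (assumption)
n_layers = 40       # transformer blocks in a 13B LLaMA-family model
vram_gb = 8.0       # RTX 3070 laptop GPU
reserve_gb = 2.0    # headroom for KV cache, scratch buffers, display

per_layer_gb = model_gb / n_layers
layers_that_fit = math.floor((vram_gb - reserve_gb) / per_layer_gb)
print(layers_that_fit)
```

With those numbers you land in the high-20s/low-30s of layers, which matches offloading 27 and leaving some KV-cache headroom.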

I've been considering getting a MacBook for running 34B+ LLM inference, but with the speed at which small LLMs are progressing, I think it's better to get a laptop with an RTX 4090 and 16GB of VRAM. Maybe it can run 34B models by offloading layers onto the GPU.


I only have a 16GB Mac so I can't confirm the 34B performance there. I do have a 3090 with 24GB of VRAM, and 34B just fits and runs above 15 tk/s. If you want a laptop and only plan on inference, I think a MBP would be better than a 4090 laptop.


No warm-up if you switch to Metal with no ANE on Sonoma



