Yes, 16GB of RAM is needed for 13B and 32GB for 34B (both at 4-bit). The first time a new model loads there's some warm-up time, I wanna say 30s? After that, context reading and token generation are usually upward of 8 tk/s. Also, the newer and bigger the die, the faster the token generation. A Mac Studio would probably generate 30% or so faster than a MBP.
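Those RAM figures line up with a quick back-of-envelope check. A sketch, under my own assumptions (not from this thread): roughly 0.5 bytes per weight at 4-bit quantization, times a rough 1.3x fudge factor for KV cache, activations, and runtime overhead.

```python
# Rough RAM estimate for running a 4-bit quantized model.
# Assumptions (mine): ~0.5 bytes/weight at 4-bit, plus a ~1.3x
# multiplier for KV cache, activations, and runtime overhead.

def est_ram_gb(params_billions: float,
               bytes_per_param: float = 0.5,
               overhead: float = 1.3) -> float:
    """Approximate resident memory to run a quantized model, in GB."""
    return params_billions * bytes_per_param * overhead

for size in (13, 34):
    print(f"{size}B at 4-bit: ~{est_ram_gb(size):.1f} GB")
```

That puts 13B at roughly 8.5 GB and 34B at roughly 22 GB of working memory, which is why a 16GB machine handles 13B comfortably but 34B wants a 32GB box once you leave headroom for the OS.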
I've managed to run CodeLlama Instruct 13B on my laptop's RTX 3070 (8GB VRAM) at 6 tk/s by offloading 27 layers onto the GPU with llama.cpp.
I've been considering getting a MacBook for running 34B+ LLM inference, but with the speed at which small LLMs are progressing, I think it's better to get a laptop with an RTX 4090 and 16GB of VRAM. Maybe it can run 34B models by offloading layers onto the GPU.
I only have a 16GB computer, so I can't confirm the 34B performance there. On my 3090 with 24GB of VRAM, 34B just fits and runs above 15 tk/s. If you want a laptop and only plan on inference, I think a MBP would be better than a 4090 laptop.