Yes, 16GB of RAM is needed for 13B and 32GB for 34B (both at 4-bit). The first time a new model loads there's some warm-up time, I wanna say 30s? After that, context reading and token generation are usually upward of 8 tk/s. Also, the newer and bigger the die, the faster the token generation. A Mac Studio would probably generate 30% or so faster than a MBP.
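Those RAM figures line up with a quick back-of-envelope check. A sketch, under my own assumptions (not from this thread): roughly 0.5 bytes per weight at 4-bit quantization, times a rough 1.3x fudge factor for KV cache, activations, and runtime overhead.

```python
# Rough RAM estimate for running a 4-bit quantized model.
# Assumptions (mine): ~0.5 bytes/weight at 4-bit, plus a ~1.3x
# multiplier for KV cache, activations, and runtime overhead.

def est_ram_gb(params_billions: float,
               bytes_per_param: float = 0.5,
               overhead: float = 1.3) -> float:
    """Approximate resident memory to run a quantized model, in GB."""
    return params_billions * bytes_per_param * overhead

for size in (13, 34):
    print(f"{size}B at 4-bit: ~{est_ram_gb(size):.1f} GB")
```

That puts 13B at roughly 8.5 GB and 34B at roughly 22 GB of working memory, which is why a 16GB machine handles 13B comfortably but 34B wants a 32GB box once you leave headroom for the OS.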
I've managed to run CodeLlama Instruct 13B on my laptop's RTX 3070 (8GB VRAM) at 6 tk/s by offloading 27 layers onto the GPU with llama.cpp.
I've been considering getting a MacBook for running 34B+ LLM inference, but with the speed at which small LLMs are progressing, I think it's better to get a laptop with an RTX 4090 and 16GB of VRAM. Maybe it can run 34B models by offloading layers onto the GPU.
I only have a 16GB computer, so I can't confirm the 34B performance there. On my 3090 with 24GB of VRAM, 34B just fits and runs above 15 tk/s. If you want a laptop and only plan on inference, I think a MBP would be better than a 4090 laptop.