
You can run local models on a 10-year-old laptop. As always, the answer is "it depends".

The things you need: memory bandwidth, memory capacity, and compute. The more of each, the better. The 4060 generally has very poor memory bandwidth (worse than the 3060) because of its narrower memory bus, but being able to offload more of the model to the GPU is still generally better.
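To put a number on the bus-width point, here is a rough sketch of how peak memory bandwidth follows from bus width and per-pin data rate. The bus widths and data rates below are assumptions taken from commonly quoted spec-sheet figures (192-bit at 15 Gbps for the 12GB 3060, 128-bit at 17 Gbps for the 4060), not something stated above, so check them against your exact card.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


# Assumed spec-sheet values (bus width in bits, data rate in Gbps per pin).
cards = {
    "RTX 3060 12GB": (192, 15.0),
    "RTX 4060": (128, 17.0),
}

for name, (bus_bits, rate) in cards.items():
    print(f"{name}: ~{peak_bandwidth_gb_s(bus_bits, rate):.0f} GB/s")
```

With those assumed figures, the narrower bus on the 4060 more than cancels out its faster memory chips.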

32GB systems can load 8B models at fp16, 12B at 8 bits, 30B at 4 bits, or 70B at 2 bits (roughly speaking). 64GB would be a good minimum if you want to use 70B at 4 bits. Without significant GPU offloading it will be very slow, though.
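The arithmetic behind those sizes is just parameter count times bits per weight. A minimal sketch, assuming 1 GB = 1e9 bytes and ignoring the extra space real quantized formats spend on scales, plus whatever the OS, KV cache, and runtime need (the function name is made up for illustration):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# The combinations mentioned above, plus 70B at 4 bits.
for params, bits in [(8, 16), (12, 8), (30, 4), (70, 2), (70, 4)]:
    print(f"{params}B at {bits} bits: ~{weight_gb(params, bits):.1f} GB")
```

The first four land between 12 and 18 GB, which leaves headroom on a 32GB machine; 70B at 4 bits is about 35 GB for the weights alone, hence the 64GB suggestion.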

If you want to process long contexts in a decent amount of time, it's best to run models with flash attention, which requires you to have the KV cache on the GPU. It also lets you use a 4-bit cache, which roughly quadruples the amount of context you can fit compared with an fp16 cache.
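To see where the 4x comes from, here is a back-of-the-envelope KV cache calculation. The model shape (32 layers, 8 KV heads, head dimension 128) and the 4 GiB budget are assumptions picked for illustration, and the function names are hypothetical. Real 4-bit cache formats also store scale factors, so the actual gain is a bit under 4x.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bits: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bits // 8


def max_context_tokens(budget_bytes: int, n_layers: int, n_kv_heads: int,
                       head_dim: int, bits: int) -> int:
    """How many tokens of KV cache fit in the given memory budget."""
    return budget_bytes // kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits)


# Assumed, illustrative model shape and VRAM budget for the cache.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
budget = 4 * 2**30  # 4 GiB

for bits in (16, 8, 4):
    tokens = max_context_tokens(budget, bits=bits, **shape)
    print(f"{bits}-bit cache: ~{tokens} tokens in 4 GiB")
```

Under these assumptions the per-token cost drops from 128 KiB at fp16 to 32 KiB at 4 bits, so the same budget holds four times as many tokens.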


