
That’s 68 billion parameters. It probably does not fit in RAM. Though if you encode each parameter using one byte, you would need 68 GB of RAM, which you could get on a workstation at this point.
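
A quick sanity check of that arithmetic, sketched in Python (weights only, ignoring activations and KV cache; 68B is the figure above, swap in whatever parameter count applies):

    # Rough RAM needed just to hold the weights at a given precision.
    def weight_ram_gb(n_params: float, bits_per_param: float) -> float:
        return n_params * bits_per_param / 8 / 1e9

    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: {weight_ram_gb(68e9, bits):.0f} GB")
    # 16-bit: 136 GB
    #  8-bit:  68 GB
    #  4-bit:  34 GB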


It fits. llama.cpp uses 4-bit quantization; the 13B model takes a little more than 8 GB, and around 9 GB of RAM while inferencing.


Everyone with “only” 64GB of RAM is pouting today, including me


More like finally "proven right" for having needlessly fed 4/5ths of 64 GB to Chrome since 2018


You can run LLaMA using 4 bits per parameter; 64 GB of RAM is more than enough.


4 bits is ridiculously little. I'm very curious what makes these models so robust to quantization.


Read "The Case for 4-bit Precision": https://arxiv.org/abs/2212.09720

Spoiler: it's the parameter count. As parameter count goes up, bit depth matters less.

It just so happens that at around 10B+ parameters you can quantize down to 4-bit with essentially no downsides. Models are that big now, so there's no need to waste RAM on unnecessary precision for each parameter.
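
To make the memory/accuracy trade-off concrete, here is a toy block-wise 4-bit quantizer in Python/NumPy. It is only a sketch in the spirit of llama.cpp's Q4_0, not the real format (which packs two 4-bit values per byte, among other differences): each block of 32 weights shares one float scale, so storage works out to roughly 4.5 bits per weight.

    import numpy as np

    BLOCK = 32  # weights per block, all sharing one scale

    def quantize_q4(w: np.ndarray):
        # Symmetric 4-bit quantization: each weight becomes an integer in [-8, 7].
        w = w.reshape(-1, BLOCK)
        scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-12)
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return (q.astype(np.float32) * scale).reshape(-1)

    w = np.random.randn(4096 * 4096).astype(np.float32)
    q, scale = quantize_q4(w)
    print("mean abs rounding error:", np.abs(dequantize_q4(q, scale) - w).mean())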


For completeness, there's also another paper that demonstrated you get more accuracy per bit at 4 bits than at any other level of precision (including 2 bits and 3 bits).


That's the paper I referenced. But newer research is already challenging it.

'Int-4 llama is not enough - Int-3 and beyond' [0] suggests 3-bit is best for models larger than ~10B parameters when combining binning and GPTQ (a toy sketch of binning follows below).

[0] https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...
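
For anyone wondering what "binning" might look like in code, here is a toy non-uniform 3-bit quantizer: per group of weights, pick 8 codebook values from quantiles and snap each weight to its nearest code. This is only my reading of the term (the group size of 128 is an arbitrary choice), and it leaves out the GPTQ-style error correction the post combines it with:

    import numpy as np

    GROUP = 128  # weights per group (arbitrary choice for illustration)

    def quantize_3bit_binned(w: np.ndarray):
        w = w.reshape(-1, GROUP)
        # 8 codebook entries per group, taken at evenly spaced quantiles.
        codes = np.quantile(w, np.linspace(0.0625, 0.9375, 8), axis=1).T  # (groups, 8)
        # Index of the nearest codebook entry for every weight: 3 bits each.
        idx = np.abs(w[:, :, None] - codes[:, None, :]).argmin(axis=2)
        return idx.astype(np.uint8), codes

    def dequantize(idx: np.ndarray, codes: np.ndarray) -> np.ndarray:
        return np.take_along_axis(codes, idx, axis=1).reshape(-1)

    w = np.random.randn(1 << 16).astype(np.float32)
    idx, codes = quantize_3bit_binned(w)
    print("mean abs error:", np.abs(dequantize(idx, codes) - w).mean())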


What if you have around 400GB of RAM? Would this be enough?


What I'm referring to requires around 67GB of RAM. With 400GB I would imagine you are in good shape for running most of these GPT-type models.


Seems to use around 40 GB of RAM here...



