You can still buy used RTX 3090 cards on eBay. Five of them will give you 120GB of VRAM (24GB each) and will blow away any Mac in terms of performance on LLM workloads. They have gone up in price lately and are now about $1100 each, but at one point they were $700-800 each.
FWIW I have never used NVLink, and I’m not sure why people are bringing up “daisy chaining” because as far as I’m aware that is not a thing with modern GPUs at all.
> The mac will just work for models as large as 100B, can go higher with quantized models. And power draw will be 1/5th as much as the 3090 setup.
This setup will work for 100B models as well. And yes, the Mac will draw less power, but the Nvidia machine will be many times faster. So depending on your specific Mac and your specific Nvidia setup, the performance per watt will be in the same ballpark. And higher absolute performance is certainly a nice perk.
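To make the "works for 100B models" claim concrete, here's the back-of-envelope math I'm going by. The numbers are weight storage only and ignore KV cache and activation overhead, so treat the "fits" verdicts as optimistic:

```python
# Rough memory math for a 100B-parameter model vs. 5x 24GB of VRAM.
# Weights only; KV cache and activations add real overhead on top.

TOTAL_VRAM_GB = 5 * 24  # five RTX 3090s

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
    size = model_size_gb(100, bits)
    fits = "fits" if size < TOTAL_VRAM_GB else "does not fit"
    print(f"100B @ {name}: ~{size:.0f} GB -> {fits} in {TOTAL_VRAM_GB} GB")
```

So a 100B model doesn't fit at fp16 (~200GB), but it does at 8-bit (~100GB) and smaller quantizations, which matches the point about quantized models above.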
> You can certainly daisy chain several 3090's together but it doesn't work seamlessly.
Citation needed; there's no "daisy chaining" in the setup I describe, and low-level libraries like PyTorch, as well as higher-level tools like Ollama, all support multiple GPUs seamlessly.
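For the skeptical: here's a conceptual sketch (plain Python, not actual PyTorch or Ollama internals) of the usual scheme these tools use. Each GPU holds a contiguous slice of the model's layers, and activations are handed off between GPUs over PCIe at the slice boundaries; no NVLink or special interconnect required:

```python
# Conceptual illustration of layer-wise model sharding across GPUs.
# Each GPU gets a contiguous, near-equal run of layers; during inference,
# activations flow GPU 0 -> GPU 1 -> ... over ordinary PCIe transfers.

def assign_layers(n_layers: int, n_gpus: int) -> dict[int, list[int]]:
    """Split layers into contiguous, near-equal chunks, one chunk per GPU."""
    assignment = {gpu: [] for gpu in range(n_gpus)}
    for layer in range(n_layers):
        # integer division maps layers to GPUs in contiguous runs
        assignment[layer * n_gpus // n_layers].append(layer)
    return assignment

# e.g. an 80-layer model over 5 GPUs: 16 layers each
plan = assign_layers(80, 5)
for gpu, layers in plan.items():
    print(f"GPU {gpu}: layers {layers[0]}-{layers[-1]} ({len(layers)} layers)")
```

The point is that the split is just bookkeeping the framework does for you; the cards never talk to each other except to pass a (relatively small) activation tensor down the line.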
1800W is the max on a standard US 15A/120V circuit, but yes, a build like this usually stays under 1600W. For LLM inference, limiting the TDP to 225W or so per card saves a lot of power for only about a 5% drop in performance.
> I think it's bad form to say "citation needed" when your original claim didn't include citations.
I apologize, but multi-GPU inference (without any sort of “daisy chaining”) has been supported in most LLM tooling for a long time.
> Regardless - there's a difference between training and inference.
To my knowledge, no one besides you brought up training vs. inference. I was assuming the machine was for inference, because my experience building a machine like the one I described was for inference. If you want to train models, I know less about that, but I'm fairly sure the tooling supports multiple GPUs there too.
> And pytorch doesn't magically make 5 gpus behave like 1 gpu.
I never said it was magic; I just said it was supported, which it is.