Nvidia Turing (RTX 20) definitely marked a major shift IMO.
- It was the first card to enable real-time ray-traced effects.
- Mesh shaders are a significant overhaul of the geometry pipeline that's only recently getting real traction.
- Its tensor cores enabled a new generation of AI-driven upscaling/antialiasing. DLSS 2, FSR 4 and XeSS are all some variation of "TAA + neural networks", and these all rely on specialized matrix hardware to get optimal performance.
Obviously, all of these features are now supported across all vendors: Intel Arc Alchemist has them all, and AMD got RT and mesh shader support with RDNA2 while slowly building up to tensor cores with RDNA3/4. But Turing clearly debuted these features, which have majorly changed the landscape of realtime 3D graphics.
838 seems to be the real dense INT8 TOPS number for the 5090; getting from ~838 to the ~3400 headline figure takes a 2x speedup for sparsity (i.e. skipping ops) and another 2x speedup for FP4 over INT8.
So it's closer to half the speed than a tenth. Intel also seems to be positioning this card against the RTX PRO 4000 Blackwell, not the 5090, and that one gets more like 300 INT8 TOPS. It also has less memory but at a slightly higher bandwidth. The 5090 is much faster and IIRC priced similarly to the PRO 4000, but is also decidedly a consumer product which, especially for Nvidia, comes with limitations (e.g. no server-friendly form factor cards available, and there are or used to be driver license restrictions that prevented using a consumer card in a data center setup).
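The multiplier math above can be sketched in a few lines. The factors here are assumptions based on the usual marketing conventions (2:4 structured sparsity and FP4-over-INT8 each counted as a doubling), not an official breakdown:

```python
# Decomposing the 5090's headline TOPS figure (assumed multipliers):
dense_int8_tops = 838    # real dense INT8 throughput
sparsity_factor = 2      # 2:4 structured sparsity skips half the ops
fp4_factor = 2           # FP4 counted at twice the INT8 rate

headline_tops = dense_int8_tops * sparsity_factor * fp4_factor
print(headline_tops)  # 3352, close to the marketed ~3400
```

Which is why comparing a dense INT8 number against a sparse FP4 number overstates the gap by about 4x.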
Thank you for the correction. That seemed way too lopsided to be believed. This assessment balances the memory-to-TOPS ratio much more evenly, which is to be expected! I was low-key hoping someone would help me make sense of the wildly disparate figures, but I wasn't seeing it.
To throw one more card into the mix: the AMD R9700 is 378/766 INT8 TOPS dense/sparse, with 32GB of memory at 644GB/s, for ~$1400. Intel is undercutting that nicely here.
You're right that for companies, the pro grade matters; for us mere mortals, much less so. Features like SR-IOV, however, are just fantastic to see! Good job, Intel. AMD has been trickling out such capabilities for a decade (cards fused off for "MxGPU" capability), and it makes for a much easier buy to just offer it straight up across the models.
Aren't Intel's Xeon "Rapids" (P-core) and "Forest" (E-core) lines just different target markets? The Rapids parts have fewer but faster cores in general, and more special-purpose accelerators (e.g. AMX, QAT), while the Forest parts are focused on maximum compute density (just pack in as many fast-enough cores as you can).
IIRC Granite Rapids is also not _that_ old, and either current or a single generation behind. (Has its successor landed yet? IIRC GNR is the same generation as Sierra Forest).
Very cool! I am wondering one thing: how fast is it? Much of the "secret sauce" of the Voodoo is its high speed: a first-gen Verite or (God forbid) any ViRGE takes many more cycles for common operations like, say, Z-buffered pixels.
I'm guessing this isn't fully cycle-accurate, but is it at least somewhat "IPC-accurate"? I'm guessing yes? But much of that was also derived from Voodoo's (for the time) crazy high memory bandwidth AFAIK.
The Voodoo was fast but also expensive, and you needed an additional VGA card. I think it was around USD 300 back then; that's more than USD 600 today, and you'd still need another card.
GPT-OSS is tailored to be extremely memory efficient. Not only does it natively use the MXFP4 format at 4.25 bits per weight, but it also uses sliding-window attention for half of its layers. It also doesn't have that many layers: only 36 for the 120B version and 24 for the 20B version. (The 120B is also much, much sparser than the 20B.)
I found a Reddit comment claiming only 36 KiB per token. With that, half a million tokens fit in 18 GB, which is less than a single GPU's memory. And three GPUs fit the parameters with room to spare (64 out of 72 GB).
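The arithmetic checks out, taking the 36 KiB/token figure at face value (it's an unverified Reddit claim, so treat it as an assumption):

```python
# KV-cache footprint for a long context, assuming 36 KiB per token:
kib_per_token = 36
tokens = 500_000

kv_gib = kib_per_token * tokens / (1024 ** 2)  # KiB -> GiB
print(round(kv_gib, 1))  # 17.2 GiB, i.e. just under 18 GB
```

So a single 24 GB card could hold the cache for half a million tokens with headroom.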
If generating synthetic data is such a great way to improve performance, why would it not be applied to the slowrun? Especially for the unlimited compute track, you should have plenty of time to generate as much synthetic data as your heart desires.
Intuitively, I would expect the synthetic data to mostly just "regurgitate" the existing data and not add much. But I could be wrong of course, and perhaps doing reinforcement learning somewhere could solve that issue as well (though I don't know if there is much hidden in FineWeb that you could RL on; at best you could probably do self-verification?).
Interesting; I was not aware of those "universal synthetics" but they make sense: a stronger reasoning base would make modeling tasks easier. Thanks for the link!
Again, though, if those work I assume they will be used for the slowrun. Surely a few hundred LoC to generate data would not be considered cheating :)
I like the idea of LLM-calling as an automation-friendly CLI tool! However, putting all my agents in ~/.config feels antithetical to this. My Bash scripts do not live there either, but rather in a separate script collection, or preferably, at their place of use (e.g. in a repo).
For example, let's say I want to add commit message generation (which I don't think is a great use of LLMs, but it is a practical example) to a repo. I would add the appropriate hook under `.git/hooks`, but I would also want the agent with its instructions to live inside the repo (perhaps in an `axe` or `agents` directory).
Can Axe load agents from the current folder? Or can that be added?
Interesting read! One remark though: I'm not too familiar with the architecture of a Google TPU, but comparing the TPU's VMEM with Nvidia's shared memory feels wrong to me.
Looking at the size, and its shared nature, it feels far more natural to compare it with the L2 cache, which is also shared across the entire GPU and is in the same order of size (40 MB on the listed A100).
The reason for that is that most memory-bandwidth bumps come with new memory generations. For example, an early DDR4 platform (e.g. Intel Skylake/Core iX-6000) and a late one (e.g. AMD Zen 3/Ryzen 5000) also typically differ by only about 1.5x.
The same trend is visible in GPUs: for example, my RTX 2070 (GDDR6) has the same memory bandwidth as a 3070 and only a little bit less than a 4070 (GDDR6X). However, a 5070 does get significantly more bandwidth due to the jump to GDDR7. Lower-end cards like the 4060 even stuck to GDDR6, which gave them a bandwidth deficit compared to a 3060 due to the narrower memory buses on the 40 series.
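The GPU comparison above follows directly from bus width times per-pin data rate. The spec figures below are the commonly listed spec-sheet numbers (assumed here, worth double-checking):

```python
def gddr_bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: bus width in bytes times per-pin rate."""
    return bus_bits / 8 * gbps_per_pin

# (card, bus width in bits, per-pin rate in Gbps) -- assumed spec-sheet values
cards = [
    ("RTX 2070 (GDDR6)",  256, 14),
    ("RTX 3070 (GDDR6)",  256, 14),
    ("RTX 4070 (GDDR6X)", 192, 21),
    ("RTX 5070 (GDDR7)",  192, 28),
]
for name, bus, rate in cards:
    print(f"{name}: {gddr_bandwidth_gbs(bus, rate):.0f} GB/s")
```

Note how the 40- and 50-series numbers come from a narrower 192-bit bus: it takes the GDDR6X and GDDR7 per-pin jumps just to first match and then clearly beat the older 256-bit cards.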
It's not just Qwen; we also recently had GLM-4.7-Flash in the same roughly 30B-A3 range. Seems to me like there's no shortage of competition for good old GPT-OSS 20B (not just Qwen3.5-35B and GLM-4.7-Flash, but also Qwen3(-Coder)-30B or Granite 4 Small).