Lisa Su saved AMD – Now she wants Nvidia's AI crown

makomk · on June 2, 2023

The trouble is that AMD just didn't take AI seriously. For a long time, their equivalent of CUDA was not only Linux only and a pain to use but was outright broken on all consumer cards - as in, they dropped official support for the only consumer cards it officially ran on, promptly broke it so that machine learning runs failed, and dismissed bug reports from the users who were left high and dry because their cards were no longer officially supported. The only way to use AMD for machine learning was to pay out much more than the price of NVidia's consumer cards for server-focused AMD cards that worked worse, were harder to use, and that AMD didn't support for long either. They just never had the small-scale desktop usage that lead to NVidia's cards being the choice for bigger machine learning once scaled up because it simply didn't work.

amunicio · on June 2, 2023

> The trouble is that AMD just didn't take AI seriously.

Until a couple of years ago, AMD was in survival mode, fighting Intel on one side and Nvidia on the other. Two rivals that were making money hand over fist while AMD was bleeding money.

AMD picked open standards and made investments on open source frameworks and libraries commensurate with their financials, the hope being that the community could help pick up some of the slack. The community, understandably, went with the proprietary solution that worked well at the time and had resources behind.

The net results is that the Nvidia ecosystem has gained a dominant position in the industry and benefits from being perceived as a quasi-standard. On the other hand, open source efforts by AMD or others get viewed as "not serious".

The financial situation of AMD has improved somewhat over the last couple years. So AMD is "taking AI more seriously now". But it might be too late and the proprietary ecosystem has probably won.

parker_mountain · on June 2, 2023

For what it's worth, AMD is also incredibly proprietary. The drivers being open source really helps with compatibility and your kernel, but you're still interacting with a massive computer running it's own OS with its own trusted code solution. And that computer also has DMA to your computer.

I would consider their open efforts to be "not serious" for anyone but the consumer space - games, desktop users, maybe even professional text editors. If you're using the GPUs for "professional" applications in a one-off scenario, even AMD falls short.

I'm honestly not sure what the moral of this story is.

throwawaymaths · on June 2, 2023

The moral of the story is that Nvidia invested a lot more in low level software developers for their GPU solutions and AMD did not, and it shows.

"Open source" by itself is not a magic dust you can sprinkle on your projfcts that will make your software work well.

williamDafoe · on June 2, 2023

AMD's focus was always on pure compute power at a good price. And they always beat NVidia at that game. AMD cards always had the highest hash rate per dollar in crypto mining. AMD has 100% of the console market and the fastest iGPUs by 2x over Intel.

NVidia decided to use gimmicks to sell their cards including texture compression, lighting tricks, improved antique video encoders, motion smoothing, bad proprietary variable refresh rate, ray tracing, cuda and now machine learning features.

Nvidia is fortunate that machine learning has taken off. That is masking AMD winning market share from weak overpriced NVidia 3D products!

Miraste · on June 2, 2023

You're mashing together a lot under "gimmicks" there.

Texture compression: Useful for games, ongoing work, although I wish they would make cards with appropriate amounts of VRAM

Lighting tricks: Not sure what this is referencing

Improved antique video encoders: NVENC started out with only h.264, but now it supports h.265 and AV1, which aren't antique at all. Niche, but widely used in the streaming industry.

Motion smoothing: The hardware optical flow accelerators in newer cards are important for DLSS, which is a bit gimmicky but works mostly as advertised.

Bad proprietary vrr: No argument here, gsync sucked.

Ray tracing: All 3d games are going to be ray traced sooner or later. Getting a head start on it is a good move, and it's a big head start. The 4090 is ~100% faster than the 7900xtx.

CUDA: No one can seriously call CUDA a gimmick.

Machine learning features: Tensor cores are great.

DiabloD3 · on June 3, 2023

CUDA is a gimmick though.

CUDA isn't a "technology", its a shader language that has been supplanted by better industry-wide standards.... the same standards whose shader languages are compiled by the same Nvidia shader compiler.

CUDA is a moat whose muddy waters has long since ran dry, and you're drinking koolaid if you think its still relevant for greenfield projects.

kcb · on June 3, 2023

> and you're drinking koolaid if you think its still relevant for greenfield projects.

So I want to start a new GPU compute project today. Obviously this will primarily be deployed to AWS/Azure/etc, which means only high-end GPUs available are Nvidia. What do you recommend developing this application with?

The way I see it, you would have to be drinking koolaid to use anything besides CUDA.

DiabloD3 · on June 3, 2023

Why do you think I can't use standard APIs on Nvidia? I literally just said same compiler does both; Nvidia sits on the Khronos committee! They co-wrote the API that everyone uses, that their compiler also speaks!

Miraste · on June 3, 2023

Vulkan Compute is not an alternative to CUDA. There's a reason PyTorch doesn't provide Vulkan in their official binaries. It's in the source, though--build it yourself, try running any recent ML project, and see what happens.

DiabloD3 · on June 8, 2023

Thats a weird strawman; compute in Vulkan is a replacement compute in OpenGL and legacy D3D, and as a twin sibling to compute in D3D12.

OpenCL is the actual intended replacement for all the pre-standard APIs, and has achieved its goals. If you want SPIR-V IR, OpenCL allows this and all the major vendor impls support it.

CUDA has no equivalent for SPIR-V, and never will. Nvidia's own internal IR is not, and never will be, documented nor stable across driver versions. This is a massive downside for ML middlewares, as they have no way of directly emitting optimal code that cannot easily be represented in the HLSL-flavored syntax in CUDA.

Our_Benefactors · on June 3, 2023

There exists no AMD alternative to CUDA. How is this a “gimmick”?

ynx · on June 3, 2023

> CUDA isn't a "technology", its a shader language that has been supplanted by better industry-wide standards

As someone who uses industry-wide standards in a related field...

The proprietary implementation often has the benefit of several more years of iteration with real products than the open standards. 'Supplanted' can only really be evaluated in terms of popularity, not newness or features, because features on paper aren't features in practice until they pay for their migration cost.

boppo1 · on June 3, 2023

>a shader language that has been supplanted by better industry-wide standards....

Are you talking about Vulkan? If so, I'm not sure 'supplanted' is the right word.

DiabloD3 · on June 3, 2023

OpenCL.

waboremo · on June 2, 2023

That's a wild perspective. I don't know how you can really come to that conclusion either. One attempt at getting Blender to render something using an AMD vs Nvidia card will paint a very very clear picture.

dotnet00 · on June 2, 2023

Calling features which are integral to all modern games and most of which also got adopted by other vendors 'gimmicks' is kind of ridiculous.

throwawaymaths · on June 3, 2023

You're entitled to your opinion (which I agree with in broad strokes) but with respect, the op article is specifically about ML. Calling cuda a "gimmick" is silly and completely underestimating the datacenter/ML cluster market share (it dwarfs consumer GPU), and fact of the matter is AMD's CUDA equivalent segfaults. So if "being actually usable to the biggest market" is a gimmick, so be it.

merb · on June 3, 2023

> AMD has 100% of the console market

thats not true, since the switch is based on an nvidia platform. since it's still 1/3 of the market, it's not as bad as it used to be.

PHGamer · on June 3, 2023

and yet amd lately has been quietly just been slightly less than nvidia but worse product. amd sucks thats just it. their market share is crumbling and nvidias is getting stronger because people are like fuck it, at that price might as well jsut buy the better one that Just works TM

newjersey · on June 2, 2023

I personally don't have any insider information but just wanted to add what your saying fits with the meta on the gaming community side where commentators are frustrated that nVidia has so much hubris that they think they can just sell essentially last generation level technology without the step up (I think it was 3xxx vs 4xxx or something like that where you'd expect the 4060Ti to be at least as good as 3070Ti) and just trying to make up for it in "software".

It probably takes a lot of confidence in your software developers to make this kind of decisions.

gary_0 · on June 2, 2023

A company that goes open source might get the icing for free, but they still have to bake the cake themselves.

beebeepka · on June 2, 2023

Are they "incredibly proprietary" compared to the competition? Clearly they aren't. Nvidia offers blobs in both consumer and professional markets. Even going to the extent of gimping performance hardware through drivers on more than one occasion.

That said, I think AMD isn't really competing with Nvidia. Sure, their R&D budget is smallish but it feels like they're somewhat fine with the current status quo.

paulmd · on June 2, 2023

> Nvidia offers blobs in both consumer and professional markets

So does AMD.

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/lin...

And while they have an open version of the userland, it's also missing features compared to the proprietary one, etc.

Besides, in the end it truly hardly matters whether the firmware is loaded at runtime or lives in updateable flash. It's still not "your PC" in the Stallman sense either way, it's been tivoized regardless of whether firmware is injected at runtime or during assembly. You cannot load unsigned firmware on AMD anymore either, firmware signing started with Vega (iirc) and checksums now cover almost all of the card configuration similar to NVIDIA.

Firmware is also the only way to get proper HDMI support... which is why AMD still does not support HDMI 2.1 on linux. HDMI Forum will not license the spec openly and implementations must contain blobs or omit those features.

https://gitlab.freedesktop.org/drm/amd/-/issues/1417

beebeepka · on June 2, 2023

Hey, I am not white knighting for AMD here. For all we know, they could only have been pursuing open standards because they've been forced to, as the underdog.

Can we really assign blame to them specifically for not fighting the hdmi forum on our behalf?

m463 · on June 2, 2023

Isn't this sort of how specialized hardware kind of works?

At some point, hardware (necessarily?) evolves to become optimized to do one thing, and then you have to just treat the driver as an API to the hardware.

Even "simple" things like keyboards and mice are now small computers that run their own code, moreso more complex devices like sound cards and hard drives.

And since graphics card performance seems to be the bottleneck in a lot of computing, it has become super specialized and you just hand off a high-level chunk of data and it does magic in parallel with fast memory and spits it out the hdmi cable.

AnotherGoodName · on June 2, 2023

For the keyboard/mouse now being small computers that's been true since the 1970s. Almost all keyboards for a period of about 30 years had an 8048 or 8051 CPU. It's how they serialized the keystrokes. From the model M keyboard through to everything up till the USB era.

kevin_thibedeau · on June 3, 2023

In the 70s it would be more common to have an MSI part that ran matrix scanning and spit out parallel bus ASCII. UCs were still spendy.

delfinom · on June 2, 2023

>own OS with its own trusted code solution

AMD is working on moving to things like the open source form of AGESA. They plan to start deploying openSIL by 2026.

amunicio · on June 2, 2023

> I'm honestly not sure what the moral of this story is.

That people will go with what is easier and works?

That open source and open standards don't win by default? That it takes a lot of persistence and effort.

JonChesterfield · on June 2, 2023

What OS do you mean? The closest thing I can think of is the embedded CPU that gets called CP in the ISA docs, which mostly schedules work onto the compute units. That has firmware which is probably annoying to disassemble, but it's hard to imagine it doing anything particularly interesting.

wmf · on June 2, 2023

The moral is that PSP FUD has nothing to do with AMD's lack of success in AI.

cypress66 · on June 2, 2023

Nah. AMD was already profitable in 2018. This is just big mismanagement.

Just having 30 extra good software engineers focusing on AI would have made such a massive difference, because it's so bad that there's a lot of low hanging fruit.

As someone who was pretty invested in AMD stock since 2018, it always made me pretty angry how bad they managed the AI side. Had they done it well, just from the current AI hype the stock would probably be worth 50 bucks more.

metaphor · on June 3, 2023

> Nah. AMD was already profitable in 2018. This is just big mismanagement.

Hindsight bias much?

How easily we forget in today's speculative AI bubble that AMD rolled into 2018[1] with 6.1x levered D/E and substantial business uncertainty while the Fed was actively ratcheting interest rates up, and ended the fiscal year still 3.3x levered despite turning operationally profitable[2].

> Had they done it well, just from the current AI hype the stock would probably be worth 50 bucks more.

It strikes me as pretty audacious and quite unconscionable to assert "big mismanagement" while simultaneously crying about speculative short-term profit taking opportunities.

[1] https://www.sec.gov/Archives/edgar/data/2488/000000248818000...

[2] https://www.sec.gov/Archives/edgar/data/2488/000000248819000...

xuki · on June 3, 2023

Hey the stock only 10x since 2018, we could do better, couldn’t we?

cypress66 · on June 3, 2023

Not really hindsight bias.

As someone who had like 25% of their portfolio in AMD, it was pretty infuriating being forced to buy Nvidia GPUs every single time because the AMD ones were literally useless to me (lack of AI support and cuda in general).

Yes, there's AI hype right now. But Nvidia gpu datacenter growth isn't new. And AMd were asleep

roenxi · on June 3, 2023

Not asleep; they just directed their efforts at things that haven't worked out. With their APU lines it looked like they wanted to integrate GPUs completely into the CPU - that was hardly asleep to the importance of GPU compute.

The problem they ran in to looks to me to be that they focused on targeting a cost-effective low end market and were caught off-guard by how machine learning workloads work in practice - huge burst of compute to train, then much lower requirements to do inference. That isn't something they were strategically prepared for and that isn't something that software industry has seen before either.

Won't save them from market forces, but their choices to date have been reasonable.

metaphor · on June 3, 2023

Look long and hard at AMD's financials circa 2015[1]...for the sake of anticipated TL;DR, here are a few summary highlights:

  - -27.5% YoY revenue decline
  - -6.3% YoY gross margin decline
  - -$481M operating loss
  - $230M short-term debt
  - $388M non-cancelable operating lease commitments
  - $538M unconditional purchase commitments
  - $2.032B long-term debt (!)
  - -$412M stockholders' deficit (!!)

Seriously, look long and hard at those numbers, and when you think you understand what they might mean, consider them again and again until the feeling of insurmountable adversity sinks in and you're on your knees begging public equity markets for an ounce of capital and a pinch of courtesy faith...on the promise of meaningful risk-adjusted ROIC to be delivered in just a few years.

> But Nvidia gpu datacenter growth isn't new. And AMd were asleep

...which is why this remark comes off as sheer arrogance (no disrespect).

Su and the rest of AMD leadership certainly weren't asleep. The difference here is while you're busy scouting speculative waters defended by competition with deep battle pockets and an even deeper technical moat, Su was simply preoccupied bringing a zombie company back to life and building up enough health to slay a weaker giant.

Personally, I was already beyond impressed with one miracle delivered.

[1] https://www.sec.gov/Archives/edgar/data/2488/000000248816000...

ksec · on June 3, 2023

>As someone who had like 25% of their portfolio in AMD

>Nah. AMD was already profitable in 2018. This is just big mismanagement.

I guess you know they have debt, and they were paying them off, and were battling with other issues all the way till 2019 / 2020 when Intel had their misstep so they could gain something in the CPU server market?

cypress66 · on June 3, 2023

Yes. And they still could have afforded 30 software engineers to work on ai/compute painpoints.

But let's asume they thought it was too expensive back then. There's still no reason not to invest in software in 2020 when their gross margin was absurd.

MontyCarloHall · on June 2, 2023

Yup. Have 15 of those software engineers contribute pull requests to PyTorch to make its OpenCL support on par with CUDA and take the other 15 engineers to do the same for TensorFlow and AMD would already be a serious contender in the AI space.

cma · on June 3, 2023

Does AMD have tensor cores?

alfalfasprout · on June 2, 2023

I'm not so sure anymore. The big reason is that now that the ML framework ecosystem has fragmented into different "layers" of the stack, very few people are directly writing CUDA kernels anymore.

As a result, with things like XLA now supporting AMD GPUs using RoCM under the hood the feature gap has closed A LOT.

Sure, Nvidia still has the performance crown lead with CuDNN, NCCL, and other libraries providing major boosts. But AMD is starting to catch up quite fast.

cameronh90 · on June 3, 2023

> it might be too late and the proprietary ecosystem has probably won.

Compiler ecosystems can and have changed rather quickly. Especially given that most NNs run on a handful of frameworks. Not _that_ many people are writing directly on top of CUDA/cuDNN.

Make an equivalent toolchain that runs on cheaper hardware and the migration would be swift.

Currently AMD hardware is a bit behind and the toolchain is frustratingly buggy, but it's probably not as big of a moat as NVIDIA are trading on. Especially since NV's toolchain isn't particularly polished either.

dotnet00 · on June 2, 2023

>AMD picked open standards and made investments on open source frameworks and libraries commensurate with their financials, the hope being that the community could help pick up some of the slack.

This has been their claim, but more often than not they haven't actually done anything to encourage the community to pick up slack. So many of their graphics tools have been released with promises of some sort of support or of working with the community yet have basically had nothing to help the community help them.

Even accepting the unreasonable idea that they can't afford the full-time developers for the various tools and libraries they come up with, they often don't even really work with the community to build and maintain those.

One of the bigger cases which contributed to turning me off from AMD GPUs was buying a 5700XT at launch, eager to work on stuff using AMD specific features, only to be led on for over a year about how ROCm support was coming soon, every few months they'd push back the date further until they eventually just stopped responding at all. Trying to develop on their OpenGL drivers was a similar nightmare as soon as you wandered off the old well worn paths to more modern pipeline designs.

Another glaring example would be Blender's OpenCL version of Cycles, which was always marred with problems and hacks to work around driver issues. They tried to work with AMD for years before finally just dropping it and going for CUDA (and thus HIP) even though AMD's HIP support, especially on Windows, is still in a very early state.

Dylan16807 · on June 2, 2023

They've been getting piles of money from Ryzen for 5-6 years now. How long am I supposed to wait?

According to the latest ROCm release notes, it supports Navi 21. Well, at least the pro models. It doesn't even mention the 5000 or 7000 cards. My current understanding is that 7000 support is mostly there a few months late and 5000 was abandoned partway done after years of vague promises.

At least it might support windows soon. Not my sub-4-year-old GPU, of course, god forbid. But most of the rest of them.

williamDafoe · on June 2, 2023

AMD wasn't very profitable until 2018. The company's debt to equity ratio was terrible (due to previous CEO mistakes 2000-2012) until they paid off their huge debts with Ryzen 3 in ~2020. Be patient, grasshopper ..

https://www.google.com/search?q=amd%20debt%20to%20equity%20r...

BeetleB · on June 2, 2023

> They've been getting piles of money from Ryzen for 5-6 years now

Hardware is very capital intensive. They've not been making much until much more recent. From 2012 through 2017, almost all years were a net loss. They hit $1B net profit only in 2020. I imagine quite a bit of that money went into keeping/accelerating the pace of Ryzen, and paying off debts. Only now do they have more breathing room for other endeavors. If they diverted a chunk of that change to AI, they probably would have a lower performing Ryzen right now.

So no, they did not have piles of money.

ChuckNorris89 · on June 2, 2023

>their equivalent of CUDA

Nvidia didn't pull their AI leadership out of thin air overnight, but they shipped the first CUDA capable consumer cards with the GeForce 8000 series way back in 2007 and committed to this ecosystem over the years, consistently investing in the HW, SW.

By the time AMD woke up and shipped ROCm in 2016, Nvidia already had nearly 10 years head start and a cemented moat in this field. AMD now has a huge mountain to climb to catch up to Nvidia.

bryanlarsen · on June 2, 2023

AMD invested significantly into OpenCL prior to 2016. It seemed like a safe bet -- the open industry standard usually ends up beating the proprietary standard in the long run.

Especially for something like this, with massive open source / open standard companies like Google as heavy users. It seems surprising to me that Google didn't ensure that open standards won in an area that they are so heavily dependent on.

belval · on June 2, 2023

If you ever get the chance to write something in OpenCL and then in CUDA I promise you will understand immediately why Google didn't push for it.

There is a lot more boilerplate to it, read/write to the buffer, queueing being handled explicitly. Here's an example that illustrate what I mean: https://github.com/rsnemmen/OpenCL-examples/blob/master/mand...

For comparison here is an implementation in CUDA: http://selkie.macalester.edu/csinparallel/modules/CUDAArchit...

Notice how the CUDA code is more readable.

SuchAnonMuchWow · on June 2, 2023

In the link provided, the CUDA example only show the compute kernel itself and not the boilerplate required to run it. On the other hand, your OpenCL example only show the boilerplate.

This is the OpenCL kernel from the same repo, for a more fair comparison: https://github.com/rsnemmen/OpenCL-examples/blob/master/mand...

This is much more readable. OpenCL-C the language is fine: it's how you deploy the program on the cards that is complicated with opencl.

bryanlarsen · on June 2, 2023

Google is part of the Khronos group. They were well positioned to steer the standard towards one that doesn't suck. Or they could have championed a different standard. Google has the scale that only Google is to blame that they are still heavily dependent on a closed standard.

Open Standards almost always beat closed ones. IMO AMD was right to bet on open standards. They lost the bet but I think it was the right bet.

barbariangrunge · on June 2, 2023

> the Khronos group

I still wake up at night, sweating, thinking about my time learning OpenGL in university. The docs were a nightmare back then

shaklee3 · on June 3, 2023

openCL != openGL

ChuckNorris89 · on June 2, 2023

Because all the researchers that used GPUs for CV and ML used what they had at their disposal, which was Nvidia GPUs and CUDA.

OpenCL brought no advantage here considering it only worked on AMD GPUs which were lackluster in performance and switching from CUDA to OpenCL meant extra work that researches already iterating on CUDA weren't willing to do.

DeepYogurt · on June 2, 2023

OpenCL did (and still does I think) work on nvidia cards. People I talked to back in the day complained more about OpenCL being "C but on GPUs" while cuda was more akin to C++. They could move faster and do more in cuda and the nvidia lock in didn't matter as the fastest cards of the day were nvidia. I think vega cards were faster (or faster per dollar maybe) for some of the code that was relevant, but not by much and by that point legacy code lock in had taken over.

cibyr · on June 2, 2023

OpenCL works fine on Nvidia GPUs. It's more annoying to code for, and harder to get good performance out of than CUDA, but it works.

dragontamer · on June 2, 2023

Poorly, in my experience.

CUDA is compiled into PTX, an intermediate language. PTX is then compiled into a specific NVidia assembly language (often called SASS, though each SASS for each generation of cards is different). This way, NVidia can make huge changes to the underlying assembly code from generation-to-generation, but still have portability.

OpenCL, especially OpenCL 1.2, (which is the version of OpenCL that works on the widest set of cards), does not have an intermediate language. SPIR is an OpenCL2.+ concept.

This means that OpenCL 1.2 code is distributed in source and recompiled in practice. But that means that compiler errors can kill your code before it even runs. This is especially annoying because the OpenCL 1.2 compiler is part of a device-driver. Meaning if the end-user updates the device driver, the compiler may have a new bug (or old bug), that changes the behavior of your code.

-------------

This doesn't matter for DirectX, because like CUDA, Microsoft compiles DirectX into DXIR / DirectX intermediate language. And then has device drivers compile the intermediate-language into the final assembly code on a per-device basis.

-------------

It is this intermediate layer that AMD is missing, and IMO is the key to their problems in practice.

SPIR (OpenCL's standard intermediate layer) has spotty support across cards. I'm guessing NVidia knows that PTX intermediate language is their golden goose and doesn't want to offer good SPIR support. Microsoft probably prefers people to use DirectX / DXIR as well. So that leaves AMD and Intel as the only groups who could possibly push SPIR and align together. SPIR is a good idea, but I'm not sure if the politics will allow it to happen.

JonChesterfield · on June 2, 2023

It's really difficult to tell whether the PTX layer approach is something AMD _should_ adopt. That's roughly what the (I think now abandoned) HSAIL thing was.

It's one where packaging concerns and compiler dev concerns are probably in tension. Compiling for N different GPUs is really annoying for library distribution and probably a factor in the shortish list of officially supported ROCm cards.

However translating between IRs is usually lossy so LLVM to PTX to SASS makes me nervous as a pipeline. Intel are doing LLVM to SPIRV to LLVM to machine code which can't be ideal. Maybe that's a workaround for LLVM's IR being unstable, but equally stability in IR comes at a development cost.

I think amdgpu should use a single llvm IR representation for multiple hardware revisions and specialise in the backend. That doesn't solve binary stability hazards but would take the edge off the packaging challenge. That seems to be most of the win spirv markets at much lower engineering cost.

KeplerBoy · on June 2, 2023

OpenCL also gets compiled to PTX on Nvidia GPUs.

dragontamer · on June 2, 2023

But as an OpenCL programmer, you don't distribute PTX intermediate code. You distribute OpenCL kernels around and recompile every time. That's more or less the practice.

KeplerBoy · on June 2, 2023

True.

And the resulting PTX is worse when it's generated from OpenCL C instead of CUDA C. I tested that recently with a toy FFT kernel and the CUDA pipeline produced a lot more efficient FMA instructions.

Keyframe · on June 2, 2023

Nvidia took a big gamble with CUDA and it took years and ton of investment to get there. Jensen Huang talks about it on commencement speech he did in Taiwan recently here https://www.youtube.com/watch?v=oi89u6q0_AY It's a big moat to cross.

mahkeiro · on June 2, 2023

I always had hope that ROCm will be able to compete with CUDA but it’s nowhere here despite the time. Seems funny to see that Intel is doing a better job at that with OneAPI.

osti · on June 2, 2023

This is my thought as well. Their devs who work on the graphics drivers are heavily underpaid in Canada. As an ex-AMD, I took AMD's offer for 20k less than another software company because I like low level stuff and have always been an AMD fan since the Athlon days. But when Amazon offered me double, I easily took Amazon's offer and left AMD after less than a year.

I think a lot of companies are overpaying their software people, but if there is any one that should pay their devs much more it would be AMD, because they are in a position to compete against Nvidia if their software integrated well with the AI training stuff.

jrockway · on June 2, 2023

Why do you think that people are being overpaid if you literally left a job you liked to make more money? Being underpaid is just being underpaid; just because the numbers are high at the big tech companies doesn't mean that those companies are overpaying. They're probably underpaying! The dollar just isn't worth what it used to be.

(Canada does chronically underpay its software engineers, though.)

osti · on June 2, 2023

Not going to argue whether we are being overpaid or not (I certainly hope we aren't because now I'm getting even crazier compensation than my Amazon days lol). But I think the current layoffs which are putting downward pressures on salaries will prove that we were getting overpaid.

My main point was still that AMD should really pay a lot more than what they currently are paying, they actually already increased it quite a bit compared to 3 years ago, but not nearly enough! In a way, I think this reflects poorly on Lisa Su because she didn't invest enough into AI while it should have been obvious from the start.

superkuh · on June 2, 2023

The AMD RX 580 was released in April 2018. AMD had already dropped ROCm/HIP support for it by 2021. They only supported the card for 3 years. 3 years. It's so lame it's bordering on fraudulent, even if not legally fraud.

I know CUDA via their HIP is a moving target they don't control making it hard and expensive for them to prevent bit rot but this is still an AMD caused problem due to opencl not getting any love by anyone anymore. AMD included.

paulmd · on June 2, 2023

Also, while AMD's OpenCL implementation has more features on paper, the runtime is frequently broken where NVIDIA's claimed features actually all work. Everything I've heard from people who've used it is that they ended up with so much vendor-specific code to patch around AMD's bugs and deficiencies that they might as well have just written CUDA in the first place.

This is an old article but the old "vendor B" stuff still rings incredibly true with at least AMD's OpenCL stack as well.

https://richg42.blogspot.com/2014/05/the-truth-on-opengl-dri...

Thus NVIDIA actually has even less of a lock-in than people think. If you want to write a better OneAPI ecosystem and run it on OpenCL runtime... go hog wild! NVIDIA is best at that too! You just don't get the benefit of NVIDIA's engineers writing libraries for you.

paulmd · on June 5, 2023

btw as far as anyone insisting "ROCm is good now"... see a practical example in action.

https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...

JonChesterfield · on June 2, 2023

I think Intel is still pushing opencl on GPUs. Maybe with other layers on top. Sycl or oneapi or similar. AMD mostly shares one implementation between hip and opencl so the base plumbing should work about as well on either, though I can believe the user experience is challenging.

I wrote some code that compiles as opencl and found it an intensely annoying experience. There's some C++ extension model to it now which might help but it was still missing function pointers last time I looked. My lasting impression was that I didn't want to write opencl code again.

JonChesterfield · on June 2, 2023

You dont have to change to a new toolchain. If the 2021 version was working with your card, keep using it.

sangnoir · on June 2, 2023

Sticking to an old kernel and libraries is a pain id your hardware is not purpose-specific. Newer downstream dependencies change and become incompatible: e.g. Tensorflow 2 (IIRC) was incompatible with the ROCm versions that work with the 580. New models on places like HuggingFace tend to work with recent libraries, so not changing to a new toolchain locks you in to SoTA a few years in the past. In my case, thr benchmarking I did for my workloads showed comparable perf between my RX580 and Google Colab. So I chose to upgrade my kerbel and break ROCm

JonChesterfield · on June 2, 2023

Yeah, that's fair. Staying in the past doesn't work forever.

There are scars in the implementation which suggest the HSA model was really difficult to implement on the hardware at the time.

It doesn't look like old hardware gets explicitly disabled, the code that runs them is still there. However writing new things that only work on newer hardware seems likely, as does prioritising testing on the current gen. So in practice the older stuff is likely to rot unless someone takes an interest in fixing it.

emmender · on June 2, 2023

the range of technology that needs to come together for ai training is underestimated. there is cuda of course, but there is also nccl, infiniband, gpudirect, each of which requires years of sw and hw maturity. unlike the cpu which has a clean interface (instruction set) the gpu has no such thing - it is more like an octopus with tentacles into networking, compute, storage etc.

amelius · on June 2, 2023

> The trouble is that AMD just didn't take AI seriously.

No worries, AI is not very complicated tech. It's just a core that can do arithmetic (something AMD already knows how to do very well) copied a very large number of times on a chip, plus some interconnect.

CPUs with all their speculative execution and random memory access patterns are much more complicated.

dotnet00 · on June 2, 2023

AI is more than just the underlying math. The software ecosystem is very important, which is what NVIDIA's lead is built on. AMD has a very hard time providing an "it just works" type experience in the way that NVIDIA offers these days.

Machine learning engineers (or most people writing GPU code) do not typically have the time, knowledge or interest to diagnose driver issues and beg AMD engineers to address them in a reasonable time frame.

croes · on June 2, 2023

AI is only one use case of CUDA and at first it wasn't the main use case.

starkd · on June 2, 2023

That would explain my Radeon Graphics card I never managed to get working properly. It arbitrarily froze. I was told that it did that for Linux and that it was guaranteed to work on Windows. But when I tried it on Windows, it did the exact same thing. They were unresponsive.

tracker1 · on June 2, 2023

I think that AMD needs to really push ahead on two fronts. The first being price/performance. They need to do much more than just being a few percent ahead of NVidia on price. They need moderate cards that have 48gb vram at under $2k that are competitive to the 4090. That's only half the battle, because said cards need to compete with top NV cards for gaming, just so that people will buy them for play and stay to dev with.

The other front is developer experience and tooling, NVidia is way ahead on this front and entrenched. They need cleaner integrations and abstractions for OpenCL. This should probably include clean support for Python tooling as well as for Rust targets. The former being massive for education and common use space and the latter being for those that want to eek out performance without necessarily using C. Both of which will mean more community involvement and investment that lasts longer than AMD is typically known for.

If AMD targets mainly support for Linux tooling, then is should/must also support WSL for windows users. No idea where Mac is headed in terms of expansion boards on M2 or future gen. But they definitely need to expand the user base with good, relatively cheap higher end cards as well as devex.

Edit: the top end mentioned at 48gb is just for top consumer comparison... I think good tooling for 16-24gb cards in the $500-1200 space that is gaming competitive and can handle AI experimentation and workstation workloads would go a long way as well.

angm128 · on June 2, 2023

Totally agree on that, securing a piece of the AI market will be a huge challenge. No one will buy AMD for AI when the software isn't compatible and no one will buy AMD for AI to get wonky software for the same price as NVIDIA.

Affordable cards with lots of memory and good software support is the only solution to maybe get into the market. Double the memory for some gaming cards (just like NVIDIA's 3060 12GB and the 4060 TI 16GB)

Additionally data center products with enormous vram amounts and very fast interconnects will be important

AMD can't even take orders that NVIDIA can't keep up with because they are producing on very similar nodes. NVIDIA can easily outspend AMD for TSMC production capacity

FuriouslyAdrift · on June 2, 2023

The Instinct line is for AI and the current top of the line has a 128GB of HBM3 ram on die. I am assuming it way north of $2k, though. https://www.tomshardware.com/news/amd-instinct-mi300-data-ce...

tracker1 · on June 2, 2023

Yeah, but my meaning is to offer a good to great value gaming card that can do a job of getting feet wet in the AI side. I don't think the bulk of people dipping their toes into AI on NVidia/Cuda are using their really expensive cards. AMD needs to win over the hobbiest and SOHO workstation types.

The same types that will run a 5950x/7950x for 16-cores without jumping to threadripper or server parts are the same ones that are playing with Cuda on 3080/4080 class hardware. This drives the market in open-source and prosumer into the professional path.

angm128 · on June 3, 2023

MI300 looks very promising, but it doesn't instantly solve the chicken-egg problem. Nvidia has been placing eggs in the form of "cuda on gaming gpus" in the nests of researchers and students for quite some time.

"Nobody ever got fired for buying nvidia" might be the situation in the future if AMD doesn't manage to get a more popular choice among developers

GeekyBear · on June 2, 2023

AMD re-hired Jim Keller in 2012, and his team started development of AMD's Zen cores (that are the foundation of Ryzen/Epyc) before Dr. Su came on board.

Credit for the decision to put Keller back in charge of AMD's CPU core design goes to AMD CTO Mark Papermaster.

I would give the launch of products based on Zen as much credit for AMD's turnaround and present success as I would give Doctor Su.

viewtransform · on June 2, 2023

When Dr Su took over - there was no coherent product roadmap at AMD. There were various headless zombie projects because of a management exodus around 2012.

Dr Su made some decisive calls to stop projects and placed a prescient long term bet on high-performance computing.

She shut down low-power tablet asic designs, shut down the SeaMicro acquisition, shutdown Keller's K12 ARM chip, shutdown a large monolithic CPU/GPU asic with shared memory and planned a shift from Global Foundries to TSMC.

Deep Learning was not on the radar unfortunately. It fell under Raja Koduri's group and he made an unfortunate bet on Virtual Reality (way before Zuckerburg started his metaverse fantasy).

It the end it helped that Intel stumbled badly allowing AMD to recover financially.

pmarcelll · on June 2, 2023

Dr. Su joined AMD in January 2012 as senior vice president, before Jim Keller was re-hired (in August 2012).

tormeh · on June 2, 2023

AMD's AI software efforts are so trash WebGPU will be the first API you can use to run PyTorch on consumer AMD GPUs, not because AMD put some effort in, but because people want to run PyTorch in the browser. AMD could have ported PyTorch to OpenCL. Could have ported it to Vulkan. Instead they've made their own shitty version of CUDA that works on like 3 of their professional cards and nothing else. Maybe I don't understand something. That's the only sane explanation for what I think is completely baffling behavior.

bavell · on June 2, 2023

Uhhh, I've been running stable diffusion on my 6750XT for a few months now... just started playing with chat and text models (oobabooga) too.

Didn't even get the card thinking I could use it with ROCm for hobbyist AI stuff but turns out it works great and having 12GB of VRAM is super nice.

paulmd · on June 2, 2023

> Instead they've made their own shitty version of CUDA that works on like 3 of their professional cards and nothing else. Maybe I don't understand something. That's the only sane explanation for what I think is completely baffling behavior.

AMD is laser-focused on surgically tapping high-margin markets. There is money in HPC, let's support that. There is money in AI, let's support that. Nothing more nothing less.

The way to think about ROCm isn't a platform like CUDA, it's as an embedded processor that gets engineered into some other product. It doesn't matter if there isn't a good general-purpose OS and ecosystem etc for Zilog Z80 - we aren't making a computer, we are making a microwave, it only ever needs to run one specific piece of software (or a small handful). And that's what AMD has become, a processor that goes into someone else's platform rather than a platform in itself. We are building a HPC supercomputer, we are building an AI training platform, and for that specific product AMD might offer the best value for performance. But you're fundamentally someone else's platform.

The ironic thing is that's exactly what everyone is implying that NVIDIA might be doing now with AI, and nothing could be farther from the truth. Abandoning gaming/graphics and focusing AI would be a one-way road to the same situation AMD is in. Once you've abandoned that virtuous cycle it's tremendously hard to get back. NVIDIA has always been laser-focused on making sure that innovation happens on their platform, making sure that prosumers can write CUDA on their gaming hardware and grad students can write their thesis in CUDA and so on. It's not that NVIDIA keeps accidentally falling into success, they're deliberately putting themselves there, and they're not going to stop because of AI or anything else. If they stop and chase AI to the exclusion of graphics, that spigot will dry up, and they will no longer be in the position to catch the next fountain of money when it happens.

Besides, all their other products center around graphics anyway - do you sign a big multiproduct partnership with Mediatek or Nintendo if you don't have a good gaming IP, and DLSS, and wide adoption of that software? Does Blender integrate the next OptiX if none of the userbase can run it? Do you just make quadros and not do the single last step (gaming drivers) and forego that revenue because it's not enough? No, that's crazy.

Platform is hugely important to NVIDIA. It's their core product. Jensen told everyone 15 years ago that NVIDIA was a software company and people scoffed. They're not just a company that writes software, it's almost their primary product really. They write the software that sells the hardware. DLSS and AI and CUDA are their products, and they just sell you the fuel to run their product. Razor-and-blade model in action. "What if they just stopped selling razors" ok then in the long term you won't sell many blades, will you?

They're just not going to do it at zero cost or a loss. And just like the Radeon 7850 has no real successor in the $150 segment, that is creeping higher and higher in the stack as fixed costs overwhelm the progress now that moore's law is dying.

The true threat to NVIDIA's platform is the rising costs of low-end products. Fixed assembly/testing/shipping costs, fixed die area overhead for memory PHYs, increased VRAM needs (without drops in the per-GB cost of VRAM) - are gradually sapping the low-end segment. It is already not possible to make a $100 or $150 GPU that's very compelling, you can equivocate about whether a $200 or $300 GPU could be better for the price but nobody is making a good $100 or $150 GPU that's a worthy enthusiast-tier successor to the Radeon 7850/R7 270 or similar, because it's just not possible, and everyone agrees on that at least. The RX 6500XT and similar products are never going to compel anyone to upgrade, that segment has gone terminal.

And that threshold is creeping higher every time they shrink because people want more memory and GDDR density hasn't kept up and PHYs don't shrink. There is de-facto a "minimum die size that is worth it" because of the fixed size of PHYs (to get the fixed amount of memory people want) and in a world of spiraling cost-per-mm2 that means there is a minimum cost that's worth it, and it's inching higher and higher ever time you shrink. Like try to even imagine what the $200 segment is going to look like with RDNA4 - are they going to launch a 8500XT 16GB on N5P or N3 at $200, just a tiny sliver of compute area sandwiched between PHYs? No, they can't do it either.

Console-style or Apple-style APUs are the way out and that is a market that NVIDIA doesn't control, and that is the primary long-term threat to NVIDIA's platform.

But for now - they are an incredibly powerful accelerator and everyone reaches for their software when they have a hard task. People get super upset when they can't upgrade the platform for NVIDIA at a compelling performance increase every 2 years. Why would you ever give that up willingly? Not even AMD wants to be where AMD is.

kamikaz1k · on June 2, 2023

Well, George has an acquihire exit option.

> The goal of the tiny corp is: “to commoditize the petaflop”

> ... If we succeed at this project, we will be on the cutting edge of non NVIDIA AI compute. We have the ability to make the software, and that’s the hard part.

[1] https://geohot.github.io//blog/jekyll/update/2023/05/24/the-...

paulmd · on June 5, 2023

Upon seeing the absolute state of ROCm he has decided he will not be using AMD hardware after all.

He is now saying that his users would probably want the installer to work and the demo apps to not crash the IOMMU on a supported OS/kernel/hardware configuration? Who knew.

https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...

kmeisthax · on June 2, 2023

Wouldn't being hired by AMD violate geohot's settlement agreement with Sony from a decade and change ago? AFAIK he basically agreed to never touch anything with the words "PlayStation" on it.

klooney · on June 2, 2023

Maybe he could take a golden parachute as a part of the acqui-hire.

alecco · on June 2, 2023

HN thread https://news.ycombinator.com/item?id=36065175

FloatArtifact · on June 2, 2023

Well AMD really needs to work on its software stack to support AI.

moffkalast · on June 2, 2023

If AMD throws twice the VRAM onto their cards they'd be a no brainer for that regardless. OpenCL is pretty well supported now.

paulmd · on June 2, 2023

AMD is trying to get people to pay $5000 for a workstation card too. The days of Radeon VII being $699 are long past.

Part of it is that as you shrink, the PHYs don't shrink much, so everyone is incentivized to minimize the number of memory channels and reduce the PCIe bus size/etc on lower tier products. And in turn, since GDDR6 tops out at 16 Gbit (=2 Gigabyte) per chip, that means a 4-PHY card tops out at 8GB, etc.

And while you can do clamshell... you want to be selling those cards to workstation users, not giving them away to gamers! It is the same problem NVIDIA faces, the fact that GDDR density has not increased leaves them with one single move (clamshell) and they've traditionally reserved that for workstation cards (and 3090) to increase margins.

williamDafoe · on June 2, 2023

Rx 7900xt is $760 and it's an incredible bargain now, IMHO ...

paulmd · on June 2, 2023

Yeah if you don't need CUDA the 7900XT is better right now than people give it credit for. People are super mega butthurt about prices right now, to probably an unreasonable degree, and they're ignoring some of the actually decent options that exist.

4070 at $600 (rip microcenter steam GC deal) is a pretty ok deal too, much better than people give it credit for. For less than a 6950XT you get 4GB less VRAM but it pulls 200W less power which is very noticeable, and gets DLSS2 (even if you don't like DLSS3!) which is significantly better at 1440p and 1080p, which is a big perf and perf/w boost, with better quality than FSR2. Even HUB now likes the 4070 over 6900XT for a generalist kinda build: https://youtu.be/Iy3ikm8MxOM?t=875

Or yes, the 7900XT on top, the launch MSRP sucked but $700-750 is OK for what you get. If you are otherwise getting 6900XT/6950XT I'd probably just recommend spending up and getting the 7900XT (or 4070), they really are a lot more efficient and have better featureset etc. It'll be worth it, suck it up and do it. 6800XT makes sense at like $450, that'd give it space vs the 4070, but people are getting irrational over the whole situation.

3090 is also an underappreciated competitor. What if there was a 3080 Ti 24GB, with DLSS2 and stuff but not $1600? There is, it's $700 on ebay. However, I am uncomfortable with the VRAM on the back with how hot GDDR6X runs, a lot of those cards mined for a lot of years... if you can get 3090 Ti it has only VRAM on the front, or get an evga one, or something. It's ok, at $700 it's a similar proposition to 6900XT/6950XT at $600 but a bit more VRAM and you get DLSS2.

6700/6700XT for $300 is a screamer of a deal and it's not really surprising AMD can't beat it. They're in the "1080 Ti vs 2070" situation, they over-cut on the old stuff and the new stuff can't really edge past it nor is the value great against a deeply cutdown older card (6700XT is a $480 card for $300!). It's not gonna last forever, if that's the featureset you want I'd consider buying. There will probably be a decent replacement eventually but that's clearly the value peak of what they can do with RDNA2 and it makes RDNA3 look poor in many ways. The successor in this price segment is, best case imo, 7600 16GB at $299-329, and that's a bit slower and more limited in a lot of ways, and not really more efficient either. N32 may not really compete favorably with it in either cost or perf/mm2.

4060 Ti is junk but 4060 8GB at $299 for basically 3060 Ti performance (4060 Ti is closer to 3070 at 1080p and 1440p) is reasonable imo, and DLSS will put it clearly over the top of the 7600 at 1080p (FSR sucks). I'm guessing AMD has to get $279 down to $249 or $229 by the time the 4060 launches. I'm guessing 7600 16GB will probably launch at $329 and drop to $299, and that'll be a decent option vs 4060 8GB too (and cost viable). Probably there will be a 7500XT 8GB or 6GB cutdown at $199 (12GB at $250?), doubt they can go too much below that (die cost isn't the problem/cutdowns don't help memory bus size). It'll be a bit slower than a 6700XT for sure, and not all that much more efficient, and you go down to a PCIe x8 bus, but it does have some newer stuff.

I think the 16GB versions of both 4060 and 4060 Ti are DOA, the 4070 is a lot faster and has enough. 6800XT is potentially still compelling in this segment too.

AMD really needs to figure their shit out with the N32 die though. They actually do need a competitor to 4070 besides just RDNA2 rebrands. I think with the unexpectedly (apparently) poor performance of RDNA3 it's just not worth it, like I just haven't heard any rumor mill shit about N32 at all. The MCDs alone would use as much 6nm silicon as a whole 7600 and then you have a chunk of N5P too. With how short wafers are for other products (IO dies, Epyc, N31, etc) it may just not be worth bringing N32 to market. Who knows, but, I'm getting more and more curious.

Numbers: https://www.reddit.com/r/hardware/comments/13vm5ti/geforce_r...

Tim from HUB talks candidly about FSR: https://www.youtube.com/watch?v=ycXkvVfc2yw&t=2541s

(the quality difference at 1080p and 1440p is significant, and that's where $200-300 cards will be running. Even in the 4070 segment... starting to become an issue for 7900XT etc)

Tepix · on June 3, 2023

I wouldn't call the 3090 at $700 underappreciated. Look at any number of AI projects on github: All numbers you will find mention testing/running the models on one or two 3090s. Among people doing AI stuff it's pretty much the standard consumer card.

polski-g · on June 2, 2023

ATI drivers have been terrible for over 20 years.

viewtransform · on June 2, 2023

Have you revisited it recently in the last 2-3 years? OpenGL, Vulkan and DX12 drivers have been completely rewritten from scratch and share a common hardware abstraction layer. They pass all Khronos conformance tests, OEM and ISV certification tests and have similar performance as Nvidia on SPEC benchmarks.

fulafel · on June 2, 2023

They were at some point but they're better than NVidia now for normal desktop etc use at least on Linux. They went open source and upstreamed the drivers, like Intel.

On the GPGPU stack front it may be different but CUDA is also really low level and abstracted away by ML stacks. And some of them also had OpenCL / SYCL backends at one point, I wonder what's the story there.

mikepavone · on June 2, 2023

Unfortunately, the ROCm stack seems to use a fair bit of separate kernel code and it's not nearly as stable in my experience. I have a 5700 XT (not officially supported by ROCm, but close enough to some other cards that some have had success). It's perfectly stable for demanding games, but just straightforward DMA is enough to cause GPU hangs with ROCm for some reason.

bryanlarsen · on June 2, 2023

But on Linux they're superior to NVidia's closed source garbage.

angry_octet · on June 2, 2023

How so? You mean graphics or compute?

The whole concept of the way graphics drivers works is bonkers, thanks to IP issues and game patching, but in production the NVIDIA drivers are very reliable and sophisticated.

bryanlarsen · on June 2, 2023

From my perspective as an admin of several GPU clusters.

angry_octet · on June 3, 2023

You have GPU clusters with AMD devices?

williamDafoe · on June 2, 2023

Obviously you stopped buying AMD products 4Y ago when Raj Koduri left. Without his corrosive effect, many reviewers are saying AMD adrenaline tools are NOW MUCH BETTER than NGreedia ...

dpflan · on June 2, 2023

Indeed, are you aware of the current state of AMD's equivalent to CUDA? And how far behind would be, and what would need to be done to near equivalence? It seems like the opportunity and differentiator. I wonder how Apple silicon software is doing...

llm_nerd · on June 2, 2023

Instead of trying to integrate the whole stack of, say, pytorch, Apple's primary approach has been converting models to work with Apple's stack.

https://github.com/apple/coremltools

Clearly no one is going to be doing training or even fine tuning on Apple hardware at any scale (it competes at the low end, but at scale you invariably will be using nvidia hardware), but once you have a decent model it's a robust way of using it on Apple devices.

coredog64 · on June 2, 2023

I’ve tried a few models and none have worked. It’s not that they need more resources, just that it freezes and then dies with an inscrutable stack trace somewhere in the OS. If someone from AMD sees the parent comment, please don’t copy Apple!

dpflan · on June 2, 2023

Mind sharing which ones and your approach? Haven't there been shared posts of success to follow?

Also, yes, anyone from AMD, I think we all want more variety in the market, what do you need help with?

llm_nerd · on June 2, 2023

Not every model feature and op is supported (though it is growing with every release), but personally I've had surprising success with it. It has allowed me to leverage some models efficiently on both Intel, Apple Silicon and iPhone/iPad devices.

I mean...the number of people using PyTorch models ported to CoreML is probably several magnitude greater than the number of people actually having success with AMD's initiatives in the space thus far...

lvl102 · on June 2, 2023

I don’t think Nvidia competition is coming from AMD. It’s way too late for that to happen. MSFT/GOOG/META will have their hands dirty now and AMD might be left watching. Not to mention what Apple has in store for the next few years. The best thing for AMD would be to partner up with MSFT. CUDA/PyTorch is here to stay for a very long time.

UncleOxidant · on June 2, 2023

Neither AMD or Intel have taken ML seriously enough. GOOG has TPUs but those are pretty much only for Google. Meta may be building their own ML accelerator chips as well, but again, those will likely stay inside of Meta.

> CUDA... is here to stay for a very long time.

Yes, this unfortunately seems to be the case. It would've been great to have more competition in this space. CUDA is closed source which sometimes leads to issues, but it works and is well supported by Nvidia and still has the first-mover advantage. I still have some hope for OneAPI (from Intel) but I'm not holding my breath.

FooBarBizBazz · on June 2, 2023

Buried in the article:

> [Jensen] Huang, who is a distant relative of [Lisa] Su’s [...]

This is an excellent, Dickensian level of coincidence. And they both come on stage in leather motorcycle jackets.

May this rivalry of -- what, cousins nth removed? -- bring us great new chips.

For AMD though, the software side is going to be important. OpenCL was a total bust, Cuda is king. They need to do something about that.

aidenn0 · on June 2, 2023

If this comment[1] is correct, they would be first cousins, once removed. Their closest common ancestor would be Huang's grandparents, which makes them first cousins, and they are one generation different from each other, when tracing from that ancestory, which makes them once removed.

1: https://news.ycombinator.com/item?id=36164991

klyrs · on June 2, 2023

I had to look... she's wearing short-sleeved leather jacket. Not all leather jackets are motorcycle jackets. Wearing short-sleeved anything on a motorcycle is senseless.

angry_octet · on June 2, 2023

No-one think Jensen is getting on a motorcycle, there is an implied motorcycle-style. And you can have short sleeve/ sleeveless motorcycle jackets, I've seen idiots wearing them many times.

klyrs · on June 2, 2023

I've seen people wearing short-sleeve t-shirts on their motorcycles too. That doesn't make them motorcycle shirts.

angry_octet · on June 2, 2023

Unfortunately the black turtleneck look has baggage.

andrewstuart · on June 2, 2023

If Lisa Su wants Nvidia's crown then she's going to have to actually compete.

That means creating the most awesome products possible at the lowest price practical.

In GPU's, AMD is doing the exact opposite.

And winning GPUs is what will allow winning AI.

Put another way, AMD simply does not compete with Nvidia - it trails along behind, trying to match the Nvidia products in specs, and being slightly less ridiculously expensive.

AMD GPUs are overpriced and AMD's most recent GPUs - the 7600 is garbage - this is not a strategy that is going to win any crown.

JonChesterfield · on June 2, 2023

It's a capability war. You don't need to be cheaper, you need to better.

andrewstuart · on June 2, 2023

AMDs GPU tech is close enough to Nvidia.

AMDs latest card the 7600 did not need to be garbage - this is a low end card so Nvidia can easily make it faster. But it very closely matches Nvidias card at the same level, the 7060, which is also garbage. AMD is simply copying Nvidia.

And you’re wrong…. People buy cheaper. It’s not the only factor but it’s a major factor.

JonChesterfield · on June 2, 2023

Nope. If you can run a language model on Nvidia that lets you fire a load of people, and you can't run it on amdgpu, AMD aren't going to sell you any hardware. Even if it was cheaper.

Even when you can run on both, electricity costs can dominate capital. I remember reading something that amounted to xeons cost so much more to run than epyc chips that even if the xeon chips are given away free it's cheaper to buy the epyc ones, assuming they'll be running for more than some smallish number of months.

sergiotapia · on June 2, 2023

It's pretty funny that Lisa Su is related to Jensen Huang. Like McDonalds and Burger King being owned by different brothers.

dchftcs · on June 2, 2023

It's not very surprising they'd be distant relatives, as a matter of odds.

Taiwan is a small place and Taiwanese Han people came from a relatively non-diverse group of immigrants (e.g. largely Hokkien). For these two to have got a good education and risen to the top like this as immigrants to the US, likelier than not they were from educated or rich families.

If you filter for Taiwanese families educated or rich during the years they were born, the scope would probably have been somewhat small and covered an even smaller range of bloodlines, possibly even all somewhat related to a handful of historical clans.

The degree of separation between people like them from a small geographical region tends to be fairly small in general.

KeplerBoy · on June 2, 2023

Maybe the bigger surprise is that both companies are led by Taiwanese Americans.

What's the story behind the semiconductor industry being centered at Taiwan of all places?

claritise · on June 2, 2023

Mostly government incentives... the taiwanese government had amazing foresight when most other countries didn't and was well positioned to attract their nationals back from the USA (at the time intel / TI veterans) to build out taiwain's hardware manufacturing sector.. which eventually led to the founding of TSMC.. and the rest is history. Right time and place, but more importantly, a government with amazing judgement and foresight.

peterfirefly · on June 2, 2023

> Like McDonalds and Burger King being owned by different brothers.

Adidas and Puma.

voodoomagicman · on June 2, 2023

and Trader Joes / Aldi

tootie · on June 2, 2023

Adidas and Puma were founded by rival brothers too.

ChuckNorris89 · on June 2, 2023

She's not related to him, that was some myth that spread online but it's not true, please stop spreading it without citing concrete sources.

sergiotapia · on June 2, 2023

> Technically, it is safe to say that Lisa Su's own grandfather is actually Jen-Hsun Huang's uncle. Although they aren't really niece and uncles, they are very close relatives.

https://www.techtimes.com/articles/253736/20201030/fact-chec...

polski-g · on June 2, 2023

Barack Obama and Donald Trump are 42nd cousins. Everyone is related.

saiya-jin · on June 2, 2023

Some decade and a half ago my now ex gf who is microbiologist told me that mankind at one time in the past went through some drastic filter and that all humans alive are descendants out of just 5 mothers. And they could have been related too in the past. So yes we are all one big family, not that it helps with anything

delfinom · on June 2, 2023

https://en.wikipedia.org/wiki/Mitochondrial_Eve

>One common misconception surrounding Mitochondrial Eve is that since all women alive today descended in a direct unbroken female line from her, she must have been the only woman alive at the time.[45] However, nuclear DNA studies indicate that the effective population size of the ancient human never dropped below tens of thousands.[49] Other women living during Eve's time may have descendants alive today but not in a direct female line.[50]

chasil · on June 2, 2023

There is less genetic diversity in the human species because of the evolutionary bottleneck that occurred.

https://en.wikipedia.org/wiki/Population_bottleneck#Humans

NayamAmarshe · on June 2, 2023

I think if AMD focused on consumer cards more, it could be a game changer.

Nvidia is nothing but overpriced and if AMD is able to offer something really cool for really cheap, it might spark an interest in the gaming community which eventually means a win in the overall global market.

taeric · on June 2, 2023

I want to believe this, but I have been burned way too many times in the past trusting anything other than Nvidia. The list of companies that were supposedly going to be better is rather exhaustive in the space. It is frustrating.

That is, what makes you think Nvidia is a) overpriced and b) not doing the best they can?

NayamAmarshe · on June 2, 2023

I bought an RX580 and it's still working great! It was an excellent card with no drawbacks and was relatively cheap too.

I'd like AMD to go back to the RX580 days, when the offerings were simple, cheap and made sense.

I also have an Nvidia card, I don't have any complaints but they are NOT cheap that is for sure.

Nvidia was charging extra for DLSS. RTX2050 vs GTX1660 had a huge price difference. When the leak happened we found out DLSS doesn't even require AI cores to run properly, it was purely a software limit. So they were really selling overpriced and underpowered hardware.

The bad thing is, Nvidia keeps increasing the prices because AMD is not competing well enough and there's no limit to Nvidia's pricing. Top of the line cards used to sell for $300 and we thought that was expensive.

bick_nyers · on June 2, 2023

Unpopular opinion but NVIDIA is not overpriced when you consider the fact that it's the only product that actually works.

I've also been burned so many times by AMD GPUs that I have more faith in Intel catching up in the ML space than AMD.

Would love to be wrong though.

dlivingston · on June 2, 2023

Would you mind commenting on how you've been burned by AMD GPUs in the past?

bick_nyers · on June 2, 2023

In DDR3 era heat, drivers, crashes (lots of crashes) during gaming.

In DDR4 era heat, drivers during game dev (substance painter, UE4 etc.). Lots of finicking with the card/drivers/software.

On my home NAS (Ubuntu for longest time, now Kubuntu) I've had more random issues with my RX 580 than my 3060, I know it's not the most fair comparison due to their age but still.

taeric · on June 2, 2023

For me, it was buying a graphics card that was endorsed by Intel that turned out to not support any of the advanced graphics that were happening at the time. I was less than happy. I think Matrox, back in the day, was also a bit of a disappointment. Supposedly they supported standards and were going to be amazing. Reality is they were not amazing.

Hamuko · on June 2, 2023

I've received two broken AMD cards in the last 30 days. It's also the very literal definition of "burned", as they're consistently hitting >110°C hot spot in minutes. The RMA process is also god awful and the software isn't that good.

It feels like there's actually no other option than Nvidia.

batman-farts · on June 2, 2023

I can agree on one point: if I want 3D acceleration to Just Work on Linux and I'm muting my inner Stallman, the Nvidia binary drivers have always enabled thtat for me. But on the gaming side, I definitely get the feeling that a bit of Microsoft syndrome is starting to set in at Nvidia: we're by far the market leader, so you'll take what we give you. DLSS is constantly pumped in their marketing (and by reviewers, who are sometimes adjunct marketers) as a no-brainer upscaling solution that you don't need to ever turn off. But I've had two games (Death Stranding and Marvel's Midnight Suns) crash repeatedly and unpredictably with DLSS enabled, then run happily stable once DLSS was turned off. I only even became aware of the Marvel game because it was advertised in their Game Ready! driver update, but both the drivers and the game clearly weren't ready. In that particular case, it was also primed to devolve into a circular firing squad between Nvidia, Epic providing Unreal Engine, and the game developer as to who implemented what wrong... something I think we'll probably continue to see.

As far as overpricing goes, I think the pushback (and AMD's pricing advantage) will definitely come on VRAM. I was only able to get a 3080 10GB close to MSRP when the GPU shortage started to abate, and people are already reporting that it's maxing out that amount on Diablo 4 at 1440p ultrawide max settings. Yes, there's been inflation, Moore's Law isn't what it used to be, and it had been years since I had bought a discrete GPU, but that doesn't change the fact that I've paid a premium price and I'm not future-proof for 4K or ultrawide, either of the two popular monitor upgrade paths. The bulk of this can be attributed squarely to Nvidia's desire to maintain market segmentation and profit margins. If AMD really can close the yawning CUDA gap on the software side and start to force more commoditization in the GPU market, it can only be a good thing.

taeric · on June 2, 2023

Certainly don't take my post as argument that they are the best company I can imagine. I'm fairly convinced this area is far harder than most internet commentary allows.

dpflan · on June 2, 2023

Indeed, consumer/edge. There is of course model training and model execution. NVIDIA seems really poised for training, of course can be nice for execution at cloud scale, but consumer/edge is probably all about execution.

tpmx · on June 2, 2023

Well, no. AMD's AI future is not about consumer cards.

jpgvm · on June 2, 2023

Jim Keller saved AMD and chances are his company Tenstorrent will end up being a big dog in AI. AMD should buy his startup in an all-stock deal given the lofty AMD valuation. Bring the maestro back, win the AI generation of chips.

tpmx · on June 2, 2023

He does hardware, not software (wherein AMD's problem lies).

Hardware-oriented companies often struggle to build good software stacks for the simple reason that they don't know what's good or bad in software, so they don't know who to hire.

JonChesterfield · on June 2, 2023

Some hardware guys think software is easy. You can fix it in the field and most stuff is hacks in tcl, nothing difficult there. This leads to grossly inaccurate ideas about complexity and time in software development.

In the worst case, you get a model where the software guys aren't worth paying well because their stuff is easy, and somehow their stuff is always late and broken anyway so they can't be worth much.

Then noone can sell the hardware because it's worthless without the code and the company fails and it's all the fault of the software engineers who couldn't do the easy thing that, uh, the hardware guys also couldn't do.

williamDafoe · on June 2, 2023

Jim Keller had nothing to do with the rx6000 series of AMD graphics cards and this is the best series that AMD has ever produced! I think you have the credit Lisa Su with firing Raj Koduri who went to Intel to make his flaky-driver hot graphics cards there! Raj is now laid off at Intel, too!

silisili · on June 2, 2023

I'm actually curious to see how it plays out. There's no shortage of bickering Keller vs Su here, it seems every AMD thread devolves into it.

If Tenstorrent 'wins', while Intel is able to catch and beat AMD, I think that would prove the strongest case for Keller.

If AMD keeps staying as far ahead and they have been, it makes a strong case for Su.

The truth is likely a combination of the two.

foota · on June 3, 2023

This is somewhat off topic, but does anyone know why they haven't tried putting SSD in a GPU like chip for inference? My understanding is that SSDs have a different access pattern for reading that adds latency vs RAM, but I don't otherwise know why it couldn't serve as a high bandwidth source for reads (ignoring write latencies and durability concerns).

anoother · on June 3, 2023

It has been tried once -- Radeon Pro SSG

foota · on June 3, 2023

Maybe. I'm not sure how it was implemented and I didn't see much in depth discussion about it, but it seems to me like that's more just a normal SSD sitting on the card, I was thinking something closer to an SSD with a vastly different (and likely more expensive) interface allowing for much higher bandwidth than an SSD typically offers, since there's not much advantage to having the SSD if you can't read it into cache quickly.

contingencies · on June 2, 2023

"AI in everything". Err, yeah. AI in your breakfast cereal. What's the point? Show me the applications, IMHO most of the time it's not adding value. LLMs generating mediocre content faster than humanity is not a long-term business model. Ten thousand CS grads running ROS with standard vision algorithms who don't understand it's the least efficient way to approach most problems. We've got the west jumping up and down about how China or Russia's being denied chips, but the fact is chips from many generations ago are adequate for most deployment scenarios today. Anything hyper specialist is sold on-sensor-chip anyway, so where's the relevance for AMD going forward as desktop declines and Samsung and Apple have their own mobile solutions? Do large scale chip makers really run on such an extremely faddy, hypey, business model, or is Su just running out of steam?

barbariangrunge · on June 2, 2023

Smart light bulbs, powered by GPT-5

ByThyGrace · on June 2, 2023

Hallucinating invisible light in the UV spectrum instead when asked to turn itself off.

bee_rider · on June 3, 2023

Somebody put chatGPT in my Cheerios, now they can intelligently spell ooooo.

Fervicus · on June 3, 2023

Should have gone for Alphabet Soup instead.

DarthNebo · on June 3, 2023

They should undercut Nvidia at both pricing for datacenter cards & increase CUDA like framework adoption with developer accessible cards & boatloads of VRAM. Dev's wont buy cards unless they demonstrate perf/$ over nvidia for standard models

1-6 · on June 2, 2023

AMD can take it with FPGA's and their Xilinx purchase. GPUs are unsustainable in the long run.

andrewstuart · on June 2, 2023

I’m not seeing the crown winning AMD strategy:

“AMD fails again: Radeon RX 7600 Review” https://m.youtube.com/watch?v=Yhoj2kfk-x0

“AMD is a Mess: Radeon RX 7600 GPU Review & Benchmarks” https://m.youtube.com/watch?v=MCxYfXe1DAA

“AMD Radeon RX 7600 review: Another water-treading midrange GPU for $269” https://arstechnica.com/gadgets/2023/05/review-amds-269-rx-7...

“Equally Disappointing - $269 AMD Radeon RX 7600 Review”

https://youtu.be/1JFSwWanIPw

bee_rider · on June 3, 2023

These all seem to be reviews for the RX 7600? I don’t think their whole plan will unravel because of one (based on the headlines) miss, but if it does, then it was a bad plan.

andrewstuart · on June 3, 2023

The 7600 perfectly illustrates the heart of AMDs problem.

AMD now simply copies Nvidia and makes their price a little lower but still way too high.

Even when Nvidias product is garbage - that's what AMD does too.

The 7600 is a microcosm of AMD and its strategy - don't built awesome cards at highly competitive prices - build whatever Nvidia is building and make it a little bit cheaper.

AMD and Nvidia's new GPUs are barely faster than the previous generation from years ago, and in many cases, slower.

It's not a winning strategy because its not giving customers what they want.

adamsmith143 · on June 2, 2023

This is a cute headline but let's see AMD put out some software to actually run DNN training on their hardware. Currently either impossible or too onerous to be worth the trouble.

renewiltord · on June 2, 2023

More RAM and CUDA-equivalent quality. Otherwise worthless.

fennecfoxy · on June 5, 2023

I guess this has basically been put out bc Nvidia hit 1T (or were close to when the article was being written).

m3kw9 · on June 2, 2023

Gonna take a collosal screw up by Nvidia to give that up

mtkhaos · on June 2, 2023

I'm looking forward to seeing a Chiplet DPU similar to Nvidias offering. There is a lot room for growth especially for the professional market.

Decabytes · on June 2, 2023

I think the work that Lisa Su has done has been great. But their graphics division has been playing catch up with Nvidia for years.

williamDafoe · on June 2, 2023

NVidia loves to cite steam surveys to mislead customers but the truth is they have 0% market share in consoles (not in steam surveys) and the new AMD M780 iGPU (7940hs APU) just wiped out NVidia's entire MX product line and is literally 2x faster than Intel's best iGPU and equal to an m1650 laptop GPU! If they are not careful NVidia could fall below 50% of the graphics market very very soon!

I remember a time when ATI 9600 was the #1 card on the market. It can come again, and quicker than you think ...

cypress66 · on June 2, 2023

Market share on consoles is just a matter of business deals. On PC instead, market share is more of an indicator of how good a product is because people get to choose the GPU.

In terms of technology Nvidia is quite ahead, and has been for more than a decade.

> the new AMD M780 iGPU (7940hs APU) just wiped out NVidia's entire MX product line and is literally 2x faster than Intel's best iGPU and equal to an m1650 laptop GPU!

Pretty sure Nvidia doesn't give a shit about such a low end market segment.

sagarm · on June 3, 2023

Just a matter of business deals -- but Su "played a prominent role" in those deals, at least according to Wikipedia. Those deals kept AMD alive.

shaklee3 · on June 3, 2023

Do you have a source on that? Consoles are usually loss leader type products and bring more recognition than money. There's a reason the systems are so cheap.

tredre3 · on June 2, 2023

> but the truth is they have 0% market share in consoles

125.62 million Switch sold would beg to differ.

Dalewyn · on June 3, 2023

>If they are not careful NVidia could fall below 50% of the graphics market very very soon!

If we're gonna talk about GPU market share, the dominant leader has always been Intel. Do you have an Intel CPU? It has an Intel GPU inside.

rowanG077 · on June 2, 2023

We desperately need CUDA to die. I have some hope the EU will see the light and ban CUDA if Nvidia refuses to open it.

JonChesterfield · on June 2, 2023

Nah. It'll be killed by running C++ on GPUs. No more gnarly bespoke language, just a header with intrinsics hiding in it. Doesn't need anything other than time and evolution to push it off the cliff.

rowanG077 · on June 4, 2023

I don't think that will ever happen. C++ is fundamentally a sequential language. It just doesn't match the execution model of a GPU.

CamperBob2 · on June 2, 2023

What keeps AMD from clean-room reverse-engineering CUDA, a la Compaq? Patents? If so, that's a problem caused by government, so I wouldn't look to them to fix it.

mikepavone · on June 2, 2023

I'm not sure why AMD hasn't just directly implemented CUDA, but they did ship HIPify which can translate CUDA code to HIP. Doesn't work for everything, but seems to work for a lot of important ML code. ROCm stack doesn't seem very fully baked though. Only a handful of consumer GPUs are officially supported and stability seems less than great

JonChesterfield · on June 2, 2023

I'd assume it's fear of being sued by Nvidia. Probably quite well founded.

CamperBob2 · on June 4, 2023

Right. A problem caused by government-sanctioned IP rights.

osti · on June 2, 2023

I think attributing AMD's current success to any one person is an insult to all the other employees of the company.

cdibona · on June 2, 2023

It does ignore a ton of contributions from a ton of people, but that is the job of the CEO and cofounder. If anything I always want to know more about Chris Malachowsky, Nvidia's cofounder. Their relationship I thought was pretty special, esp in the beginning. Nothing like the woz/jobs relationship, imo.

nightowl_games · on June 2, 2023

If it was a failure would you feel comfortable saying the buck stops with Lisa?

The CEO has more power and responsibility. They receive blame for company wide problems and conversely deserve credit for company with success.

osti · on June 2, 2023

If it was a failure I wouldn't attribute that to Lisa Su either. Because AMD's success kinda started around the time Lisa Su joined, the products they came out with at the time certainly didn't have much to do with Lisa, technology wise or strategy wise.

There are definitely many cases where CEO should bare responsibility for the company's success or failure, but definitely not in the case of AMD.

cmsj · on June 2, 2023

Dr Su became AMD CEO in 2014 when they were deep in the ruinously awful Bulldozer CPU architecture.

The first Zen products didn't launch until 2017 - that is when their "success kinda started" at least in the CPU market.

Here's a quote from Suzanne Plummer, senior director of the Zen project, in September 2015: "This is the first time in a very long time that we engineers have been given the total freedom to build a processor from scratch and do the best we can do."

You could make a reasonable argument that supplying the GPUs for PS4 and Xbox One, both of which pre-date Dr Su's appointment, is their "success kinda start[ing]", but one could also argue that AMD was really the only vendor capable of providing the APUs needed.

osti · on June 2, 2023

CPU development cycle is definitely longer than 3 years, especially for a big change like that. Zen architecture design certainly started before 2014. So I think it's fair to say that she started around the time AMD became successful.

burnte · on June 2, 2023

It's not, though. Zen didn't launch until 2017, she started in 2014. Early work began in 2012, yes, but it was on track for 3 years under her before launch, she deserves a lot of credit as leader to bring that project to fruition.

cmsj · on June 3, 2023

Technically it was somewhat under her from its inception - she was a VP and general manager of AMD from like January 2012, months before Jim Keller was hired.

I dug up an old interview from Anandtech that I'd forgotten about when I first wrote my comment above, which has this quote from Dr Su:

"If I put credit where credit is due, Mark Papermaster had incredible vision of what he wanted to do with CPU/GPU roadmap. He hired Jim Keller and Raja Koduri, and he was very clear when he said he needed this much money to do things. We did cut a bunch of projects, but we invested in our future."

burnte · on June 3, 2023

Thanks for the quote, that's a great leader.

scrlk · on June 2, 2023

"Designing microprocessors is like playing Russian roulette. You put a gun to your head, pull the trigger, and find out four years later if you blew your brains out." (attributed to former DEC CEO Robert Palmer)