Intel: A Bug and a Pro (abortretry.fail)
75 points by rbanffy 11 months ago | hide | past | favorite | 44 comments


Interestingly, Linux uses "Coe's ratio" (4195835/3145727), dating from 1994, to check for the FDIV bug.

Coe's ratio: https://people.cs.vt.edu/~naren/Courses/CS3414/assignments/p... (3rd page)

Linux code: https://github.com/torvalds/linux/blob/b0cb56cbbdb4754918c28...


About the F00F bug, and a nice read.

Edit:

It seems IBM stopping shipment of these buggy PCs forced Intel to fix the problem. I cannot imagine today's IBM doing something like this.


No, F00F was a different bug from a few years later.


Yep, locked the PC up dead.


> I cannot imagine today's IBM doing something like this.

Today's IBM doesn't ship Intel servers, or Intel anything. They sold that part of their business to Lenovo.


People forget that the company known as IBM went bankrupt, was sold, and was reborn as a consulting enterprise like Accenture or McKinsey. The only thing that remains from the olden days is the servers, in particular the z-series, the modern-day equivalent of mainframe solutions (and they are pretty awesome).


Sounds like you might be personally familiar with (their?) mainframes.

Could you explain what's awesome about them, and what even are mainframes? Whenever I tried looking into this, I walked away with "just servers networked together".


They almost went bankrupt. Starting with Lou Gerstner, IBM cut out the least profitable parts (commodity PCs, then, later, commodity servers), and focused on services (which also acts as a sales enabler) and high-margin hardware and software.


However, I was quite concerned at the time when they said, quite incorrectly, that there was no risk to their users from this bug.


What is the risk? It seems very small to me.


The risk is that some divisions will be fully off. If it is a chain of calculations, e.g. some stress analysis, or a spreadsheet involving a chain of calculations for a financial report, it could be bad.


Divisions being off isn't the end of the world. Even without the bug the division can be off due to the use of fixed-precision floats.

Stress analysis and financial reports are more likely to be wrong due to other sources of error than a division being slightly off. If you really wanted exact numbers you wouldn't be using fixed precision floats anyways.


From the wikipedia article:

Abrash spent hours tracking down exact conditions needed to produce the bug, which would result in parts of a game level appearing unexpectedly when viewed from certain camera angles.


Yet they thought the 1-frame flash was insignificant enough to ship the game with it instead of spending time to work around the bad division. But thank you for providing an example.


Alright, then quantum chemistry simulations. It's very common in the field to have algorithms with known error bounds given a certain floating point size and to choose a size amenable to the scale of simulation you intend to attempt. If some of your computations are at half precision, the results are hosed.


Most consumers are not doing quantum chemistry simulations.


This is a perfect example of "normalization of deviance".


aka six sigma?


You don't need precise numbers to figure out whether your bridge will stand. What you need is a calculation designed to be robust to the errors incurred in measurement and computation.

The standard for floats guarantees you specific and precise error bounds that you can use to do an error analysis for your whole calculation. Most likely whatever engineering software you use to check your bridge design, will already have this error analysis baked in.

If you introduce some arbitrary other errors, you'd have to redo your error analysis from scratch. And it might not even be tractable, depending on the errors introduced. (The standard floating-point error guarantees are designed to behave reasonably well, and predictably, when combined into a larger calculation.)


You just have no idea what you're talking about. People get killed when things go wrong, and this "oh well other problems are probably worse" attitude is dangerous.

There's no such thing as exact numbers, but there is such a thing as reliable models. The errors introduced by calculating with numerical methods are studied and well understood; a processor not exactly following the rules it's supposed to is an enormous problem.

Here's a little introduction to condition numbers and how they're used to understand floating point error introduced in calculations:

https://www.cs.cornell.edu/~bindel/class/cs6210-f12/notes/le...


The FDIV bug is not theoretical. It existed and no one died from it. People love to come up with theoretically how the bug can cause terrible things to happen, but in practice it didn't. The next run of the processor had the fix and the world moved on.


1. Intel wasn’t very popular for scientific computing in 1994

2. No one was stupid enough to make life critical calculations on Intel after it was discovered and widely publicized

You, on the other hand, are suggesting it was no big deal and acting like people doing important work should have just ignored the bug. The reason bugs like this didn’t kill people in a large disaster is that folks with your disposition weren’t in charge of making decisions that would have led to that.

They did a recall that cost Intel a billion dollars adjusted for the present. It wasn’t just ignored.


>and acting like people doing important work should have just ignored the bug.

No, I am acting like the average consumer could have ignored the bug. There wasn't a need to do a mass recall of every chip as the chip would still be fine for most users. Yes, there was a recall for people who needed it to work correctly, but in practice not everyone needs it.


Yeah but when the bug triggered you only got like eight digits worth of floating point.

The article says IBM expected normal users to hit it every few days.


Hitting the bug doesn't mean that it would cause a practical issue for the user.


And another note:

Locked reads must be paired with locked writes, and the CPU's bus interface enforces this by forbidding other memory accesses until the corresponding writes occur. As none are forthcoming, after performing these bus cycles all CPU activity stops, and the CPU must be reset to recover.


> If you really wanted exact numbers you wouldn't be using fixed precision floats anyways.

Let the adults play with things that need to work exactly as documented (such as IEEE 754 floating point representations) and therefore can be relied upon when required. You can go back to building your little unreliable toys that nobody uses.


There is no need to belittle me while not providing a practical example where the average consumer can be harmed by this bug.


> not providing a practical example where the average consumer can be harmed by this bug.

Why is a practical example necessary in this case? Why are you not able to recognize the very serious harms that were already described by people 30 years ago and during the intervening time? Why are you demanding that I spend my time to find and give you that information instead of you? I am not your personal tutor.


Look at my original comment. I was asking for clarification on why the other person believes Intel was wrong and a risk actually was present. Instead of backing up the claim, people swarmed me with hypothetical scenarios without showing those scenarios were common enough to actually cause a problem. I am not demanding your time. You were the one who joined the conversation smugly asserting you knew better than me. You could have just ignored me if you didn't want to answer my question.


Humans are potentially harmed when developers use an unsigned int as a counter and it rolls over to zero. Or a byte, in the case of medical radiation machines.

I guarantee that if you had access to a full NNTP text dump from this era you'd find some "harm".

Intel is dead, long live Intel.


>when developers use an unsigned int as a counter and it rolls to zero

Yet, people wouldn't expect to return their CPU if this happened. The entire technology stack of a computer is filled with bugs, yet people are able to use them to great utility every day.


Therac-25?


The average consumer with a pentium didn't have it in a radiation therapy machine.


But those who used a Therac 25 wouldn't be happy.


Therac 25 didn't use a Pentium, nor floating point, and if it did, a 0.0001% increase in the radiation dose would be unnoticeable.


It didn't, but the joke is still funny.


Should have mentioned that Intel marketing and PR launched the User Test Program to get early PPro systems into the hands of advanced users and make sure there were no lurking FDIVs. Nicely was the first recipient. So was John Williams, the composer.


John Williams from Star Wars or John Williams from Sky?


rim shot: https://youtu.be/QgbgUrp1a70?t=67

FWIW I bought (one of) Kevin Peek's guitar amps in 1982 in Kalamunda (W. Australia) .. it was just an ad in the classifieds; got out there and it was a damn near world-class music studio in a one-room music cabin in the bush.


Related discussion some time back - https://news.ycombinator.com/item?id=42535071


I am just shocked to see a MIPS R8000 reference in 2025. It was a relatively obscure CPU, targeted at HPC and weaker at integer workloads. I worked on a lot of cool stuff, but that project was probably the most fun.


Care to expand?

Also I often wonder what the aggregate statistics of `top` load average and standard deviation looks like over billions of consumer devices over time.


Intel marketing is responsible for a large number of despicable decisions over the years, but I consider by far the most despicable to be the segmentation of their CPU products into Pentium and Pentium Pro.

Later, they dropped the "Pro" naming scheme, and the successors of the Pentium Pro have been branded "Xeon" to this day.

IBM had wisely incorporated memory error detection as a standard feature of every IBM PC, so that was also true of all IBM PC clones.

By the early nineties, when memory chips in dual-in-line packages had been replaced by memory modules, you could buy modules with error detection, as well as slightly cheaper modules without it, so a computer owner could choose either. I am not a gambler, so I always used only modules with error detection.

However, that changed in 1994, when Intel decided to split their CPUs into the Pentium for "consumers" and the Pentium Pro for "professional users" willing to spend much more on a workstation or server.

This is when Intel decided that, in order to push customers toward the overpriced "Pro" CPUs, memory error detection had to be removed from the "consumer" CPUs.

While in 1993 the first generation of Pentium computers still had memory error detection, in the second generation in 1994 (with the Triton chipsets) it was removed, in preparation for the launch of the Pentium Pro the next year.

We will never know the value of the financial losses inflicted worldwide on naive computer users by this Intel decision.

Fortunately for Intel and unfortunately for us, software bugs have always been so frequent that computer users have been conditioned to assume automatically that whenever the computer crashes or data corruption is discovered, the cause must have been some software bug, and that the exact culprit is difficult or impossible to determine.

Despite this common assumption, many of these incidents may be caused by hardware memory errors, and besides the noticed incidents there may be many other cases of data corruption that have never been discovered.

Intel's claim that removing memory error detection was done for the benefit of the customers, to reduce the price of computers, is of course false. After it became impossible to have memory error detection in "consumer" PCs, there was no price reduction in motherboards or memory modules; prices remained the same, their vendors enjoyed increased profits, and so they enthusiastically supported Intel's initiative.



