Another NASA screw-up that they're trying to pin on the vendor engineers, just like Challenger.

The title is not reassuring. Conservatism in engineering is essentially about creating safety margins through conservative estimation. The title is saying we need to be careful because a tile was likely penetrated. Hell, if I remember correctly they were reporting that there was known tile damage on the news before reentry, but that they didn't know the extent.

"NASA felt the engineers didn’t know what would happen but that all data pointed to there not being enough damage to put the lives of the crew in danger."

If you thought they didn't know, then ask them what they do know! It's right on the slide that flight conditions are outside of test parameters and that the mass of the projectile was much higher. How the F do you work at NASA and not understand the basic principles of mass, velocity, and energy well enough for that to stand out, prompting you to ask questions or run your own calculations...
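
For a sense of scale, a back-of-envelope sketch in Python (the only numbers taken from the slide are the two volumes; everything else follows from kinetic energy being 0.5*m*v^2, so at a fixed impact velocity the energy scales directly with foam volume):

    # Rough energy-scaling sketch. The volumes are the ones quoted on the
    # slide; foam density and impact velocity cancel out of the ratio.
    test_volume_cuin = 3.0        # projectile volume used in the crater tests
    flight_volume_cuin = 1920.0   # estimated volume of the foam ramp that hit Columbia

    # KE = 0.5 * m * v**2, and mass is proportional to volume for the same
    # foam, so at equal velocity the energy ratio is just the volume ratio.
    energy_ratio = flight_volume_cuin / test_volume_cuin
    print(f"~{energy_ratio:.0f}x the test impact energy")  # ~640x

A ratio like that should jump off the page at anyone reviewing the data.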

The slide is laid out the way it is because it's describing the thought process and building a deductive argument for how they got to their concern. This is a presentation for a briefing of other engineers, not a conference or sales pitch. It's supposed to be formal and contain a synopsis of the technical points. Using projectors for technical briefings predates the use of PowerPoint. I see nothing wrong with the layout in that context.

Edit: why downvote without a reply? NASA has a history of blaming vendors when they screw up. This looks like another example to me. The presentation format does not have any issues given the setting and target audience.



Tufte is not an employee of NASA. He is an employee of Yale, and a thought leader in information design. He was the one saying the slide design is poor, and doing so not in the interest of assigning blame, but in the interest of highlighting ways to communicate better.

To say that "the slide doesn't have any issues" is laughable on the face of it. But it's immaterial; your claim is that "NASA just ignored the engineers from Boeing" rather than "NASA didn't understand the engineers from Boeing". Communication is a two-party process, and believe it or not, NASA isn't actually incentivized to take risks that lead to loss of life and damage its public perception; it's far more likely they didn't understand the stakes, and looking at the slide from that perspective, it's very easy to see why they would not have understood the stakes even if the Boeing engineers did.


I understand that he's not a NASA employee. Do you think it's fair to claim that the slide killed 7 people? I don't. Could it be worded better or have a better layout - sure. But there's no problem with that slide that would support the claim that 7 people died because of it. The information was there.

As you said, it is a two-way street. Slides are accessories. Do you have a record of the conversation that unfolded during this slide and presentation? Did the audience ask questions about things they didn't understand?

Is there even any evidence that NASA didn't know about the damage, or that they had a rescue plan?

You claim they wouldn't have taken the risk, yet if I remember correctly they had no rescue plan and gave a relatively low (70%-ish, maybe) survival rate. Low-level employees did raise concerns about the severity of the damage. This seems to support the idea that the communication between the vendor and NASA was sufficient, since some NASA employees shared the same view.

https://www.dailymail.co.uk/news/article-2271525/It-better-d...


There was serious consideration given to sending Atlantis on a rescue mission, as Columbia was not capable of rendezvousing with the ISS to use the latter as a lifeboat. To your point, subsequent missions were required to have a formal rescue mission outlined.


Agree. Also where is the executive summary for that slide on that slide?


> Also where is the executive summary for that slide on that slide?

Q: Shouldn't every PowerPoint slide be an executive summary? PowerPoint can be a terrible way to [attempt to] present detail.


I believe they were using that question to point out the absurdity of making a slide that long, by suggesting that it be made even longer still with a summary of itself.


> The title is saying we need to be careful because a tile was likely penetrated.

It is a godawful title. I believe that is how it reads to you, but it reads as the total opposite to me. I read the title and it translates in my head as “we reviewed the test results and they suggest that the tiles are built sturdily enough not to get penetrated”. Exactly because of what you say: the word “conservatism” means to me that a system is designed to meet its loads plus a reasonable safety margin. So if the review of test data indicates conservatism, that means to me that the tests found the test object robust even with a reasonable safety margin. Otherwise I wouldn't say that it “indicates conservatism” but that it “indicates a lack of safety margin”.

> The slide is laid out the way it is because it's describing the thought process

I agree, but that is not a good thing. People think in all kinds of haphazard ways; before you communicate to others, it is on you to look at your ramblings and make them orderly. The penultimate sentence is the most important one and should go first: “flight condition is significantly outside of test database”. That doesn't mean that the tile is broken, nor does it mean that it is not broken. It means that we can't tell from our tests.


I read the title as saying "be careful when predicting tile penetration", based on one of the first points they make being that the method used to predict penetration "overpredicted penetration of tile coating significantly".

Though in fairness they then have sub-points that kind of contradict this conclusion rather than support it.

But honestly this slide would make more sense if several of the sub-points were top-level independent statements instead.


> based on one of the first points they make being that the method used to predict penetration "overpredicted penetration of tile coating significantly".

That point read to me as if they completely overestimated the damage.


> Another NASA screw-up that they're trying to pin on the vendor engineers, just like Challenger.

And, for context, Edward Tufte, whose review of the slides is referenced in this article, is the same one who misunderstood and misrepresented the actual issue raised during the briefing the night before the Challenger launch, in his paper reviewing the presentation the engineers made then.

Edit: previous HN discussions of Tufte's Challenger review here:

https://news.ycombinator.com/item?id=10989358

https://news.ycombinator.com/item?id=19034783


Yeah, it's very hard to say that Challenger was because of a miscommunication from the vendor engineers. They wrote out directly [0] "Recommendations: O-ring temperature must be > 53F at launch", the temperature of the air at launch was 36F, and measurements on the solid rocket boosters (where the O-rings are located) got down to 25F and 8F [1].

But their recommendation was challenged by the NASA SRB managers. And after an offline discussion the SRB vendor came back and changed their opinion, saying it was safe. And the NASA SRB manager never brought up the O-ring temperature concern to the rest of the management team.

[0] https://history.nasa.gov/rogersrep/v1p90.jpg

[1] https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disas...


> their recommendation was challenged by the NASA SRB managers. And after an offline discussion the SRB vendor came back and changed their opinion, saying it was safe.

Yes, because the Thiokol managers overrode their engineers and said it was OK to launch.

> And the NASA SRB manager never brought up the O-ring temperature concern to the rest of the management team.

That was probably because NASA had already classified the O-ring issue as a Criticality 1 flight risk, which is supposed to mean that a failure could result in loss of vehicle and loss of life and that the Shuttle can't fly until the issue is resolved--and then waived it. So it wasn't as if the O-ring issue was a new one or that NASA wasn't already aware of its seriousness.


Wow, I didn't know that. What a poor track record.


Maybe, but I don't think anyone can deny that it is an appalling slide. I count 3 spelling and grammatical errors alone.

It looks like the bad "before" example in a presentation skills workshop. This was created by engineers working on life and death issues involving billions of dollars of hardware.


> Maybe, but I don't think anyone can deny that it is an appalling slide. I count 3 spelling and grammatical errors alone.

The slide from the post that you're referring to (which is 16:9) is a fabrication, probably created for this post, and is not a faithful copy of the actual deck that was presented in 2003. The actual slide—complete with a Boeing watermark, sans some of the errors and presentation issues we can see in the blog post, and in 4:3, of course—can be seen in Tufte's book.

Who knows why the slide in this post was fabricated (and why the author failed to indicate this fact anywhere).


I agree with you, but to be charitable, this was a real-time and evolving situation where I'm willing to bet the slide was expected to be finished "yesterday".


This seems to be the kind of slide made by someone without a lot of experience presenting work.

The other issue is, some people really, really don't want to speculate.

In this case it seems that the person who made the slide probably assumed there was a high enough probability that the tile could be broken. But because it was outside all available data, the slide says that we don't really know.

Of course, anybody in a position to make such a go/no-go decision should have enough experience talking to engineers, and seeing this effect in action, to recognize the slide for what it is. It is really weird to conclude that, based on an absence of data, it is probably safe.


>It is really weird to conclude that, based on an absence of data, it is probably safe.

Considering that's exactly what happened nearly 20 years earlier with Challenger, it seems to be more common and likely the result of a number of cognitive biases. We read these with some hindsight and are disconnected from all the other pressures (schedule, budget, peer, etc.) they are dealing with at the time.


That points to a far more fundamental problem, related to information processing higher up in the organisation. Just making better slides is unlikely to solve that problem.


Probably correct, and I have doubts that those types of problems are easily fixed because they're rooted in human psychology. It's interesting to me that the "big" incidents seem to occur every 15-20 years, almost as if there is a new professional cohort who has to learn the hard way. I do think clear communication is a necessary, but insufficient, element of fixing that problem.


One thing I wonder about with these kinds of accidents: to what extent does operational experience work its way back into the requirements for components?

For example, if pieces of foam were regularly hitting the tiles after launch, was handling that part of the specs for the tiles? Did anybody go back, take a worst-case scenario of a piece of foam hitting a tile (size, speed, etc.), and verify that the tiles could handle such an impact?


They'll generally use a Failure Mode Effects Analysis (FMEA). So in this example, designers would identify all the ways a tile could fail and the consequence and probability of each failure. They then go through the process of mitigating it. The order of precedence for mitigations is 1) remove the hazard, 2) engineer around the hazard, 3) administrative controls (like standard procedures), 4) personal protective equipment. They iterate on this until the risk is within an acceptable range. All those mitigations become requirements.

So let's say they identify a tile failure mode as "tile struck by object". They assign a worst-case severity to that. Let's say they knew how bad it could be and they assign a severity of "loss of crew." Then they have to identify all the ways the tile could be struck and assign probabilities to that event happening. They use a matrix that maps the severity and probability to arrive at a risk classification. If the classification is higher than their threshold, they add mitigations that either reduce the severity or the probability (or both) until it's within an acceptable risk range.
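
As a toy sketch of that severity-times-probability mapping (the category names, ranks, and thresholds below are illustrative placeholders, not NASA's actual scales):

    # Illustrative FMEA-style risk matrix: severity and probability each get
    # an ordinal rank, and their product maps to a risk classification.
    SEVERITY = {"negligible": 1, "marginal": 2, "critical": 3, "loss_of_crew": 4}
    PROBABILITY = {"improbable": 1, "remote": 2, "occasional": 3, "frequent": 4}

    def risk_class(severity: str, probability: str) -> str:
        score = SEVERITY[severity] * PROBABILITY[probability]
        if score >= 12:
            return "unacceptable"  # must mitigate before flight
        if score >= 6:
            return "undesirable"   # mitigate or formally waive
        return "acceptable"

    # Worst-case severity paired with a low assessed probability can still
    # land in the acceptable band -- which is how a loss-of-crew failure
    # mode survives review when the probability estimate is wrong.
    print(risk_class("loss_of_crew", "improbable"))  # -> acceptable

Note how sensitive the whole classification is to the probability estimate.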

There's lots that can go wrong with this process, though. You obviously have to be able to identify the failure modes. Is there some off-the-wall failure that nobody could foresee? Maybe. Then you have to have good enough data to objectively determine the risk. In this case, I wonder if all the previous foam strikes led them to discredit the risk as being improbable/negligible to cause that failure mode. Add to that, the PowerPoint seemed to imply the model they used was too conservative (it was believed to overestimate the actual penetration). I know people involved in some hypervelocity testing of the foam and they were legitimately surprised at the way the foam acted when it was fired at higher speeds. So in this case, the risk was probably unknown beforehand, although they assumed they understood the risk sufficiently. To quote Mark Twain, "What gets us into trouble is not what we don't know. It's what we know for sure that just ain't so."

That's just one system on an immensely complex machine. It's easy to sit back with hindsight and say "Well, they shouldn't have made a decision until they did additional testing to get the data." But if they did that to every system on the Shuttle, it likely wouldn't have left the ground. In practice, engineers deal with all kinds of other cost and schedule constraints.


The issue is not that they had to ground the Shuttle until they had the data. The issue seems to be that foam was hitting the tiles with parameters outside their test database.

Why didn't they go back and test with 'real world' foam sizes?


I can only speculate.

I would push back on the idea that they would not have to ground the Shuttle. If they thought the foam could cause a loss of crew, they would ground the Shuttle until they fully understood the problem. That's exactly what happened in the aftermath of Columbia.

>Why didn't they go back and test with 'real world' foam sizes?

That's exactly what they did after the incident (while the Shuttles were grounded). If you're asking why they didn't do that beforehand, my assumption is they already had a model that they felt they could use. According to the subject PPT slides, they even thought that model was overly conservative. In addition, while foam-shedding was out of spec, it was considered "in family", meaning that they knew of the issue and felt it was not a flight safety issue. Both their physical and mental models of the phenomena were, at best, incomplete, but they didn't know that at the time.


So in your opinion, the slide said that with the impact of the foam it would have been very unlikely for the tile to have failed? In that case NASA's interpretation of the slide was correct.

Which is weird, because the slide also mentions that a small increase in energy can have a disproportionate effect.

I find it weird that they would rely on their model (for extrapolation) when they knew that the behavior of the tiles is non-linear. If they knew that the real world was outside their testing parameters and they decided not to test, then that sounds to me like a very serious omission.

I.e., it is weird to extrapolate tests to something 600 times bigger, certainly when it comes to impacts on ceramics.


Here's how I would interpret the slide, doing my best to prevent hindsight from biasing my opinion (since we already know what happened, it's tough to do).

Bullet 1: We looked at all the model data

Bullet 2: The model tends to predict deeper penetration than what we see in practice.

Bullet 3: The model penetration is related to particle velocity

Bullet 4: The penetration is related to the particle mass and surface area (they say volume)

Bullet 5: The foam is soft, so it takes a lot of energy to penetrate the hard ceramic tiles

Bullet 6: It is possible for the foam to penetrate the tiles, though, given enough energy

Bullet 7: If the foam does penetrate, it can cause a bad day

Bullet 8: It doesn't take much beyond the penetration energy to cause a bad day

Bullet 9: We haven't run tests that match the conditions of the strike so we don't have good data

Bullet 10: The foam piece is much, much larger than the stuff we tested

Given that, I would summarize it to say "The foam strike is much larger than what we've tested. All we know is this may mean there was significantly more energy involved, and if it's above the penetration threshold it can be bad. But the model seems to be overly conservative regarding penetration"

Now, the difficult decision is in the constraints. The astronauts didn't have the fuel to get to the ISS. They didn't have EVA suits to attempt a repair or evaluate the damage. There was no plan in place for a rescue mission. Atlantis was being prepped in FL, but was not currently ready. The astronauts probably have, at most, 30 days of oxygen.

Option 1: Allow for re-entry. Some had pegged this at about a 30% chance of success, but I have no idea what that is based upon.

Option 2: Scramble Atlantis to try a rescue mission. This is very risky for a number of reasons. Atlantis wasn't ready, meaning it would have to be rushed, increasing the chances that errors occur. Also, this type of mission was never attempted before, where two Shuttles are within spitting distance and astronauts have to migrate from one to the other. In order for the timeline to be feasible, a decision must be made within 1 day, 2 tops. This would then potentially risk two Shuttle crews instead of one, with an inherently risky mission.

I don't know what other options were available. Given when it occurred in the launch window and how long it took to understand there was a problem, an abort wasn't possible. The Shuttle did not have a launch abort engine like capsules do. Keep in mind, these decisions have to be made under large amounts of uncertainty. The extent of the damage was not fully known. People could scramble to run additional tests to gather data, but by that time the window on Option 2 may have closed. It's easy to armchair-quarterback this after the fact, but there weren't good, clear options in the moment.


This was the sixth time that exact piece had broken off and hit the vehicle. It was one roll of the dice too many. And by the way, Columbia had plenty of O2 but only 30 days worth of CO2 scrubber capacity.


Thanks for correcting on the CO2.

>This was the sixth time that exact piece had broken off and hit the vehicle.

That was part of the problem and what is meant by the foam shedding being "in family." They had witnessed it enough (along with foam shedding elsewhere) without consequence that it wasn't really considered a credible risk. Until it happened during a period of the launch where the delta-v made it a different scenario, with an energy that gave the foam unexpected characteristics.

This is second hand knowledge, but the person I know involved in the testing said they had a really hard time recreating the damage to the tiles with the specs they were given after Columbia. By chance, they decided to turn the gun up beyond the quoted specs (as I understand it), and all of a sudden the foam acted like a hard chunk of debris. I think it's hard for people to grasp how much is unknown, even after a disaster. Often, it's only in hindsight that it seems obvious.


This could also be related to a broader tendency to promote 'performers' who are more likely to take risks or shortcuts (which they may not even realize involve risks), as well as people who use fewer resources (lower safety margins, fewer overlapping checks, etc.).

It's sadly difficult to be recognized for excellence in preventing surprises, as hard as it is to quantify that.


I think this is absolutely part of the issue. Having previously worked in the industry, people who bring up concerns are sometimes viewed as pariahs who are slowing down work. Because so many of the concerns involve low-probability events, it's possible for someone to make a career rolling the dice without being cognizant of (or open about) the risks. When bad things do happen (thankfully, major catastrophes are still relatively rare), it's hard for people to openly recognize the mitigations that could have prevented it because they think instituting them on future projects will just slow things down. It creates a culture of "the ends justify the means" where bad judgement and integrity violations are considered ok as long as the project/program was completed.


One factor is that bringing bad news may reflect poorly on the organization, and therefore on the person's career.


It takes some intestinal fortitude to be in a role that is tasked with communicating information people don't want to hear. It's part of the reason NASA created its "Safety and Mission Assurance" organization after this incident and gave it a completely different chain of command. In theory, that mitigates some of the career threat, but in practice it may be different.


I tried to read the PowerPoint and it was not an easy task. Finding the main point of a slide is not supposed to be like solving a riddle of fonts and words.

A quick and dirty re-writing of the title (and slide):

_______________

Review of Test data indicates incident is well outside of safety margins.

- Volume of ramp is 1920 cu in vs 3 cu in for test

- Once tile is penetrated SOFI can cause significant damage.

- Flight condition significantly outside of test database

_______________

Now that should get a reader's attention.


True, there are better ways to word it. That does leave out some detail that the original had, like the test velocity, and it doesn't show the thought process as well (like a formula on a math slide or a deductive argument in philosophy).

My main point is that the title claims this slide is what killed 7 people and basically blames its creator, while leaving out all the other failures. Slide formatting and wording (which ignores the actual discussion that should have gone with it) is really inconsequential compared to the rest of the process in a briefing.


The slide being such a mess makes me think that the speaker's arguments may not have pointed to the danger so clearly either. This makes me look favorably on the title even though it is hyperbole.

Granted this is pure speculation on my part and should be treated accordingly.


Hell, it's probably even fair to make an editorial conclusion at the end -

"Given this, we strongly recommend against launch"


The damage happened during launch. Unless you're talking about how the insulation was old and NASA knew that insulation could strike, and had in the past struck, shuttles.


>The title is saying we need to be careful because a tile was likely penetrated.

To the point of the article, I think this is the wrong takeaway, meaning the slides were not communicating effectively.

The second point illustrates this. It says the models were overpredicting the penetration, meaning the models were conservative and the actual penetration was likely less than what the models showed. They were setting the table for an optimistic outlook.

The real issue, IMO, is highlighted later in the article where there isn't sufficient fidelity in the tests to back up those claims. Tests after the incident showed the foam acted very differently at the delta-v that actually occurred.

And regarding your point about blaming contractors, the vast majority of work done by NASA is done by contractors. NASA is, to some extent, a pass-through organization that funds other organizations like Boeing, Lockheed, Honeywell, Jacobs, etc.

>If you thought they didn't know, then ask them what they do know!

This gets to the same cognitive biases that led to Challenger, EVA 23, and a host of smaller incidents nobody hears about. Data is not objectively weighed in these situations because of schedule pressure, optimism bias, etc. In this case, most launches were showing foam shedding with no issue, so it led to a false belief that it wasn't dangerous even though it was out-of-spec. Add to that a slide that says the models are too conservative and you can see where cognitive biases may influence the decision. Lastly, most people like to think they're self-aware enough to identify these biases in real time, but they aren't. It's also why the incident led to a separate organization within NASA focused on safety, quality, and risk that has a segregated chain of command.


I came in knowing the outcome, and roughly the point being made, but still found the conclusion of the slide hard to suss out. There were other failures in the chain as well for sure, but I don't think this is just a hit job on NASA vendors.


The title of this post is claiming this slide killed 7 people. That's a pretty bold and accusatory claim that seems to leave out the other failures, right?


From the article (emphasis mine):

> This, however, is the story of a PowerPoint slide that actually _helped_ kill seven people.


Any evidence to support that claim? NASA employees raised concerns about the severity of the damage, which shows the contents of the slide were effectively communicated to NASA engineers, but that leaders ignored them. Thus the slide was not a contributor.

https://www.dailymail.co.uk/news/article-2271525/It-better-d...


Pretty bad title for a blog post that talks about misleading PowerPoint presentations. Some readers might draw conclusions from the title alone, instead of reading the whole text below.


Yep; this slide was worse than useless, in that to the given audience it could instead read as an endorsement that launching was fine.


Please read the article so you don't come across as ill informed. This was not a launch/no launch decision.


This wasn't about a launch decision. An audience of engineers would not view this slide as an endorsement.


No, even for an engineering slide this is appalling.

I'm not sure how many people are familiar with the term "conservatism" as used here. I'm not. Some might be aware; those who are not will just skip over it.

I read this slide a couple of times. There's no order of thought, no connection between the topics (even if we assume people are familiar with the subject), and there are several typos.

It is not PowerPoint's fault; it is the fault of whoever wrote this.

This is an issue with information hierarchy. If this is a risk (and I can't imagine what a bigger risk on that mission might have been), it needs to be brought to attention, not added as line 4 of slide 7 and left at that.


When engineers are not allowed to use "alarmist language", the organization often shifts to producing unreadable text like this.

I don't expect "foam strike more than 600 times bigger than test data" would go over well politically. You'd be telling the audience that these people will die while everybody watches. No manager wants to be the messenger for that kind of message.


> When engineers are not allowed to use "alarmist language"

This has been a recurring theme throughout my career, the struggle between being seen as "alarmist" and accurately conveying urgency.

We get this in the medical field all of the time. "Outcomes delivered via mechanism seen as potentially contraindicated" or some other spaghetti. I've been in these meetings many times, where we (the individual contributors) have to tell the bosses or peers or a partner group about something that might be bad. As in Y2K-style bad, where it will be bad if we don't address it, but if we do address it with the urgency needed, no one will be able to recognize the success for what it is.

As you said, no one's manager wants to be seen as crying wolf all of the time, but post hoc there's the expectation that a couple of engineers way out at the end of a limb of the tree shouldn't have just waited for the limb to be sawed off behind them; they should have taken out the saw and done it themselves. That they should have stood up in the presentation and yelled "you're all idiots! This is going to kill the entire crew! Everyone will die, don't you see?! And I'm not standing for it!" just before ripping the badge off their lanyard or belt hook and righteously storming out.


Yep, a lot of organizations have a cultural taboo against telling the bosses bad news. The way one manager taught me: if it's good news, use plain language and big text. If it's bad news, soften it, shrink it, slather it with jargon, obfuscate it. Don't hide it, but don't raise an alarm. Couch any unpleasant language with possible-this and unknown-that. Terrible advice, but that's the way a lot of exec readouts work.


Maybe it's survivorship bias and I'm only seeing in-fact-rare good examples, but organizational communication seemed so much better back in the 1940s to 1960s.

And other communication, for that matter. Instructional videos from that era make "pro" YouTube look like amateur hour, let alone modern material produced by industry and government, which is even worse.


> How the F do you work at NASA and not understand the basic principles of mass, velocity, and energy well enough for that to stand out

Have you ever worked for a business with “management”?


You're being downvoted because the entire process that led to this decision has been massively analyzed and the root causes were determined.


And was it determined that this slide killed 7 people, as claimed in the title? That seems overly dramatic and ignores the other root causes.


No, it wasn't this slide.


In this case the vendor engineers were from Boeing - ya know, the same company that brought us the 737 MAX MCAS fiasco. I mean, sure, hindsight is a wonderful thing, but given the history of Boeing's engineering culture since the merger with McDonnell Douglas, I can see the possibility of someone gliding over something inconvenient.

(not the one that downvoted your comment btw)



