Hacker News
Using GPT-3 for plain language incident root cause from logs (zebrium.com)
106 points by stochastimus on Jan 12, 2021 | 34 comments


It's nice to get the textual description, but pretty much every specific detail of the extended explanation teased out at the end is more or less incorrect, yet sounds very believable. In essence, what happened at the end was that GPT-3 was asked to write an OOM-killer-inspired story. I think this should be a cautionary tale against trying to use GPT-3 to provide commentary beyond a high-level summary.

This isn't a slight against the short-summary technique, which seems very cool.

Details: oom_adj isn't a flag; it's an int that can disable the OOM killer on a per-process-leader basis, but it can also be used to reduce the "badness" of a process when the kernel is considering what to kill. oom_adj is also deprecated and has been replaced by oom_score_adj. The OOM algorithm isn't called RSS; it doesn't seem to have been explicitly named, but the function that performs the key calculation is oom_badness. This function assigns an integer "badness" to each process. A process's resident set size is an important input to badness, but badness also depends on several other factors (which ones depend on kernel version, but they include the adjustment parameter). RSS is not part of the OOM calculation "by default" -- it's always included unless OOM is disabled entirely. RSS isn't a comparison of reserved physical memory against current virtual size; it's just the amount of RAM a process currently occupies (i.e. not in swap or on disk). The OOM killer doesn't compare RSS against virtual size, RSS doesn't trigger the OOM killer, and RSS isn't an algorithm.
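
For anyone who wants to poke at this directly, here's a minimal Python sketch (Linux-only; these procfs paths are standard) that reads the kernel's actual OOM inputs for a process -- its current badness score, its oom_score_adj, and its RSS:

    # Read the kernel's OOM-related values for a process via procfs.
    import sys

    def oom_info(pid):
        base = "/proc/%s" % pid
        # oom_score is the kernel's current "badness" for this process
        with open(base + "/oom_score") as f:
            score = int(f.read())
        # oom_score_adj (-1000..1000) is the non-deprecated adjustment knob
        with open(base + "/oom_score_adj") as f:
            adj = int(f.read())
        # VmRSS in /proc/<pid>/status is the resident set size, in kB
        rss_kb = None
        with open(base + "/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    rss_kb = int(line.split()[1])
                    break
        return score, adj, rss_kb

    if __name__ == "__main__":
        pid = sys.argv[1] if len(sys.argv) > 1 else "self"
        print(oom_info(pid))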

Another interesting aspect of this, of course, is that GPT-3 likely wasn't trained on any specific kernel version, but on a large number of versions depending on which part of the Internet it happened to be reading. This means that it probably can't give a good account of any single version of fast-changing parts of the kernel like the OOM killer.

Source: https://github.com/torvalds/linux/blob/master/mm/oom_kill.c


Yeah, that’s a fair way to look at it. Thanks for that. It also matches my experience with giving it too much unrelated input data: even the first line of the summary will often be completely fictitious in that case.


This is pretty cool! However, these two samples are very simple to solve. I'd love an "AI" to find root causes for problems that are not obvious. Just throw the whole log collection at it and let it solve all the issues. One can dream ;)


Yeah, my experience so far has been that if I just pile a bunch of logs in there without first picking the salient lines, the language model tends to either rat-hole on some irrelevant detail or construct a non-factual narrative. But when Zebrium picked these lines autonomously and GPT-3 summarized them, we got a meaningful summary. Having said that, I'd also like to make GPT-3 more and more savvy with larger and larger log-based prompts, and I'm hoping that's possible with some tweaks to the prompt and some fine-tuning. I'll keep the blog posted as we do more experiments.
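
For the curious, the experiment is roughly this shape (the engine name, sampling settings, and log lines here are illustrative stand-ins, not our exact production setup):

    import openai

    openai.api_key = "YOUR_KEY"  # placeholder

    # The salient lines are pre-selected upstream -- these are made-up
    # examples, not real log lines.
    incident_lines = [
        "example incident line 1",
        "example incident line 2",
    ]

    prompt = (
        "The following log lines describe an incident:\n"
        + "\n".join(incident_lines)
        + "\n\nAn expert summarized the incident in plain English:"
    )

    resp = openai.Completion.create(
        engine="davinci",   # engine choice is an assumption
        prompt=prompt,
        max_tokens=60,
        temperature=0.3,    # keep it low to discourage storytelling
        stop=["\n\n"],
    )
    print(resp["choices"][0]["text"].strip())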


This is really cool!


GPT-3-stackoverflow


But seriously, was Stack Overflow part of the training data used for GPT-3? It would definitely be an interesting fine-tuning experiment.


From what I've read, the answer is "yes", Stack Overflow was crawled. EDIT: I looked, and stackoverflow.com is included in the Common Crawl dataset, which is one of the datasets GPT-3 was trained on. Having said that, it's not clear to me how thorough the crawl's coverage of that domain is... it looks pretty comprehensive, though. http://index.commoncrawl.org/CC-MAIN-2020-24-index?url=*.sta...


The training data was so large that I don't think even a single full epoch was used. I.e., it's not guaranteed that GPT-3 has seen any specific site just because it's in the data.

But, given the nature of random sampling, it's likely that it saw at least a little bit of SO.


That's cool and all, but I'm pretty sure what we really want to see is

"The expert described what had happened, in the form of a Haiku:"


I just tried this and I might stick with these settings! ;-) For the postgresql example in the blog, I used your prompt. Here's what I got:

The logs were in a mess, But the expert could see, That the database was in distress.


I'm kind of surprised GPT-3 doesn't "understand" haiku. You'd think it could extrapolate the rules?

The logs are broken!, Sysadmin sweeps up the leaves, The database cried


The encoding used by GPT-2 and GPT-3 greatly obscures many of the textual properties of words. This at least partly accounts for why it has so much trouble with meter, rhyme, syllables, and some math.

More info: https://www.gwern.net/GPT-3#bpes
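
You can see the effect yourself with the GPT-2 tokenizer (GPT-3 uses the same BPE vocabulary); a quick sketch using the `transformers` package:

    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    for word in ["haiku", "syllable", "refrigerator"]:
        # BPE splits words into sub-word pieces that don't line up with
        # syllables, so the model never directly "sees" syllable counts.
        print(word, "->", tok.tokenize(word))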


Thanks for putting that info here!


Hahaha, thanks for trying it out!

It's honestly mind-boggling that we now have tools that even make something like this possible.

(Even though GPT-3 clearly has to read more poetry.)


So it didn't figure out the syllable counts and it mistakenly thought the haiku should rhyme. But it's amazing that we have come to a state where the critiques are so minor.


So what do you do when GPT generates nonsense? Because it sometimes will, at least during my experiments, create something that is irrelevant or just plain wrong and would require human intervention. In other words, what is an acceptable failure rate for these summaries you generate?


Great question. I guess for me as a technologist, I’m coming at it from the other side and asking what failure rate is possible, and can that rate be reduced with further prompt fiddling and other machinations? Then I will try to gauge whether that rate is acceptable to users. I guess since it’s all so new, people wouldn’t know what would be useful or usable until they tried it. So we are looking for users to participate and help us find if there’s a viable product involving GPT-3 summaries.


Makes sense. For our use case, "incorrect" GPT output isn't a viable option, so unfortunately it looks like the tech isn't quite there yet: we would either need to manually verify the output or get access to the model to train it on our own data, and I'm not sure the latter would improve the results.


Here’s a question for you - for a typical L2 support person or front line MSP NOC monitor, or junior SRE, or similar - what do you think would be the error rate? I wonder if there isn’t sort of a front-end role here that can be subsumed at least in part by automatic summarization, for the purpose of streamlining the escalation and routing process overall, even though errors would occur - since I imagine they already do. What do you think?


I've only worked at (very) small companies, so I've never had to think about different tiers of support. The automatic summarization may be useful for better determining which specialist to route the request to, and a wrong classification will get picked up by a human regardless and then re-routed appropriately. I imagine the error rate is already pretty high, so anything to automate/reduce that can go a long way.

I work in education tech where our product goes directly to students/teachers, so tolerance for "incorrect" answers generated by AI is much lower, or nonexistent.


There are classes of failures that are so difficult to diagnose that an automated tool with a 5% success rate is still super useful because of the time savings (hence the reason for log analysis tools). That being said, usually the log content is insufficient for these classes of problems, where context around the symptom is just as important and the logs can sometimes be useless.


Super interesting. I wonder what other latent domain-specific intelligence GPT-3 picked up during training that's accessible with just text in and text out. Like a flash card generator?


Polar (https://getpolarized.io/) has a GPT-3 based flash card generator from text highlights. It's available to premium subscribers.


Hmm, I like this direction - so maybe, as the user is navigating the incident, let them steer the model with questions and/or additional lines. Is that sort of what you'd envision?


This is interesting - I worked on a similar use case: parsing and tokenizing ZooKeeper logs, converting them to integer sequences, and training on those sequences to predict whether a service was going to experience a fault, and thus what the cause of the fault was or would be. It wasn't too successful, but it definitely showed me how difficult it can be to work backwards from logs to root cause, especially with limited data.
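
In case it's useful to anyone, the encoding step looked roughly like this (a toy reconstruction; the names, regexes, and example lines are illustrative, not my original code):

    import re
    from collections import defaultdict

    # Each unique log "template" gets the next free integer id.
    template_ids = defaultdict(lambda: len(template_ids))

    def to_template(line):
        # Strip volatile fields so similar lines collapse to one template.
        line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
        line = re.sub(r"\d+", "<NUM>", line)
        return line

    def encode(lines):
        return [template_ids[to_template(l)] for l in lines]

    # Stand-in lines, not real ZooKeeper output:
    seq = encode(["session 0x1a established", "ping from client 42",
                  "session 0x1a expired"])
    print(seq)  # e.g. [0, 1, 2] -- this sequence then feeds a classifier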


I'm fairly bearish on GPT-3, but this is actually a pretty cool application.


Care to expand on your generally bearish stance?

I've seen a lot of cool / practical uses for GPT-3.


One thing that bothers me is how it completely automates something like the process of citogenesis:

https://xkcd.com/978/

It was trained on the internet, a gigantic repository of half-truths and misinformation, and it has no actual understanding of truth or falsehood.

Ask it for things and you get a plausible-sounding summary, good enough that most people will not fact-check the gibberish it's spewing.

Especially not when you glorify the big pile of automated statistical analysis with the title "AI".

There are plenty of other issues with it, but that'll do for a start.


None of the uses I've thought were cool involve trying to find "truth", though.

For example, just the other day I needed to make a system that could take user-entered country names and convert them to standardized ISO country codes. With GPT-3 I made something with fantastic accuracy and error-correcting ability in about two minutes.
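
The whole thing is basically a few-shot prompt. A sketch of the idea (the examples and settings are a plausible reconstruction, not my exact prompt):

    import openai  # assumes openai.api_key is already set

    prompt = "\n".join([
        "Convert the user-entered country to its ISO 3166-1 alpha-2 code.",
        "",
        "Country: United States of America",
        "Code: US",
        "Country: Deutschland",
        "Code: DE",
        "Country: teh Netherlands",
        "Code: NL",
        "Country: Brasil",
        "Code: BR",
        "Country: %s",
        "Code:",
    ]) % "Untied Kingdom"  # typo on purpose

    resp = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=2,
        temperature=0.0,  # deterministic for a lookup-style task
        stop=["\n"],
    )
    print(resp["choices"][0]["text"].strip())  # ideally: GB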


Yeah, I'm really hoping there's a way to reliably corral it into basic plain-language restatement of the logs themselves without it going too far afield or speculating. We will see.


Is there a reason I'd use this approach over a process mining / log mining system? I feel like it needs me to guess the right question to get an answer.


Well, I've been trying really hard not to point it out because I don't want this to be like a commercial. :) But the idea here is that the Zebrium ML picks the incident lines unsupervised; then the GPT-3 model creates the summary unsupervised. That combination is what we've been working on in a private beta, so the user can get the best of both worlds.


Gotcha. I had understood it to be purely GPT-3 somehow, rather than as a second step.



