
Classic overfitting

It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5



How is this overfitting, rather than a data quality / classification issue?


If the model were able to generalise, you’d expect it to output something like “[silence]” or “…” in response to silence.

Instead, it reverted to what it had seen before (in the training data), hence the overfitting.


Right, maybe my definition of overfitting was wrong. I always understood it more as optimizing for a specific benchmark / use case and then starting to fail in other areas.

But the way you phrase it, it’s just “the model is not properly able to generalize”, i.e. it doesn’t understand the concept of silence, which also makes sense.

But couldn’t you then argue that any type of mistake / unknown could be explained as “overfitting”? Where do you draw the line?


I don't think so. Overfitting = the model was too closely aligned to the training data and can't generalize towards *unseen* data. I think it saw "silence" before, so it's not overfitting but just garbage in, garbage out.


Yours is one definition, but the one the OP is using is overfitting to the training data.


That’s exactly my point: by that definition any incorrect answer can be explained by “overfitting to training data”.

Where do you draw the line between “overfitting to training data” and “incorrect data”?


> That’s exactly my point: by that definition any incorrect answer can be explained by “overfitting to training data”.

Not really: getting 94381294*123=... wrong, but close to the actual answer, cannot be overfitting, since that calculation wasn't in the training data.


> [By] that definition any incorrect answer can be explained by “overfitting to training data”.

No, it can't; for instance, some errors would be caused by underfitting. The data could also be correct, but your hyperparameters (such as the learning rate or dropout rate) could cause your model to overfit.

> Where do you draw the line between “overfitting to training data” and “incorrect data”?

There's no need to draw a line between two explanations that aren't mutually exclusive. They can (as in this case) both be true. Overfitting is the symptom; dirty data is the cause.


I think it's a classification issue.

Silence is never put in the subtitles of a film, since it isn't necessary. The viewers can tell that nothing is being said if there are actors on the screen. And in situations where there are no actors, there will be a subtitle to indicate what is going on, like "[rock music plays]".

Subtitle authors use this silence to fit in meta information and have done so since the closed captions era.

Proper data cleaning would strip this metadata from any subtitle sources. Since this wasn't done, this is fundamentally a classification issue. It may also be an overfitting issue, but that is secondary to the classification problem.
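
A minimal sketch of what such a cleaning pass might look like (the patterns, function name, and example cues here are illustrative assumptions, not taken from any real pipeline):

    import re

    # Hypothetical cleanup pass over subtitle cue text: drop cues that are
    # bracketed sound descriptions, translator credits, or copyright notices
    # rather than actual speech. Patterns and names are illustrative only.
    META_PATTERNS = [
        r"^\[.*\]$",            # e.g. "[rock music plays]", "[silence]"
        r"subtitles?\s+by",     # "Subtitles by ..."
        r"translated\s+by",     # "Translated by ..."
        r"©|copyright",         # copyright / credit notices
    ]
    META_RE = re.compile("|".join(META_PATTERNS), re.IGNORECASE)

    def strip_meta_cues(cues):
        """Keep only cues that look like speech."""
        return [c for c in cues if not META_RE.search(c.strip())]

    print(strip_meta_cues([
        "Hello there.",
        "[rock music plays]",
        "Subtitles by the Amara.org community",
    ]))
    # -> ['Hello there.']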


I think it's a data quality problem first, which might lead to a sort of overfitting as a consequence.

How would the AI know that a series of zero-amplitude audio samples should generate the string "[silence]"?

It can only know that if the vast majority of silent audio segments in the training set are consistently labelled with that string. But that doesn't seem to be the case: silence is either not labelled at all, labelled with all kinds of different markers, or labelled with unrelated things, like copyright credits.

So even if the model successfully learns a generalized representation of the concept of "silence", it's not clear at all which of all the different labels it should use for that concept.

So what might happen is that the model then starts to overfit on the tiny variations of the individual silence segments, in a desperate attempt to devise some kind of system behind all the different “silence” labels - which will of course go wrong spectacularly, as no such system exists. (Or if it does, it's entirely accidental and not something that should be learned.)


It's actually because it is incapable of recognising when it does not know the answer. It will give you the nearest match, even if that is completely incorrect.


The Arabic text is the translator's self-credit:

"Translated by Nancy Qanfar"


I know it’s off topic, but it reminded me that translators like to put in Easter eggs, or at least they used to: https://learn.microsoft.com/en-us/archive/blogs/ericfitz/i-a...


And the German text is “subtitles of [public broadcaster] for [content network], 2017”.

I'm not sure this is really overfitting; the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.


> I'm not sure this is really overfitting; the network does exactly what the training data demands.

What do you think overfitting is, if not that?


Overfitting would be replicating overly specific details, like if a specific pattern of silence (or quiet noise) matched to a specific copyright notice.

But in this case the behavior seems to generalize over multiple languages, with the model choosing representative "outro silence" captions depending on the language. Which is consistent with the training data showing that outro silence is captioned.

If the model was generalizing perfectly it would show something like "[subtitle credits here]" but that'd be demanding a bit much.

Transcribing outro silence as silence, despite the training data consistently transcribing outro silence differently from regular silence, would be underfitting.


The optimizer is functioning correctly, and the pattern really exists in the training data. But consider:

- This behavior damages the model's performance on out of sample data; every word you predict during silence increases the transcript's Word Error Rate.

- These translation credits are an artifact of our training data, and not a reflection of the process we are modeling (spoken language).

So, while you are correct about the mechanism at work here, it is still correct to call learning a spurious pattern which damages our performance "overfitting".
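
To make the WER point concrete, here is a self-contained sketch of a word-level WER computation; the transcripts and the credit line are made-up examples:

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / number of reference
        words, computed with a standard word-level edit distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution or match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    reference = "thanks for watching"   # what was actually spoken before the silence
    clean = "thanks for watching"
    hallucinated = "thanks for watching subtitles by the amara org community"

    print(word_error_rate(reference, clean))         # 0.0
    print(word_error_rate(reference, hallucinated))  # 2.0 -- every word emitted
                                                     # during silence is an insertion

The exact numbers aren't the point; anything emitted over silence can only count against the transcript.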


Overfitting is achieving better and better scores on the training material and worse and worse scores on unseen tasks. More at: https://en.wikipedia.org/wiki/Overfitting#Machine_learning

This is just wrong training data.
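
A tiny toy illustration of that train-vs-unseen definition (synthetic data and arbitrary made-up parameters, nothing to do with speech models):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: a simple signal plus noise, with separate train and test points.
    x_train = np.linspace(0, 1, 12)
    x_test = np.linspace(0.03, 0.97, 50)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, x_train.size)
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.1, x_test.size)

    for degree in (1, 3, 10):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

    # As the degree grows, the training error keeps shrinking, while the error on
    # the unseen test points typically stops improving and starts to rise -- the
    # "better on the training material, worse on unseen data" pattern above.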


Fitting on noise in the training data is exactly what overfitting is. Underfitting is smoothing out signal.


Overfitting implies a failure to properly generalize from the training data. Here the model generalized it correctly. Garbage in, garbage out.


No, because there would have been instances in the data where silence was labelled correctly. But the model couldn't handle the null case, so it overfit on the outros. More generally, it fit on the random error in the labels of the null feature, which is what overfitting is.


Exactly. Underfitting would be if the model doesn't pick up on the fact that outro silence is labeled differently from regular silence and transcribes them the same.


That's literally what overfitting means.

Side note: it's also yet more evidence that AI companies hoover up all data with no regard for legality or copyright status, the very same offences that have landed other people in jail or saddled them with heavy fines.


Isn't overfitting just when the model picks up on an unintended pattern in the training data? Isn't that precisely what this is?


Not necessarily, no. If 60% of the silence examples carry the hallucinated text as their label, the model just learns that connection, which we then perceive as wrong.


Which ... would be overfitting. It picks up on a pattern in the training data that we don't want it to pick up on and which causes it to generalize poorly.


How is it overfitting if the data is garbage in the first place? Saying it's overfitting in this context has no meaning, as there is no alternative that maximizes the utility function we're training for.


It is a data quality issue which caused the model to overfit.


As I didn't see one correct definition of overfitting:

Overfitting means that the model is too closely aligned to the training data, has picked up noise, and does not generalize well to *new, unseen* data. Think of students who learn to reproduce questions and their answers for a test instead of learning concepts and transferring that knowledge to new questions that involve the same concepts.

While this sounds like overfitting, I'd just say it's garbage in, garbage out; wrong classification. The training data is shit and didn't have (enough) correct examples to learn from.



