Right, maybe my definition of overfitting was wrong. I always understood it more as optimizing for a specific benchmark or use case to the point where the model starts failing in other areas.
But the way you phrase it, that “the model is not properly able to generalize”, i.e. it doesn’t understand the concept of silence, also makes sense.
But couldn’t you then argue that any type of mistake or unknown could be explained as “overfitting”? Where do you draw the line?
I don't think so. Overfitting = the model was too closely aligned to the training data and can't generalize to *unseen* data. I think it saw "silence" before, so it's not overfitting but just garbage in, garbage out.
> [By] that definition any incorrect answer can be explained by “overfitting to training data”.
No it can't; some errors, for instance, would be caused by underfitting. The data could also be correct but your hyperparameters (such as the learning rate or dropout rate) could cause your model to overfit.
> Where do you draw the line between “overfitting to training data” and “incorrect data”?
There's no need to draw a line between two explanations that aren't mutually exclusive. They can (as in this case) both be true. Overfitting is the symptom; dirty data is the cause.
Silence is never put in the subtitles of a film, since it isn't necessary. Viewers can tell that nothing is being said if there are actors on the screen. And in situations where there are no actors, there will be a subtitle to indicate what is going on, like "[rock music plays]".
Subtitle authors use this silence to fit in meta information and have done so since the closed-captioning era.
Proper data cleaning would strip this metadata from any subtitle sources. Since that wasn't done, this is fundamentally a classification issue. It may also be an overfitting issue, but that is secondary to the classification problem.
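As a rough sketch of what that cleaning pass could look like (assuming plain subtitle text lines; the patterns and the clean_subtitle_lines helper are only illustrative, not an established tool):

```python
import re

# Rough sketch: drop bracketed cues and credit lines from subtitle text
# before using it as ASR training labels. Patterns are illustrative, not exhaustive.
META_PATTERNS = [
    re.compile(r"^\[.*\]$"),                      # [rock music plays], [silence], ...
    re.compile(r"subtitles? by", re.IGNORECASE),  # "Subtitles by ..." credit lines
    re.compile(r"(?:©|copyright)", re.IGNORECASE),
]

def clean_subtitle_lines(lines):
    """Keep only lines that look like actual speech."""
    kept = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if any(p.search(text) for p in META_PATTERNS):
            continue
        kept.append(text)
    return kept

print(clean_subtitle_lines([
    "[rock music plays]",
    "We have to leave. Now.",
    "Subtitles by XYZ Media, 2017",
]))
# -> ['We have to leave. Now.']
```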
I think it's a data quality problem first, which might lead to a sort of overfitting as a consequence.
How would the AI know that a series of zero-amplitude audio samples should generate the string "[silence]"?
It can only know that if the vast majority of silent audio segments in the training set are consistently labeled with that string. But that doesn't seem to be the case: silence is either not labeled at all, labeled with all kinds of different markers, or labeled with unrelated things like copyright credits.
So even if the model successfully learns a generalized representation of the concept of "silence", it's not clear at all which of all the different labels it should use for that concept.
So what might happen is that the model starts to overfit on the tiny variations of the individual silence segments, in a desperate attempt to devise some kind of system behind all the different "silence" labels, which will of course go wrong spectacularly because no such system exists. (Or if it does, it's entirely accidental and not something that should be learned.)
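To make that concrete, a quick audit of a corpus might look something like this toy sketch (the segments, the RMS threshold, and the labels are all made up for illustration):

```python
from collections import Counter
import numpy as np

# Toy sketch: count which transcripts co-occur with (near-)silent audio segments.
def is_silent(samples, threshold=1e-3):
    return np.sqrt(np.mean(np.square(samples))) < threshold  # RMS below threshold

segments = [
    (np.zeros(16000), "[silence]"),
    (np.zeros(16000), ""),                                      # unlabeled
    (np.random.randn(16000) * 1e-4, "Subtitled by XYZ, 2017"),  # credits over outro silence
    (np.random.randn(16000) * 0.1, "We have to go."),           # actual speech
]

labels_for_silence = Counter(text for audio, text in segments if is_silent(audio))
print(labels_for_silence)
# The model sees no single consistent target for "silence", so it latches onto
# whatever incidental cues distinguish one silent segment from another.
```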
It's actually because it is incapable of recognising when it does not know the answer. It will give you the nearest match, even if that is completely incorrect.
And the German one reads “subtitles of [public broadcaster] for [content network], 2017”.
I'm not sure this is really overfitting; the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.
Overfitting would be replicating overly specific details. Like if a specific pattern of silence (or quiet noise) matched to specific copyright notices.
But in this case the behavior seems to generalize over multiple languages, with the model choosing representative "outro silence" captions depending on the language. Which is consistent with the training data showing that outro silence is captioned.
If the model was generalizing perfectly it would show something like "[subtitle credits here]" but that'd be demanding a bit much.
Transcribing outro silence as silence despite the training data consistently transcribing outro silence differently from regular silence would be underfitting.
The optimizer is functioning correctly, and the pattern really exists in the training data. But consider:
- This behavior damages the model's performance on out-of-sample data; every word you predict during silence increases the transcript's Word Error Rate (see the WER sketch below).
- These translation credits are an artifact of our training data, and not a reflection of the process we are modeling (spoken language).
So, while you are correct about the mechanism at work here, it is still correct to call learning a spurious pattern which damages our performance "overfitting".
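To put a number on the WER point, here's a minimal sketch (the reference and hypothesis strings are invented):

```python
# Minimal sketch of word error rate (WER); the example strings are invented.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words (substitutions + insertions + deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "we have to leave now"
clean     = "we have to leave now"
credits   = "we have to leave now subtitles by xyz media"
print(wer(reference, clean))    # 0.0
print(wer(reference, credits))  # 0.8 -> four inserted words against five reference words
```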
No, because there would have been instances in the data where silence was labelled correctly. But the model couldn't handle the null case, so it overfit on the outros. More generally, it fit the random error in the labels of the null feature, which is what overfitting is.
Exactly. Underfitting would be if the model doesn't pick up on the fact that outro silence is labeled differently from regular silence and transcribes them the same.
Side note: it's also yet more evidence that AI companies hoover up all data with no regard for legality or copyright status, the very same offences that have landed other people in jail or with heavy fines.
Which ... would be overfitting. It picks up on a pattern in the training data that we don't want it to pick up on and which causes it to generalize poorly.
How is it overfitting if the data is garbage in the first place? Saying it's overfitting in this context has no meaning, as there is no alternative that would better maximize the utility function we're training for.
Since I haven't seen a single correct definition of overfitting here:
Overfitting means that the model is too closely aligned to the training data, picked up noise, and does not generalize well to *new, unseen* data. Think of students who learn to reproduce questions and their answers for a test instead of learning concepts and transferring that knowledge to new questions covering the same concepts.
While this sounds like overfitting, I'd just say it's garbage in, garbage out; wrong classification. The training data is shit and didn't have (enough) correct examples to learn from.
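For what it's worth, the classic toy illustration of that definition looks like this (synthetic data, nothing to do with subtitles specifically):

```python
import numpy as np

# Classic toy illustration of overfitting: a high-degree polynomial chases the
# noise in a small training set and does worse on unseen points than a line.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, size=10)   # noisy linear data
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
# The degree-9 fit typically has near-zero training error but a larger test error:
# it memorized the noise instead of the underlying trend.
```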
It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5