>> Encoded into this somewhat unstructured data is the very thing NLP is after: the meaning.
The problem is that we don't know how meaning is encoded into language utterances, and we don't know how meaning is represented once it's decoded from those utterances (i.e. in our minds). It's very unlikely that these elements, the encoding process that turns meaning into language and back again, and the representation of meaning, are carried around in language utterances themselves [1]. And yet we keep trying to figure out both, the encoding process and the representation, just by looking at the encoded utterances.
Imagine having a compressed string and trying to figure out a) the compression algorithm and b) the uncompressed string, without ever having seen examples of either. That's what natural language understanding from raw, unstructured text is like.
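To make the analogy concrete, here's a minimal Python sketch (the string and the choice of zlib are just illustrative stand-ins, not a claim about how language actually works):

    import zlib

    # "meaning" stands in for the uncompressed original; "utterance" is
    # the only thing an outside observer ever gets to see.
    meaning = b"the cat sat on the mat"
    utterance = zlib.compress(meaning)

    print(utterance)
    # => opaque bytes; neither the algorithm nor the original is
    #    visible in them

    # Recovery is trivial *if* you already know the procedure:
    print(zlib.decompress(utterance))
    # => b'the cat sat on the mat'

An observer handed only `utterance`, with no examples of originals and no knowledge of the compressor, is in roughly the position described above.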
______________
[1] Edit: Is that even possible? Is it possible to send an encoded message including its own encoding procedure, so that the message can be decoded even when the procedure is not known beforehand? Wouldn't that require that the procedure is somehow possible to decode independently of the message? Is there another way?
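One way to picture the question: a message can ship its own decoder as source code, but interpreting that source already presupposes a shared convention (here, a Python interpreter), so the procedure is never decodable from the message alone. A toy sketch, with ROT13 as a made-up stand-in for the encoding:

    # The message bundles its payload together with its own decoding
    # procedure, written out as Python source.
    message = {
        "payload": "Uryyb, jbeyq!",
        "decoder": (
            "lambda s: ''.join("
            "chr((ord(c) - 97 + 13) % 26 + 97) if c.islower() else "
            "chr((ord(c) - 65 + 13) % 26 + 65) if c.isupper() else c "
            "for c in s)"
        ),
    }

    # This only works because sender and receiver already agree that
    # the "decoder" string is Python source: that prior agreement is
    # exactly what the message cannot carry for itself.
    decode = eval(message["decoder"])
    print(decode(message["payload"]))  # => Hello, world!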
>> we keep trying to figure out both, the encoding process and the representation, just by looking at the encoded utterances
I think what we're trying to do, or what I'm trying to do at least, is to find a model that would produce the same interpretations of an utterance as a human would. I don't see why we couldn't find such a model pretty soon, given the vast amounts of data out there, however unstructured they might be.
I don't think what's stopping us from modelling meaning is the lack of structure in textual data. I think it's the fact that text is not meaning. Text somehow encodes meaning, but we don't know how, and we don't know what a "unit of meaning" is supposed to look like.
When you say that we have "vast amounts of data" you mean that we have vast amounts of text, but by modelling text we will not model meaning; we will only model text. We have never observed "meaning", and we have no examples of meaning turning into text and back again.
If I may be allowed the simile, training on text to model meaning is a bit like studying a screen that a person is hiding behind and hoping to learn something about the person from the screen itself.