What do you think an end-to-end solution would look like?
I missed the LayerNorm squashing bit; I skimmed that part when I read the paper. I'll re-read. From your description, I agree that's a problem. I wonder if something closer to a floating point / exponential representation would be better, if it let you avoid squashing -- e.g., the model would become a great order-of-magnitude numerist.
For me, an end-to-end solution would look like a model that can 1) read numbers in any format from the string, 2) manipulate and reason about them correctly, and 3) reliably output structured data if requested. No need for special encodings. Obviously, we're nowhere near that!
In the interim, I agree with you that a representation that explicitly encodes a mantissa and an exponent seems like a better stopgap solution. We human beings already do that to some extent -- e.g., we often think in terms of tens, hundreds, thousands, and so on, as numbers get bigger, and in terms of percents, basis points, and decimal points as they get smaller.
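To make the stopgap concrete, here's a toy sketch (my own illustration, not anything from the paper) of decomposing a number into an explicit mantissa and base-10 exponent, the way a model might be fed "1.234 times ten to the 4" instead of the raw digit string:

```python
import math

def to_sci(x: float, digits: int = 3):
    """Decompose x into (mantissa, exponent) with x ~= mantissa * 10**exponent
    and 1 <= |mantissa| < 10. Zero is handled as a special case."""
    if x == 0:
        return 0.0, 0
    exponent = math.floor(math.log10(abs(x)))
    mantissa = round(x / 10 ** exponent, digits)
    return mantissa, exponent

# Big and small numbers land in the same narrow mantissa range,
# with magnitude carried separately by the exponent.
print(to_sci(12345))   # mantissa near 1.234, exponent 4
print(to_sci(0.007))   # mantissa near 7.0, exponent -3
```

The appeal for a model is that the mantissa always lives in a small, well-conditioned range, so a squashing normalization hurts much less; all the dynamic range moves into the (integer) exponent.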
I wouldn't be surprised if large models eventually learn, implicitly, to represent numbers internally as mantissas and exponents.