I can't find references to HMM-based large language models. Small HMM language models generate gibberish very similar to this.
An HMM consists of a state space, a state transition matrix, and an output (emission) probability matrix. A token vocabulary of 50k and a state space of something like 60k would have seemed impossible 10-20 years ago; it has only recently become viable.
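To make the scale concrete, here is a back-of-the-envelope sketch of the dense parameter arrays such a model would need. The 50k/60k figures are the ones above; storing everything as float32 is my assumption:

```python
# Illustrative sizes from the comment above; float32 storage is an assumption.
n_states = 60_000        # hidden state space
n_tokens = 50_000        # token vocabulary
bytes_per_param = 4      # float32

# An HMM language model is fully specified by three dense arrays:
#   pi: initial state distribution,              shape (n_states,)
#   A:  state transition probabilities,          shape (n_states, n_states)
#   B:  per-state token emission probabilities,  shape (n_states, n_tokens)
size_A = n_states * n_states * bytes_per_param
size_B = n_states * n_tokens * bytes_per_param

print(f"transition matrix A: {size_A / 1e9:.1f} GB")  # 14.4 GB
print(f"emission matrix B:   {size_B / 1e9:.1f} GB")  # 12.0 GB
```

Holding (let alone re-estimating) dense matrices of that size was out of reach for commodity hardware until fairly recently, which is the viability point.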
Training one with Baum-Welch on a large enough text data set would be interesting. It should be much faster than back-propagation through a transformer model.
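For reference, here is a minimal single-sequence Baum-Welch (EM) re-estimation loop in plain NumPy on a toy vocabulary. It is only a sketch of the algorithm, not anything sized for a 60k-state model, which would need batched sequences, sparse or blocked matrices, and a GPU-friendly implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def baum_welch(obs, n_states, n_symbols, n_iter=20):
    """Single-sequence Baum-Welch (EM) for a discrete-emission HMM.

    obs: 1-D integer array of token ids.
    Returns (pi, A, B): initial, transition, and emission probabilities.
    """
    T = len(obs)
    # Random row-stochastic initialisation.
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)

    for _ in range(n_iter):
        # E-step: scaled forward pass.
        alpha = np.zeros((T, n_states))
        scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]

        # E-step: scaled backward pass.
        beta = np.zeros((T, n_states))
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
            beta[t] /= scale[t + 1]

        # State posteriors gamma[t, i] and pairwise posteriors xi[t, i, j].
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :])
        xi /= xi.sum(axis=(1, 2), keepdims=True)

        # M-step: re-estimate pi, A, B from the posteriors.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        np.add.at(B.T, obs, gamma)   # B[j, k] += gamma[t, j] wherever obs[t] == k
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Toy run: 2 hidden states, 3-token vocabulary, 200 random tokens.
obs = rng.integers(0, 3, size=200)
pi, A, B = baum_welch(obs, n_states=2, n_symbols=3)
print(A.round(3))
print(B.round(3))
```

Each EM iteration is just a handful of dense matrix products over the sequence, which is where the "should be faster than back-propagation" intuition comes from; whether that holds at 60k states is an open question.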