The issue with transformers is context length. Compute-wise, we can handle long context windows: forming the attention matrix and doing the calculations is feasible, if expensive. The real issue is training. The weights specialize to contexts of roughly the size seen during training, and as far as I know there's no surefire way to overcome that. In theory, if you were okay with the quadratic blow-up in attention cost (and had a good long-context dataset, which is another problem), you could spend the money and train at much longer context lengths. For a full project I think you'd need millions of tokens.
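To make the quadratic cost concrete, here's a minimal numpy sketch of scaled dot-product attention (toy dimensions, my own assumed sizes, not from any particular model): the score matrix is n-by-n, so memory and compute grow as O(n^2) in the context length n.

```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (n, d) arrays for a single head (hypothetical shapes).
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)   # (n, n) score matrix -- the O(n^2) term
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V              # (n, d) output

n, d = 1024, 64                     # assumed toy sizes; real models are larger
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)                    # the score matrix alone held n*n floats
```

Doubling n quadruples the score matrix, which is why "just use a longer window" gets expensive fast even before the training-distribution problem kicks in.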

