Transformers are exciting because they seem to work across all modalities, including vision [1]. It makes me wonder whether the transformer module captures some essence of the minicolumn structure found all over the neocortex that Jeff Hawkins raves about, citing Vernon Mountcastle. Hawkins et al. talk about grid cells and location these days; maybe attention over context is the generalization of that notion.
The hierarchical transformer variants are uncovering some possible optimizations that resemble the ideas of Thousand Brains - https://arxiv.org/abs/2110.13711
Attention mechanisms, in conjunction with autoencoding, roughly approximate what grid cells accomplish, but transformers are still a feedforward architecture. Thanks to Moore's law, we can keep scaling up inputs to reach human-like performance, but until someone untangles the structure and devises a way of including recurrence, transformers won't be able to perform all of the functions Hawkins describes.
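To make the "feedforward" point concrete, here's a minimal sketch of single-head scaled dot-product self-attention in plain NumPy (the weight matrices are made-up placeholders, not from any particular model). The whole block is a pure function of its input: nothing carries over from one call to the next.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, W_q, W_k, W_v):
        # Single-head scaled dot-product attention over a (seq_len, d_model) input.
        # A pure function of its inputs: no state survives the call, which is
        # the sense in which a transformer block is feedforward.
        Q, K, V = x @ W_q, x @ W_k, x @ W_v      # project to queries/keys/values
        scores = Q @ K.T / np.sqrt(Q.shape[-1])  # all pairwise interactions in one matmul
        return softmax(scores) @ V               # context-weighted mix of values

    rng = np.random.default_rng(0)
    d_model, seq_len = 16, 8
    x = rng.normal(size=(seq_len, d_model))
    W_q, W_k, W_v = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))
    print(self_attention(x, W_q, W_k, W_v).shape)  # (8, 16) -- same x in, same out

Stacking these blocks adds depth, not state: run the same x through twice and you get the same output twice.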
There are interesting LSTM variations on transformers, but nothing public yet that really performs at the level of the straight feedforward models. Combinatorial explosion is a bitch, and LSTMs blow up the size and compute requirements. Hierarchical structures could constrain the requirements to something achievable.
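A rough illustration of where the cost comes from (a sketch, not any published hybrid): the attention above computes every token interaction in one batched matmul, while an LSTM has to walk the sequence one step at a time, because each state depends on the previous one.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def lstm_cell(x_t, h, c, W, U, b):
        # One LSTM step: gates depend on the current input AND the previous state,
        # so step t cannot start until step t-1 has finished.
        z = x_t @ W + h @ U + b                       # stacked gate pre-activations, shape (4*d,)
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # new cell state built from the old one
        h = sigmoid(o) * np.tanh(c)
        return h, c

    rng = np.random.default_rng(0)
    d, seq_len = 16, 8
    W, U = (0.1 * rng.normal(size=(d, 4 * d)) for _ in range(2))
    b = np.zeros(4 * d)
    x = rng.normal(size=(seq_len, d))
    h = c = np.zeros(d)
    for x_t in x:                 # strictly sequential: no parallelism over time steps
        h, c = lstm_cell(x_t, h, c, W, U, b)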
With recurrence, you can begin to train models to perform things like discrete mathematics, as opposed to the relatively shallow semantic graphs in GPT-3-like models. Today's models don't have anything stateful that could be called memory, but with recurrence, model states become dynamic encodings that can be processed over many cycles.
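A sketch of what "processed over many cycles" could look like, using a bare Elman-style recurrence (purely illustrative, not any deployed architecture): the hidden state h persists across repeated passes over the input, so later cycles operate on an encoding of everything that came before.

    import numpy as np

    rng = np.random.default_rng(0)
    d, seq_len = 16, 8
    W_in, W_h = (0.1 * rng.normal(size=(d, d)) for _ in range(2))
    x = rng.normal(size=(seq_len, d))

    h = np.zeros(d)                           # persistent state: the closest thing to memory here
    for cycle in range(5):                    # revisit the same input many times
        for x_t in x:
            h = np.tanh(x_t @ W_in + h @ W_h) # h becomes a dynamic encoding of history
    # h now reflects everything seen across all five cycles; a stateless
    # GPT-style forward pass has no analogue of it.
    print(h[:4])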
[1] https://arxiv.org/abs/2105.15203