Algorithmic Capabilities of Random Transformers
Ziqian Zhong, Jacob Andreas
The authors present interesting results that shed light on the inductive biases inherent in the transformer architecture. By learning only the embedding and readout layers of randomly initialised transformers, they are able to train remarkably strong models on a variety of algorithmic tasks, including modular addition and parenthesis balancing, which suggests that core elements of these algorithms are already available at initialisation (reminiscent of the “lottery ticket” hypothesis for deep neural networks).
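To make the setting concrete, the following is a minimal sketch (not the authors' code) of this training regime: a randomly initialised transformer encoder is frozen, and only the token embedding, positional encoding, and readout layers are updated on modular addition. All hyperparameters (modulus, width, depth, learning rate) are illustrative assumptions.

```python
# Sketch: train only the input/output interface of a frozen random transformer
# on modular addition, i.e. predict (a + b) mod p from the token pair [a, b].
import torch
import torch.nn as nn

p = 97                                   # modulus; vocabulary is {0, ..., p-1}
d_model, n_heads, n_layers = 128, 4, 2   # illustrative sizes

# Frozen core: a randomly initialised transformer whose internal weights are never updated.
core = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                               dropout=0.0, batch_first=True),
    num_layers=n_layers,
)
for param in core.parameters():
    param.requires_grad_(False)

# Trainable interface components.
embed = nn.Embedding(p, d_model)                      # learned token embedding
pos = nn.Parameter(torch.randn(2, d_model) * 0.02)    # learned positional encoding
readout = nn.Linear(d_model, p)                       # learned readout / unembedding

opt = torch.optim.Adam(list(embed.parameters()) + [pos] + list(readout.parameters()),
                       lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def forward(a, b):
    """Run [a, b] through the frozen core and read the answer off the last position."""
    x = embed(torch.stack([a, b], dim=1)) + pos   # (batch, 2, d_model)
    h = core(x)                                   # frozen internal computation
    return readout(h[:, -1])                      # logits over residues mod p

for step in range(2000):
    a = torch.randint(0, p, (256,))
    b = torch.randint(0, p, (256,))
    loss = loss_fn(forward(a, b), (a + b) % p)
    opt.zero_grad()
    loss.backward()
    opt.step()
```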
The authors investigate which of the embedding, positional encoding and readout layers must be learned for success, and show that the answer is task-dependent. For example, decimal addition requires only the embedding layer to be learned, while modular addition also needs learned weights in the readout layer.
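This per-task ablation amounts to toggling which interface components receive gradient updates while the core stays frozen. The helper below is an illustrative sketch continuing the code above, not the authors' implementation.

```python
import torch
import torch.nn as nn

def select_trainable(components):
    """Freeze or unfreeze each interface component and collect the trainable parameters."""
    params = []
    for component, train in components:
        tensors = list(component.parameters()) if isinstance(component, nn.Module) else [component]
        for t in tensors:
            t.requires_grad_(train)
            if train:
                params.append(t)
    return params

# e.g. modular addition: learn the embedding and readout, keep the positional encoding frozen.
opt = torch.optim.Adam(
    select_trainable([(embed, True), (pos, False), (readout, True)]), lr=1e-3)
```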
In their restricted setting, the authors find that the learned embedding layers in all cases select a low-dimensional subspace for downstream processing, and this appears adequate to reach perfect accuracy on some tasks.
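One simple way to probe such a claim (an illustrative check, not necessarily the authors' exact analysis) is to inspect the singular value spectrum of the learned embedding matrix and compute a participation-ratio style effective rank, continuing the sketch above.

```python
import torch

@torch.no_grad()
def effective_rank(weight: torch.Tensor) -> float:
    """Participation ratio of squared singular values: roughly k when k directions dominate."""
    s = torch.linalg.svdvals(weight - weight.mean(dim=0))  # centre over the vocabulary
    var = s.pow(2)
    return (var.sum() ** 2 / var.pow(2).sum()).item()

# A value much smaller than d_model is consistent with the embedding
# occupying a low-dimensional subspace of the model's latent space.
print(effective_rank(embed.weight))
```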
For more complex problems (such as general language modelling) it appears that broader use of the transformer’s latent space is required, which can only be achieved effectively by training the internal weights.
It remains to be investigated how the solutions found in this restricted setting relate to those of models in which all weights are learned, and how training dynamics arrive at those solutions.