Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo
This paper puts forward an explanation for the practical observation that SGD performs well when training some architectures, such as CNNs, but poorly when training transformer-based models, whereas Adam performs well in both cases.
The authors do this by analysing the Hessian of the loss with respect to the weights. From previous work it is known that the Hessians of neural networks have a near-block-diagonal structure, with blocks corresponding to parameter groups such as individual layers. While CNNs and transformers have broadly similar overall Hessian spectra, the authors find that the spectra of the diagonal blocks are similar to one another in CNNs but differ markedly across blocks in transformers.
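As a concrete illustration of what "blockwise spectra" means here, below is a minimal PyTorch sketch that computes the eigenvalues of each diagonal Hessian block (one block per parameter tensor) for a toy model. The model, data and exact double-backward loop are illustrative assumptions, not the authors' setup; at realistic scale one would approximate the spectra, e.g. with stochastic Lanczos quadrature, rather than form the blocks explicitly.

```python
import torch
import torch.nn as nn

# Toy model and data purely for illustration (assumptions, not the paper's setup).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
x, y = torch.randn(32, 4), torch.randint(0, 3, (32,))
loss_fn = nn.CrossEntropyLoss()

def blockwise_hessian_spectra(model):
    """Eigenvalues of each diagonal Hessian block, one block per parameter tensor."""
    spectra = {}
    for name, p in model.named_parameters():
        loss = loss_fn(model(x), y)
        # First-order gradient w.r.t. this parameter block, kept in the graph.
        g = torch.autograd.grad(loss, p, create_graph=True)[0].reshape(-1)
        n = g.numel()
        H = torch.zeros(n, n)
        # Second differentiation, one row of the diagonal block at a time.
        for i in range(n):
            H[i] = torch.autograd.grad(g[i], p, retain_graph=True)[0].reshape(-1)
        spectra[name] = torch.linalg.eigvalsh(H)
    return spectra

for name, eigs in blockwise_hessian_spectra(model).items():
    print(f"{name}: lambda_max={eigs.max().item():.2e}, lambda_min={eigs.min().item():.2e}")
```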
Their intuition for this is that CNNs consist of many convolution layers whose parameter blocks resemble one another, while transformers stack many disparate kinds of parameter blocks non-sequentially, such as query-key-value projections, output projections and MLP layers.
They argue that this 'block heterogeneity' in transformers calls for an optimizer that can treat each block differently, a flexibility offered by Adam (which adapts its step size coordinate-wise) but not by SGD (which applies a single learning rate everywhere). They also propose a measure of block heterogeneity that can be computed on any architecture to decide whether to train it with Adam or SGD.
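The review does not spell out the heterogeneity measure, so the following is only a plausible sketch under the assumption that it resembles an average pairwise distance (here Jensen-Shannon divergence) between the blockwise eigenvalue densities; the paper's exact definition may differ. The input could be the `spectra` dictionary from the sketch above, converted to NumPy arrays.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def block_heterogeneity(spectra, n_bins=100):
    """Average pairwise JS divergence between histograms of the blockwise
    eigenvalue spectra (log10 of absolute values); larger means more heterogeneous."""
    logs = [np.log10(np.abs(np.asarray(s, dtype=float)) + 1e-12) for s in spectra.values()]
    lo, hi = min(l.min() for l in logs), max(l.max() for l in logs)
    hists = [np.histogram(l, bins=n_bins, range=(lo, hi))[0].astype(float) for l in logs]
    pairs = [(i, j) for i in range(len(hists)) for j in range(i + 1, len(hists))]
    return float(np.mean([js_divergence(hists[i], hists[j]) for i, j in pairs]))
```

Under this reading, a larger score would suggest a more 'transformer-like' Hessian, where blockwise adaptation (Adam) is expected to help, while a smaller score suggests SGD may suffice.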
Finally, they give theoretical results on the performance of SGD and Adam on quadratic models, bounding the convergence rate of SGD in terms of the condition number of the full Hessian, and that of Adam in terms of a quantity involving the condition numbers of the individual diagonal sub-blocks of the Hessian.
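To make the shape of these results concrete, here is a schematic of the quadratic setting described above; the rates and constants are simplified placeholders consistent with the review's description, not the paper's precise theorems.

```latex
% Schematic quadratic setting with a block-diagonal Hessian (simplified, not the
% paper's exact statements).
\[
  \min_{w}\; L(w) = \tfrac{1}{2}\, w^{\top} H w, \qquad
  H = \operatorname{diag}(H_1, \dots, H_L) \succ 0 .
\]
% (S)GD: the rate is governed by the condition number of the full Hessian,
\[
  L(w_t) \;\lesssim\; \Bigl(1 - \tfrac{1}{\kappa(H)}\Bigr)^{t} L(w_0),
  \qquad \kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)} .
\]
% Adam-like blockwise preconditioning: the rate instead depends on a quantity
% built from the per-block condition numbers, e.g. their maximum,
\[
  L(w_t) \;\lesssim\; \Bigl(1 - \tfrac{1}{\max_{\ell} \kappa(H_\ell)}\Bigr)^{t} L(w_0),
  \qquad \kappa(H_\ell) = \frac{\lambda_{\max}(H_\ell)}{\lambda_{\min}(H_\ell)} ,
\]
% which can be much smaller than the global condition number when the blocks
% live on very different scales ("block heterogeneity").
```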
This paper gives an interesting theoretical explanation for an empirical phenomenon and offers practitioners a tool for choosing an optimizer based on the architecture of their model.