Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression
Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
The authors present a compelling empirical study of in-context learning (ICL) in Transformer models. Their experimental setup is a multivariate linear regression task, in which they compare the query predictions induced at successive Transformer layers to the iterates of known iterative least-squares algorithms.
In their main experiment, the authors fine-tune a GPT-2 architecture (originally pre-trained by Garg et al.) on sequences consisting of pairs of in-context covariates and targets, followed by a single query covariate vector; the model is tasked with predicting the corresponding label. By varying the query covariate while keeping the in-context examples fixed, one can estimate a vector of induced regression coefficients. To examine these coefficients at each Transformer layer, the authors train a linear readout post hoc, enabling a direct comparison to the intermediate iterates of classical iterative least-squares solvers.
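To illustrate the probing idea (a minimal sketch, not the authors' code), the snippet below assumes access to a per-layer prediction function; `layer_prediction` is a hypothetical placeholder for running the Transformer up to a given layer on the fixed context plus a query and applying the trained linear readout, and it is faked here so the script runs end to end.

```python
import numpy as np

d = 8            # covariate dimension
n_context = 32   # number of in-context (x, y) pairs kept fixed
rng = np.random.default_rng(0)

# Fixed in-context examples from a noiseless linear model.
w_true = rng.normal(size=d)
X_ctx = rng.normal(size=(n_context, d))
y_ctx = X_ctx @ w_true

# What a fully converged least-squares solver would recover from the context.
w_ls, *_ = np.linalg.lstsq(X_ctx, y_ctx, rcond=None)

def layer_prediction(x_query, layer):
    """Hypothetical stand-in for: run the Transformer up to `layer` on the
    fixed context plus this query, then apply the trained linear readout.
    Faked as a partially converged solution so the example is runnable."""
    shrink = 1.0 - 0.5 ** (layer + 1)   # deeper layer -> closer to the LS solution
    return shrink * (x_query @ w_ls)

def induced_weights(layer, n_queries=200):
    """Recover the regression coefficients implied by a layer's predictions
    by regressing those predictions on many random query covariates."""
    X_q = rng.normal(size=(n_queries, d))
    preds = np.array([layer_prediction(x, layer) for x in X_q])
    w_hat, *_ = np.linalg.lstsq(X_q, preds, rcond=None)
    return w_hat

for layer in range(6):
    w_hat = induced_weights(layer)
    print(f"layer {layer}: ||w_hat - w_ls|| = {np.linalg.norm(w_hat - w_ls):.4f}")
```

The per-layer coefficient vectors recovered in this way are what the paper then matches against the iterates of candidate least-squares algorithms.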
Interestingly, the Transformer’s layer-by-layer regression weights align with the iterates of the second-order Newton–Schulz algorithm (up to an approximately linear correspondence between layer index and iteration count). In particular, the authors observe a doubly exponential convergence rate, in contrast to the (singly) exponential rate of gradient descent. Moreover, the Transformer’s performance remains robust on ill-conditioned design matrices, mirroring another known property of the Newton–Schulz method. These findings challenge the widely held claim that Transformers implement ICL primarily by emulating gradient descent updates across their layer stack.
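As a back-of-the-envelope illustration of this rate gap (a sketch under assumed problem sizes, not the paper's experiments), the Newton–Schulz update M_{k+1} = 2 M_k - M_k A M_k for A = X^T X converges doubly exponentially to the inverse of A, whereas gradient descent on the least-squares objective contracts the error only by a constant factor per step and stalls on ill-conditioned designs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
X[:, 0] *= 5.0                    # stretch one direction -> ill-conditioned design
w_star = rng.normal(size=d)
y = X @ w_star

A = X.T @ X
b = X.T @ y
lam_max = np.linalg.norm(A, 2)    # largest eigenvalue of the PSD matrix A

M = A / lam_max**2                # Newton-Schulz init, ensures ||I - M A||_2 < 1
w_gd = np.zeros(d)
eta = 1.0 / lam_max               # stable gradient-descent step size

for k in range(20):
    M = 2 * M - M @ A @ M         # Newton-Schulz: error contracts quadratically
    w_ns = M @ b                  # least-squares estimate implied by current M
    w_gd = w_gd - eta * (A @ w_gd - b)   # one gradient-descent step
    print(f"iter {k:2d}   Newton-Schulz err {np.linalg.norm(w_ns - w_star):.2e}"
          f"   GD err {np.linalg.norm(w_gd - w_star):.2e}")
```

Within the 20 iterations shown, the Newton–Schulz estimate reaches numerical precision while gradient descent has made only modest progress along the poorly scaled directions, which is the qualitative behaviour the review attributes to the Transformer's layers.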