Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression
Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
The authors present a compelling empirical study of in-context learning (ICL) in Transformer models. Their experimental setup is a multivariate linear regression task, in which they compare the query predictions induced at successive Transformer layers to the iterates of known iterative least-squares algorithms.
In their main experiment, the authors fine-tune a GPT-2 architecture (originally pre-trained by Garg et al.) on sequences consisting of pairs of in-context covariates and targets, followed by a single query covariate vector; the model is tasked with predicting the corresponding label. By varying the query covariate while keeping the in-context examples fixed, one can estimate a vector of induced regression coefficients. To examine these coefficients at each Transformer layer, the authors train a linear readout post hoc, enabling a direct comparison to the intermediate iterates of classical iterative least-squares solvers.
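To illustrate the probing idea (a minimal sketch, not the authors' code), the snippet below assumes access to a per-layer prediction function; `layer_prediction` is a hypothetical placeholder for running the Transformer up to a given layer on the fixed context plus a query and applying the trained linear readout, and it is faked here so the script runs end to end.

```python
import numpy as np

d = 8            # covariate dimension
n_context = 32   # number of in-context (x, y) pairs kept fixed
rng = np.random.default_rng(0)

# Fixed in-context examples from a noiseless linear model.
w_true = rng.normal(size=d)
X_ctx = rng.normal(size=(n_context, d))
y_ctx = X_ctx @ w_true

# What a fully converged least-squares solver would recover from the context.
w_ls, *_ = np.linalg.lstsq(X_ctx, y_ctx, rcond=None)

def layer_prediction(x_query, layer):
    """Hypothetical stand-in for: run the Transformer up to `layer` on the
    fixed context plus this query, then apply the trained linear readout.
    Faked as a partially converged solution so the example is runnable."""
    shrink = 1.0 - 0.5 ** (layer + 1)   # deeper layer -> closer to the LS solution
    return shrink * (x_query @ w_ls)

def induced_weights(layer, n_queries=200):
    """Recover the regression coefficients implied by a layer's predictions
    by regressing those predictions on many random query covariates."""
    X_q = rng.normal(size=(n_queries, d))
    preds = np.array([layer_prediction(x, layer) for x in X_q])
    w_hat, *_ = np.linalg.lstsq(X_q, preds, rcond=None)
    return w_hat

for layer in range(6):
    w_hat = induced_weights(layer)
    print(f"layer {layer}: ||w_hat - w_ls|| = {np.linalg.norm(w_hat - w_ls):.4f}")
```

The per-layer coefficient vectors recovered in this way are what the paper then matches against the iterates of candidate least-squares algorithms.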
Interestingly, the Transformer’s layer-by-layer regression weights align with the iterates of the second-order Newton–Schulz algorithm (up to an approximately linear correspondence between layer index and iteration count). In particular, the authors observe a doubly exponential convergence rate, in contrast to the (singly) exponential rate of gradient descent. Moreover, the Transformer’s performance remains robust on ill-conditioned design matrices, mirroring another known property of the Newton–Schulz method. These findings challenge the widely held claim that Transformers implement ICL primarily by emulating gradient descent updates across their layer stack.
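As a back-of-the-envelope illustration of this rate gap (a sketch under assumed problem sizes, not the paper's experiments), the Newton–Schulz update M_{k+1} = 2 M_k - M_k A M_k for A = X^T X converges doubly exponentially to the inverse of A, whereas gradient descent on the least-squares objective contracts the error only by a constant factor per step and stalls on ill-conditioned designs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
X[:, 0] *= 5.0                    # stretch one direction -> ill-conditioned design
w_star = rng.normal(size=d)
y = X @ w_star

A = X.T @ X
b = X.T @ y
lam_max = np.linalg.norm(A, 2)    # largest eigenvalue of the PSD matrix A

M = A / lam_max**2                # Newton-Schulz init, ensures ||I - M A||_2 < 1
w_gd = np.zeros(d)
eta = 1.0 / lam_max               # stable gradient-descent step size

for k in range(20):
    M = 2 * M - M @ A @ M         # Newton-Schulz: error contracts quadratically
    w_ns = M @ b                  # least-squares estimate implied by current M
    w_gd = w_gd - eta * (A @ w_gd - b)   # one gradient-descent step
    print(f"iter {k:2d}   Newton-Schulz err {np.linalg.norm(w_ns - w_star):.2e}"
          f"   GD err {np.linalg.norm(w_gd - w_star):.2e}")
```

Within the 20 iterations shown, the Newton–Schulz estimate reaches numerical precision while gradient descent has made only modest progress along the poorly scaled directions, which is the qualitative behaviour the review attributes to the Transformer's layers.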