
NeurIPS Paper Reviews 2024 #10

7 February 2025

Julian, Quantitative Researcher

In this paper review series, our team of researchers and machine learning practitioners discuss the papers they found most interesting at NeurIPS 2024.

Here, discover the perspectives of Quantitative Researcher, Julian.

Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression

Deqing Fu, Tian-qi Chen, Robin Jia, Vatsal Sharan

The authors present a compelling empirical study of in-context learning (ICL) in Transformer models. Their experimental setup is a multivariate linear regression task, in which they compare the model’s out-of-fit predictions at different Transformer layers to the steps of known iterative least squares algorithms.

In their main experiment, the authors fine-tune a GPT-2 architecture (originally pre-trained by Garg et al.) on sequences consisting of pairs of in-fit covariates and targets, followed by a single out-of-fit covariate vector for which the model must predict the associated label. By varying the out-of-fit covariate vector while keeping the in-fit samples fixed, one can estimate the vector of regression coefficients the model implicitly applies. To examine these coefficients at each Transformer layer, the authors train a linear readout post hoc, enabling a direct comparison with the intermediate iterates of classical iterative least squares solvers.
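
The probing procedure can be illustrated with a short sketch. This is only a schematic reconstruction of the idea, not the authors’ code: `predict_fn` is a hypothetical wrapper around the (layer-truncated) Transformer plus the trained linear readout, and the recovered vector is what the review above calls the induced regression coefficients.

```python
import numpy as np

def induced_weights(predict_fn, X_ctx, y_ctx, d, n_probe=64, seed=0):
    """Estimate the regression coefficients a model implicitly applies to the
    out-of-fit covariate, with the in-fit (in-context) sample held fixed.

    predict_fn(X_ctx, y_ctx, x_query) -> scalar prediction; a hypothetical
    interface to the Transformer truncated at some layer plus a linear readout.
    """
    rng = np.random.default_rng(seed)
    X_probe = rng.standard_normal((n_probe, d))   # varied out-of-fit covariates
    preds = np.array([predict_fn(X_ctx, y_ctx, x) for x in X_probe])
    # If the query-to-prediction map is (approximately) linear, its coefficient
    # vector can be recovered by ordinary least squares over the probe points.
    w_hat, *_ = np.linalg.lstsq(X_probe, preds, rcond=None)
    return w_hat
```

The coefficient vectors obtained this way, one per layer, can then be compared against the iterates of a classical solver.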

Interestingly, the Transformer’s layer-by-layer regression weights align with the steps of the second-order Newton–Schulz algorithm (up to a constant rate and offset in the iteration index). In particular, the authors observe a doubly-exponential convergence rate, in contrast to the exponential rate seen for gradient descent. Moreover, the Transformer’s performance remains robust for ill-conditioned design matrices, mirroring another known property of the Newton–Schulz method. These findings challenge the widely held claim that Transformers implement ICL primarily by emulating gradient descent updates across their layer stack.
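
For reference, the comparison target is essentially the Newton–Schulz iteration for inverting the Gram matrix. Below is a minimal, self-contained sketch of that iterative least squares solver; the initialisation and scaling are chosen here for simplicity and the paper’s exact parametrisation may differ.

```python
import numpy as np

def newton_schulz_regression(X, y, n_iter=10):
    """Iterative least squares via the second-order Newton-Schulz iteration.

    Each step refines an approximation M_k of (X^T X)^{-1}:
        M_{k+1} = M_k (2 I - A M_k),  with  A = X^T X,
    and the induced regression weights are w_k = M_k X^T y.
    The error contracts quadratically, i.e. doubly-exponentially in k.
    """
    A = X.T @ X
    d = A.shape[0]
    M = A / np.linalg.norm(A, "fro") ** 2   # safe start: spectral radius of (I - M A) < 1
    weights = []
    for _ in range(n_iter):
        M = M @ (2 * np.eye(d) - A @ M)
        weights.append(M @ X.T @ y)          # regression coefficients at this iteration
    return weights

# Example: the late iterates approach the ordinary least squares solution.
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5))
y = X @ rng.standard_normal(5)
ws = newton_schulz_regression(X, y, n_iter=12)
print(np.linalg.norm(ws[-1] - np.linalg.lstsq(X, y, rcond=None)[0]))  # ~0
```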


Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization

Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie

This paper presents an interesting theoretical argument deriving the benign overfitting regime as a function of the number of independent training samples and the signal-to-noise ratio for the (two-layer) Vision Transformer model (ViT). The authors define a data-generating process which sets (without loss of generality) the first input patch to an embedding of the target label and uses independent noise samples, living in the orthogonal complement of the target embeddings, for the remaining patches. The variance of those noise patches determines the signal-to-noise ratio. This setup appears to be a natural adaptation of the data-generating process studied by Cao, Chen, Belkin, and Gu (2022).
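
A toy version of this data-generating process can be written down directly. The function below is only a sketch consistent with the description above; the paper’s exact scalings and its precise definition of the signal-to-noise ratio may differ.

```python
import numpy as np

def make_dataset(n, d, n_patches, signal_strength, noise_std, seed=0):
    """Toy signal-plus-orthogonal-noise data model for patch inputs.

    Patch 0 carries the label signal y * mu; the remaining patches are Gaussian
    noise projected onto the orthogonal complement of mu. Illustrative only.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    mu[0] = signal_strength                              # fixed signal direction (w.l.o.g.)
    proj = np.eye(d) - np.outer(mu, mu) / (mu @ mu)      # projector onto mu's complement
    y = rng.choice([-1.0, 1.0], size=n)                  # binary labels
    X = np.empty((n, n_patches, d))
    X[:, 0, :] = y[:, None] * mu                         # signal patch
    noise = noise_std * rng.standard_normal((n, n_patches - 1, d))
    X[:, 1:, :] = noise @ proj                           # noise patches orthogonal to mu
    return X, y
```

The signal-to-noise ratio is then governed by the ratio of `signal_strength` to `noise_std`, up to dimension-dependent factors.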

The phenomenon of benign overfitting is characterized by achieving (almost) zero loss on both the training and test sets. More formally, for any arbitrarily small positive number, we can find a training iteration at which both the train and test loss fall below this level. The main theorem of the paper states that the ViT enters this regime with high probability if the number of samples exceeds a threshold that is inversely proportional to the squared signal-to-noise ratio. Conversely, the authors show that if the number of independent samples is too small relative to the signal-to-noise ratio, the model enters the regime of harmful overfitting, where it memorises noise in the training data.
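
Schematically, and restating only the informal description above rather than the paper’s precise conditions, the dichotomy takes the following form, where n is the number of independent training samples and SNR the signal-to-noise ratio:

```latex
% Informal restatement of the dichotomy (constants and log factors omitted;
% see the paper's main theorem for the precise conditions):
n \cdot \mathrm{SNR}^2 \;\gtrsim\; C_1 \;\Longrightarrow\; \text{benign overfitting (train and test loss} \to 0\text{)},
\qquad
n \cdot \mathrm{SNR}^2 \;\lesssim\; C_2 \;\Longrightarrow\; \text{harmful overfitting (noise memorisation)}.
```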

Compared with the two-layer convolutional neural networks studied by Cao et al., the notable finding here is that ViTs can achieve benign overfitting in fewer training iterations.


Amortized Planning with Large-Scale Transformers: A Case Study on Chess

Anian Ruoss, Grégoire Delétang, Sourabh Medapati, Jordi Grau-Moya, Li Kevin Wenliang, Elliot Catt, John Reid, Cannada A. Lewis, Joel Veness, Tim Genewein

The paper demonstrates that Transformer models can effectively plan chess moves without relying on explicit search algorithms, which are central to state-of-the-art chess engines. Given the game’s complexity, chess presents a compelling challenge where neural networks cannot succeed through simple memorization of training data.

The dataset used in this study comprises 530 million board states with approximately 15 billion total legal next moves. Each move is labeled with its win probability, as calculated by Stockfish 16, the strongest publicly available chess engine. The authors released this dataset, derived from human games on lichess.org, under the name ChessBench. Board states are represented in FEN notation, which, while omitting the complete move sequence resulting in the given position, allows encoding any board state in a fixed-length context of about 80 tokens. They trained several decoder-only Transformer models, scaling up to 270 million parameters. Evaluation metrics were computed on a test set of around 10,000 chess puzzles from lichess.org, covering a wide range of Elo scores (a measure of puzzle difficulty).
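
To make the fixed-length FEN context concrete, here is a minimal character-level encoder. The vocabulary, padding scheme, and the exact context length of 80 are assumptions for this sketch rather than the paper’s tokeniser.

```python
# Minimal sketch of a fixed-length, character-level FEN encoding.
FEN_VOCAB = sorted(set("pnbrqkPNBRQK/ 0123456789-abcdefghw"))
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(FEN_VOCAB)}  # 0 is reserved for padding
MAX_LEN = 80   # fixed context length; FEN strings are right-padded up to this size

def encode_fen(fen: str, max_len: int = MAX_LEN) -> list[int]:
    """Map a FEN string to a fixed-length sequence of integer token ids."""
    ids = [CHAR_TO_ID[c] for c in fen]
    if len(ids) > max_len:
        raise ValueError("FEN longer than the fixed context length")
    return ids + [0] * (max_len - len(ids))

# Example: the standard starting position.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(encode_fen(start)[:20])
```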

The authors conducted an extensive ablation study with the following findings:

  1. Prediction Targets: They explored three targets; the last one performed best:
    • Classifying the best legal move for a given position.
    • Predicting win probabilities for board states, indirectly yielding probabilities for legal next moves.
    • Directly predicting the win probability of each legal move by including candidate moves in the model’s context.
  2. Label Smoothing and Binning: Binning probability predictions and applying label smoothing improved performance (a sketch of this idea follows the list).
  3. Model Depth: Increasing the network depth enhanced performance (up to a limit), which the authors attribute to improved iterative computation.
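
The binning-plus-smoothing idea from point 2 can be sketched as follows; the bin count, smoothing kernel, and width here are illustrative assumptions rather than the paper’s exact recipe.

```python
import numpy as np

def smoothed_bin_target(p_win: float, n_bins: int = 128, sigma: float = 0.02):
    """Turn a scalar win probability into a smoothed categorical target.

    The [0, 1] range is split into n_bins uniform buckets and the probability
    mass is spread over neighbouring buckets with a Gaussian kernel, a simple
    stand-in for label smoothing over bins.
    """
    centers = (np.arange(n_bins) + 0.5) / n_bins            # bin mid-points in [0, 1]
    logits = -((centers - p_win) ** 2) / (2 * sigma ** 2)   # Gaussian weights
    target = np.exp(logits - logits.max())
    return target / target.sum()                            # normalised soft label

# The model is trained with a cross-entropy loss against this soft target; a
# point estimate can be recovered as the expectation over the bin centers.
print(smoothed_bin_target(0.73).argmax())   # index of the bin closest to 0.73
```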

Remarkably, their largest Transformer model performed almost on par with the AlphaZero chess engine (run with half as many Monte Carlo tree search simulations as in the original paper) on the puzzle test set. The model was also tested on lichess.org, where it attained Elo scores of 2,300 against bots and 2,900 against humans. According to the authors, the discrepancy arises because Stockfish assigns a win probability of 100% to positions with a forced checkmate in n moves, while the Transformer lacks the ability to plan such mating sequences explicitly. This often resulted in accidental draws or even blunders that ultimately lost the game. Human opponents tended to resign in losing positions, whereas bots played on until checkmate.

A less encouraging result is the significant drop in playing strength when the model is applied out-of-distribution to Fischer random chess, where the pieces on the back rank of the starting position are shuffled subject to some basic constraints (e.g., castling must remain possible and the bishops must start on opposite-coloured squares). In contrast, conventional chess engines are known to have little trouble adapting to this variant.


