
NeurIPS Paper Reviews 2024 #2

23 January 2025
  • News
  • Quantitative Research

Trenton - Software Engineer

In this paper review series, our team of researchers and machine learning practitioners discuss the papers they found most interesting at NeurIPS 2024.

Here, discover the perspectives of Trenton, Software Engineer.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

With its impact on the community already established and the speed at which the field is moving, seeing FlashAttention-3 at NeurIPS 2024 felt almost nostalgic. Nevertheless, its significant advances in computational efficiency for scaling transformer models mean it was a well-deserved spotlight poster.

The paper demonstrates how algorithms need to be benchmarked and adapted when new hardware is released. FlashAttention-2 was highly efficient when it was written, achieving substantial speedups on Ampere GPUs, but the authors discovered that its gains did not translate well to Hopper GPUs: on an Nvidia H100, FlashAttention-2 reaches only around 35% of peak Tensor Core utilisation, leaving much of the hardware's throughput unused.
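
As a back-of-the-envelope check of what those percentages mean in absolute terms (assuming the H100 SXM's commonly quoted ~989 TFLOPS dense BF16/FP16 Tensor Core peak, which is not a figure taken from the paper itself):

```python
# Rough utilisation arithmetic for attention kernels on an H100 SXM.
# The ~989 TFLOPS figure is NVIDIA's dense BF16/FP16 Tensor Core peak
# and is an assumption here, not a number from the paper.
H100_PEAK_TFLOPS = 989.0

fa2_utilisation = 0.35     # ~35% utilisation reported for FlashAttention-2
fa3_tflops = 740.0         # ~740 TFLOPS reported for FlashAttention-3 (FP16)

print(f"FlashAttention-2: ~{fa2_utilisation * H100_PEAK_TFLOPS:.0f} TFLOPS achieved")
print(f"FlashAttention-3: ~{fa3_tflops / H100_PEAK_TFLOPS:.0%} utilisation")
```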

The paper suggests three key changes, exploiting the asynchrony of Tensor Cores and Tensor Memory Access (TMA), that raise GPU utilisation to 75-85%:

  1. Overlapping overall computation and data movement via warp-specialisation
  2. Interleaving block-wise matrix multiplication and softmax operations
  3. Block quantisation and incoherent processing that leverages hardware support for FP8 low-precision

These adjustments reduce memory bottlenecks and better leverage hardware potential, giving practitioners who use transformers faster training, faster inference and the ability to handle longer contexts. The authors demonstrate how, by thinking about hardware constraints and applying software engineering principles, models can be optimised without changing their core architecture. With hardware innovations always on the horizon, I’m excited to see what future iterations will bring.
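
To make the third change more concrete, here is a minimal NumPy sketch of per-block quantisation. It uses int8 as a stand-in for the FP8 (e4m3) format the paper actually targets, and the function names and block size are illustrative rather than taken from the FlashAttention-3 code:

```python
import numpy as np

def block_quantise(x: np.ndarray, block: int = 64):
    """Per-block symmetric quantisation of a 2-D array to 8-bit integers.

    Illustrative only: the paper targets the FP8 (e4m3) Tensor Core path
    and fuses the scaling into the attention matmuls; int8 stands in for
    "an 8-bit format" so the per-block scaling idea is easy to see.
    """
    n_blocks = int(np.ceil(x.shape[0] / block))
    scales = np.empty(n_blocks, dtype=np.float32)
    q = np.empty_like(x, dtype=np.int8)
    for i in range(n_blocks):
        blk = x[i * block:(i + 1) * block]
        scales[i] = np.abs(blk).max() / 127.0 + 1e-12          # one scale per block
        q[i * block:(i + 1) * block] = np.round(blk / scales[i]).astype(np.int8)
    return q, scales

def block_dequantise(q: np.ndarray, scales: np.ndarray, block: int = 64):
    out = q.astype(np.float32)
    for i, s in enumerate(scales):
        out[i * block:(i + 1) * block] *= s
    return out

x = np.random.randn(256, 128).astype(np.float32)
q, scales = block_quantise(x)
print("max abs reconstruction error:", np.abs(block_dequantise(q, scales) - x).max())
```

The real kernels additionally apply incoherent processing (multiplying by a random orthogonal transform to spread out outliers) before quantising, which a sketch like this omits.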


Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim

Parallelizing Linear Transformers with the Delta Rule over Sequence Length presents a new algorithm designed to enhance the performance and scalability of linear transformers. While linear transformers and state-space models have emerged as potential alternatives to traditional transformers with softmax attention, they still face limitations, particularly in tasks that require in-context retrieval.

The paper focuses on improving DeltaNet, a more expressive variant of linear transformers that uses the delta rule in place of the purely additive update found in traditional linear transformers. DeltaNet has been shown to improve associative recall, but has been held back by the lack of parallelisation over the sequence length during training.
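
For intuition, here is a minimal NumPy sketch contrasting the two updates as an explicit loop over time. The dimensions, beta value and unit-norm keys are illustrative choices, and this sequential loop is exactly what the paper's algorithm avoids materialising during training:

```python
import numpy as np

def additive_step(S, k, v):
    """Vanilla linear transformer: purely additive outer-product update."""
    return S + np.outer(v, k)

def delta_rule_step(S, k, v, beta):
    """DeltaNet: move the value stored under key k towards v (the delta rule)."""
    v_old = S @ k                         # what the memory currently returns for k
    return S + beta * np.outer(v - v_old, k)

d, T, rng = 8, 16, np.random.default_rng(0)
S_add, S_delta = np.zeros((d, d)), np.zeros((d, d))
for t in range(T):                        # the sequential loop the paper parallelises
    k = rng.standard_normal(d)
    k /= np.linalg.norm(k)                # keys kept (roughly) unit-norm
    v = rng.standard_normal(d)
    S_add = additive_step(S_add, k, v)
    S_delta = delta_rule_step(S_delta, k, v, beta=0.9)
    # With random keys, the additive memory accumulates interference,
    # while the delta-rule memory keeps the read-out error small.
    print(f"t={t:2d}  additive error {np.linalg.norm(S_add @ k - v):.2f}   "
          f"delta error {np.linalg.norm(S_delta @ k - v):.2f}")
```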

The authors introduce a hardware-efficient algorithm for training these models by leveraging a memory-efficient representation for computing products of Householder matrices, which significantly reduces memory consumption and computational overhead.
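
The "products of Householder matrices" observation can be sanity-checked in a few lines: each delta-rule step multiplies the state by a generalised Householder matrix (I - beta_t * k_t k_t^T), and a product of t of them can be stored compactly as I - W_t K_t^T without ever forming a d x d matrix per step. Below is a rough NumPy check of that identity; the variable names and the right-multiplication ordering are my own, and this is not the paper's chunked, hardware-efficient algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 6, 10
ks = rng.standard_normal((T, d))
ks /= np.linalg.norm(ks, axis=1, keepdims=True)
betas = rng.uniform(0.1, 1.0, size=T)

# Naive product of generalised Householder matrices, right-multiplied in order.
P = np.eye(d)
for k, b in zip(ks, betas):
    P = P @ (np.eye(d) - b * np.outer(k, k))

# Compact representation: P_T = I - sum_t w_t k_t^T,
# with w_t = beta_t * (k_t - sum_{i<t} w_i (k_i . k_t)).
W = np.zeros((T, d))
for t in range(T):
    W[t] = betas[t] * (ks[t] - W[:t].T @ (ks[:t] @ ks[t]))

P_compact = np.eye(d) - W.T @ ks
print("max difference:", np.abs(P - P_compact).max())   # ~1e-15
```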

Using the algorithm, the authors train a 1.3B parameter model on 100B tokens and show that it outperforms existing linear-time baselines such as Mamba and GLA in terms of both perplexity and zero-shot performance on downstream tasks.

Additionally, the paper explores two hybrid models that combine DeltaNet layers with either sliding-window attention or global attention layers, showing that these hybrids can outperform strong transformer baselines. This work provides a promising step forward in making linear transformers more practical and efficient for large-scale applications, potentially paving the way for more scalable models in the future.


RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia

RL-GPT introduces an innovative integration of Large Language Models (LLMs) and Reinforcement Learning (RL), offering a new approach to solving embodied tasks, or tasks that involve interacting with an environment.

While LLMs excel at generating high-level plans and understanding language, they often fall short when it comes to executing these plans with precision in dynamic, real-world scenarios.

Existing methods usually tackle this by fine-tuning LLMs with RL or using hierarchical RL structures. However, the authors of RL-GPT take a different approach: they treat RL as a tool for optimising precision in task-specific low-level actions, rather than relying on it for the entire process.

The paper introduces a two-level hierarchical framework consisting of a slow agent and a fast agent, trained in a two-loop iteration with a third critic agent providing feedback. The slow agent analyses which high-level actions are suitable for coding, while the fast agent writes the code-as-policy and folds those coded actions into the RL action space. The resulting system outperforms both traditional RL methods and previous GPT-based approaches on the MineDojo benchmark, a Minecraft-based testbed for evaluating embodied tasks.
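
As a purely structural sketch of that two-loop arrangement, with stubbed-out functions standing in for the LLM calls, the generated code and the RL training run, and with none of the names below taken from the paper, the control flow looks roughly like this:

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    coded_actions: list = field(default_factory=list)   # actions handled by generated code
    rl_actions: list = field(default_factory=list)       # actions left to the RL policy
    feedback: str = ""

def slow_agent(task: str, state: TaskState) -> TaskState:
    """Outer loop: decide which sub-actions are simple enough to hand to code."""
    state.coded_actions = ["locate_tree", "walk_to_tree"]   # illustrative split
    state.rl_actions = ["chop_log"]
    return state

def fast_agent(state: TaskState) -> dict:
    """Inner loop: write code-as-policy for the coded actions, leave the rest to RL."""
    policies = {name: f"def {name}(obs): ..." for name in state.coded_actions}
    policies.update({name: "trainable RL policy" for name in state.rl_actions})
    return policies

def critic_agent(score: float) -> str:
    """Third agent: turn the rollout outcome into feedback for the next iteration."""
    return "keep current decomposition" if score > 0.5 else "move more actions into code"

def train_and_evaluate(policies: dict) -> float:
    return 0.7   # stub for an actual RL run and environment rollout

state, task = TaskState(), "harvest a log"
for outer_iter in range(2):                # slow-agent (decomposition) loop
    state = slow_agent(task, state)
    for inner_iter in range(2):            # fast-agent (coding + RL) loop
        score = train_and_evaluate(fast_agent(state))
        state.feedback = critic_agent(score)
    print(f"iteration {outer_iter}: {state.feedback}")
```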

What makes this work particularly thought-provoking is its broader implications for the future of AI systems. It shows that a combination of methods can form a larger system, one in which LLMs are involved but language isn't the only interface for interaction, and which is capable of handling a wider range of tasks. It raises the question of what applications will emerge as these methods continue to improve.

