
NeurIPS Paper Reviews 2024 #3

23 January 2025
  • News
  • Quantitative Research

Mark, Senior Quantitative Researcher

In this paper review series, our team of researchers and machine learning practitioners discuss the papers they found most interesting at NeurIPS 2024.

Here, discover the perspectives of Senior Quantitative Researcher, Mark.

Why Transformers Need Adam: A Hessian Perspective

Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo

This paper puts forward an explanation for the practical observation that SGD performs well when training some architectures, such as CNNs, but poorly when training transformer-based models, whereas Adam performs well in both cases.

The authors do this by analysing the Hessian of the loss with respect to the weights. From previous work it is known that the Hessians of neural networks have a near-block-diagonal structure. While CNNs and transformers have broadly similar overall Hessian spectra, the authors find that the diagonal blocks have similar spectra to one another in CNNs, but markedly different spectra in transformers.

Their intuition for this is that CNNs are built from many convolution layers with similar parameter blocks, while transformers are built from disparate parameter blocks stacked together, such as query, key and value projections, output projections and MLP layers.

They argue that the transformer situation of ‘block heterogeneity’ benefits from an optimizer that allows for specific treatment of each block, a flexibility offered by Adam but not by SGD. They also propose a measure of block heterogeneity that can be computed on any architecture to decide whether to use Adam or SGD when training.
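As a rough illustration of what such a measure might look like in practice (a minimal sketch, not the authors' exact metric; the function names and the use of a single top eigenvalue per block are simplifying assumptions), one could estimate the largest-magnitude Hessian eigenvalue of each parameter block via power iteration on Hessian-vector products and compare how much these values vary across blocks:

```python
import torch

def top_block_eigenvalue(loss, param, iters=20):
    """Estimate the largest-magnitude eigenvalue of the Hessian block for one
    parameter tensor, via power iteration on Hessian-vector products."""
    grad = torch.autograd.grad(loss, param, create_graph=True)[0]
    v = torch.randn_like(param)
    v /= v.norm()
    eig = torch.zeros(())
    for _ in range(iters):
        # Hessian-vector product restricted to this parameter block.
        hv = torch.autograd.grad(grad, param, grad_outputs=v, retain_graph=True)[0]
        eig = torch.dot(hv.flatten(), v.flatten())  # Rayleigh quotient v^T H v
        v = hv / (hv.norm() + 1e-12)
    return eig.item()

def block_spectrum_summary(model, loss):
    """Top Hessian eigenvalue per named parameter block; a large spread across
    blocks is a crude proxy for the paper's notion of block heterogeneity."""
    return {name: top_block_eigenvalue(loss, p)
            for name, p in model.named_parameters() if p.requires_grad}
```

The paper itself works with full blockwise spectra rather than a single eigenvalue per block, but even this cheap summary makes the CNN-versus-transformer contrast described above easy to probe.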

Finally, they give theoretical results on the performance of SGD and Adam for quadratic models, linking bounds on the performance of SGD to the condition number of the full Hessian, and bounds for Adam to a quantity involving the condition numbers of the individual diagonal blocks.
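To give a sense of the type of result involved (standard facts about quadratics, used here purely for illustration rather than as the paper's exact bounds), consider a block-diagonal quadratic loss:

```latex
\[
  L(w) = \tfrac{1}{2}\, w^{\top} H w, \qquad
  H = \operatorname{diag}(H_1, \dots, H_L).
\]
Gradient descent with step size $1/\lambda_{\max}(H)$ contracts the error at a
rate governed by the global condition number,
\[
  \lVert w_{t+1} - w^{\ast} \rVert
  \le \Bigl(1 - \tfrac{1}{\kappa(H)}\Bigr) \lVert w_{t} - w^{\ast} \rVert,
  \qquad
  \kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)},
\]
whereas an optimizer that chooses a separate step size for each block is limited
only by the per-block condition numbers, since
$\max_{l} \kappa(H_l) \le \kappa(H)$, and the gap can be large when the blocks
have very different spectra.
```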

This paper offers an interesting theoretical explanation for an empirical phenomenon and gives practitioners a tool for choosing an optimizer based on the architecture of their model.


Poisson Variational Autoencoder

Hadi Vafaii, Dekel Galor, Jacob L. Yates

The authors of this paper are interested in the correspondence between artificial neural networks and biological brains. They argue that variational autoencoders (VAEs) are a promising model of perception in the brain because they incorporate the idea of perceptual inference and learn representations similar to those found in the cortex.

However, there are key differences; in particular, biological neurons fire discretely, with the firing rate thought to encode information, while VAEs typically use continuous distributions in the latent space.

This motivates them to create the Poisson Variational Autoencoder (P-VAE), which uses the discrete Poisson distribution in its latent space. Their main contribution is handling the problem of performing inference over discrete, Poisson-distributed latent variables: they represent a Poisson variable of rate x as the number of events occurring in unit time of a process whose inter-event wait times are exponentially distributed with mean 1/x, and they make the counting step differentiable by replacing the hard threshold function with a sigmoid.
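A minimal sketch of that construction (the function name, the truncation at a fixed maximum number of events and the temperature are illustrative assumptions, not the authors' reference implementation): draw exponential inter-event times with mean 1/x, accumulate them into arrival times, and softly count how many arrivals fall inside the unit interval.

```python
import torch

def relaxed_poisson_sample(rate, n_max=100, temperature=0.1):
    """Differentiable surrogate for a Poisson(rate) sample.

    Exponential(rate) inter-arrival times are drawn by inverse-CDF sampling,
    cumulatively summed into arrival times, and the hard indicator
    1[arrival_time < 1] is replaced by a sigmoid so that gradients with
    respect to `rate` flow through the (soft) event count.
    """
    u = torch.rand(rate.shape + (n_max,))
    inter_arrival = -torch.log1p(-u) / rate.unsqueeze(-1)   # mean 1/rate
    arrival_times = torch.cumsum(inter_arrival, dim=-1)
    soft_indicators = torch.sigmoid((1.0 - arrival_times) / temperature)
    return soft_indicators.sum(dim=-1)                       # soft count of events

# Example: gradients reach the rates of a small batch of latent units.
rates = torch.tensor([0.5, 2.0, 10.0], requires_grad=True)
counts = relaxed_poisson_sample(rates)
counts.sum().backward()
```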

They derive the resulting loss function and show its similarity to the loss function of sparse coding, with a specific ‘metabolic cost term’ penalizing firing rates.
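To see where a term of this shape can arise (a standard identity, shown here only as an illustration rather than the paper's derivation), the KL divergence between a Poisson posterior with rate r and a Poisson prior with rate r_0 is

```latex
\[
  \mathrm{KL}\bigl(\mathrm{Poisson}(r)\,\Vert\,\mathrm{Poisson}(r_0)\bigr)
  = r \log \frac{r}{r_0} - r + r_0 ,
\]
```

which, for a fixed prior rate, increasingly penalises posterior rates above the prior rate; minimising the loss therefore discourages high firing rates, loosely analogous to the sparsity penalty in sparse coding.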

In practical experiments on the Van Hateren natural image database, CIFAR 16×16, and MNIST, they observe that the P-VAE:

  • (i) learns basis vectors similar to those of sparse coding
  • (ii) avoids posterior collapse
  • (iii) learns sparse representations
  • (iv) is more sample-efficient in downstream tasks than Gaussian-based VAEs
  • (v) learns representations with higher-dimensional geometry

This work shows that some of the shortcomings in using VAEs to model biological neurons can be overcome.


Noether’s Razor: Learning Conserved Quantities

Tycho F. A. van der Ouderaa, Mark van der Wilk, Pim de Haan

Many physical systems exhibit symmetries. When modelling the behaviour of such systems using neural networks, generalisation and overall performance are often improved by incorporating such symmetries; a classic example being how CNNs incorporate translational symmetries.

This paper proposes a new method to learn such symmetries by parameterising them in Hamiltonian machine learning models and connecting them to conserved quantities via Noether's theorem.

Through an ODE flow, each conserved quantity generates a one-parameter subgroup of diffeomorphisms under which the Hamiltonian is invariant. Restricting to quadratic conserved quantities, the authors obtain a closed form for this flow as a matrix exponential. They use the flow to generate orbits of the group action and symmetrise observables by approximately integrating over these orbits via Monte Carlo sampling.
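For intuition, here is a small sketch of that mechanism (illustrative assumptions: a quadratic conserved quantity Q(z) = ½ zᵀA z on phase space z = (q, p), the canonical symplectic matrix J, hypothetical helper names, and a sampling range for the orbit parameter chosen for a rotation-like, compact orbit):

```python
import numpy as np
from scipy.linalg import expm

def symplectic_form(dof):
    """Canonical symplectic matrix J for `dof` position/momentum pairs."""
    zero, eye = np.zeros((dof, dof)), np.eye(dof)
    return np.block([[zero, eye], [-eye, zero]])

def orbit_flow(A, t):
    """Flow generated by the quadratic quantity Q(z) = 0.5 z^T A z:
    Hamilton's equations give dz/dt = J A z, so z(t) = expm(t J A) z(0)."""
    J = symplectic_form(A.shape[0] // 2)
    return expm(t * (J @ A))

def symmetrise(observable, A, z, n_samples=64, t_max=2.0 * np.pi, rng=None):
    """Monte Carlo average of an observable over the one-parameter orbit of z."""
    rng = np.random.default_rng() if rng is None else rng
    ts = rng.uniform(0.0, t_max, size=n_samples)
    return np.mean([observable(orbit_flow(A, t) @ z) for t in ts], axis=0)
```

Averaging the Hamiltonian over such orbits is what makes it approximately invariant to the candidate symmetry during training.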

With this in place, they fit models by maximising the marginal likelihood of the data given the parameterised symmetries. The marginal likelihood must balance fit to the data against model complexity, favouring simpler, symmetric models and giving rise to an Occam's razor effect.

Since the marginal likelihood is generally intractable in closed form, they instead optimise a variational lower bound that incorporates the symmetrised Hamiltonian, using it as the training loss.
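In its generic form (the standard variational bound; the particular likelihood built from the symmetrised Hamiltonian is specific to the paper), the objective is a lower bound on the log marginal likelihood:

```latex
\[
  \log p(\mathcal{D} \mid \theta)
  \;\ge\;
  \mathbb{E}_{q_{\phi}(w)}\!\bigl[\log p(\mathcal{D} \mid w, \theta)\bigr]
  - \mathrm{KL}\bigl(q_{\phi}(w)\,\Vert\,p(w)\bigr),
\]
where $\theta$ parameterises the symmetries, $w$ are the model weights with
variational posterior $q_{\phi}(w)$, and the bound is maximised jointly over
$\theta$ and $\phi$.
```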

Using matrix normal posteriors factorised per layer to reduce the number of variational parameters, they fit models to several physical systems with known symmetries, finding that their method recovers the correct symmetries and outperforms non-symmetric models in generalisation.

The paper shows that when modelling Hamiltonian systems, one does not have to choose between imposing or not imposing symmetries on a model, but that the symmetries can be learned, both in theory and in practice.


Read more paper reviews

NeurIPS 2024: Paper Review #1

Discover the perspectives of Casey, one of our Machine Learning Engineers, on the following papers:

  • Towards Scalable and Stable Parallelization of Nonlinear RNNs
  • Logarithmic Math in Accurate and Efficient AI Inference Accelerators
NeurIPS 2024: Paper Review #2

Discover the perspectives of Trenton, one of our Software Engineers, on the following papers:

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
  • Parallelizing Linear Transformers with the Delta Rule over Sequence Length
  • RL-GPT: Integrating Reinforcement Learning and Code-as-policy
NeurIPS 2024: Paper Review #4

Discover the perspectives of Angus, one of our Machine Learning Engineers, on the following papers:

  • einspace: Searching for Neural Architectures from Fundamental Operations
  • SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization
NeurIPS 2024: Paper Review #5

Discover the perspectives of Dustin, one of our Scientific Directors, on the following papers:

  • QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
  • An Image is Worth 32 Tokens for Reconstruction and Generation
  • Dimension-free deterministic equivalents and scaling laws for random feature regression
NeurIPS 2024: Paper Review #6

Discover the perspectives of Georg, one of our Quant Research Managers, on the following papers:

  • Optimal Parallelization of Boosting
  • Learning Formal Mathematics From Intrinsic Motivation
  • Learning on Large Graphs using Intersecting Communities

Coming Soon

NeurIPS 2024: Paper Review #7

Discover the perspectives of Cedric, one of our Quantitative Researchers, on the following papers:

  • Preference Alignment with Flow Matching
  • A Generative Model of Symmetry Transformations

Coming Soon

NeurIPS 2024: Paper Review #8

Discover the perspectives of Hugh, one of our Scientific Directors, on the following papers:

  • Better by default: Strong pre-tuned MLPs and boosted trees on tabular data
  • Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data

Coming Soon

NeurIPS 2024: Paper Review #9

Discover the perspectives of Andrew, one of our Quant Research Managers, on the following papers:

  • Algorithmic Capabilities of Random Transformers
  • The Road Less Scheduled
  • Time Series in the Age of Large Models

Coming Soon

