Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo
This paper puts forward an explanation for the practical observation that SGD performs well when training some architectures, such as CNNs, but poorly when training transformer-based models, whereas Adam performs well in both cases.
The authors do this by analysing the Hessian of the loss with respect to the weights. From previous work it is known that the Hessians of neural networks have a near-block-diagonal structure, with blocks corresponding to parameter groups such as individual layers. While CNNs and transformers have broadly similar overall Hessian spectra, the authors find that the spectra of the diagonal blocks are similar to one another in CNNs but differ markedly across blocks in transformers.
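As a concrete illustration of what "blockwise spectra" means here, below is a minimal PyTorch sketch that computes the eigenvalues of each diagonal Hessian block (one block per parameter tensor) for a toy model. The model, data and exact double-backward loop are illustrative assumptions, not the authors' setup; at realistic scale one would approximate the spectra, e.g. with stochastic Lanczos quadrature, rather than form the blocks explicitly.

```python
import torch
import torch.nn as nn

# Toy model and data purely for illustration (assumptions, not the paper's setup).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
x, y = torch.randn(32, 4), torch.randint(0, 3, (32,))
loss_fn = nn.CrossEntropyLoss()

def blockwise_hessian_spectra(model):
    """Eigenvalues of each diagonal Hessian block, one block per parameter tensor."""
    spectra = {}
    for name, p in model.named_parameters():
        loss = loss_fn(model(x), y)
        # First-order gradient w.r.t. this parameter block, kept in the graph.
        g = torch.autograd.grad(loss, p, create_graph=True)[0].reshape(-1)
        n = g.numel()
        H = torch.zeros(n, n)
        # Second differentiation, one row of the diagonal block at a time.
        for i in range(n):
            H[i] = torch.autograd.grad(g[i], p, retain_graph=True)[0].reshape(-1)
        spectra[name] = torch.linalg.eigvalsh(H)
    return spectra

for name, eigs in blockwise_hessian_spectra(model).items():
    print(f"{name}: lambda_max={eigs.max().item():.2e}, lambda_min={eigs.min().item():.2e}")
```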
Their intuition for this is that CNNs consist of many convolution layers whose parameter blocks resemble one another, while transformers stack many disparate kinds of parameter blocks non-sequentially, such as query-key-value projections, output projections and MLP layers.
They argue that this 'block heterogeneity' in transformers calls for an optimizer that can treat each block differently, a flexibility offered by Adam (which adapts its step size coordinate-wise) but not by SGD (which applies a single learning rate everywhere). They also propose a measure of block heterogeneity that can be computed on any architecture to decide whether to train it with Adam or SGD.
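The review does not spell out the heterogeneity measure, so the following is only a plausible sketch under the assumption that it resembles an average pairwise distance (here Jensen-Shannon divergence) between the blockwise eigenvalue densities; the paper's exact definition may differ. The input could be the `spectra` dictionary from the sketch above, converted to NumPy arrays.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def block_heterogeneity(spectra, n_bins=100):
    """Average pairwise JS divergence between histograms of the blockwise
    eigenvalue spectra (log10 of absolute values); larger means more heterogeneous."""
    logs = [np.log10(np.abs(np.asarray(s, dtype=float)) + 1e-12) for s in spectra.values()]
    lo, hi = min(l.min() for l in logs), max(l.max() for l in logs)
    hists = [np.histogram(l, bins=n_bins, range=(lo, hi))[0].astype(float) for l in logs]
    pairs = [(i, j) for i in range(len(hists)) for j in range(i + 1, len(hists))]
    return float(np.mean([js_divergence(hists[i], hists[j]) for i, j in pairs]))
```

Under this reading, a larger score would suggest a more 'transformer-like' Hessian, where blockwise adaptation (Adam) is expected to help, while a smaller score suggests SGD may suffice.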
Finally, they give theoretical results on the performance of SGD and Adam on quadratic models, bounding the convergence rate of SGD in terms of the condition number of the full Hessian, and that of Adam in terms of a quantity involving the condition numbers of the individual diagonal sub-blocks of the Hessian.
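To make the shape of these results concrete, here is a schematic of the quadratic setting described above; the rates and constants are simplified placeholders consistent with the review's description, not the paper's precise theorems.

```latex
% Schematic quadratic setting with a block-diagonal Hessian (simplified, not the
% paper's exact statements).
\[
  \min_{w}\; L(w) = \tfrac{1}{2}\, w^{\top} H w, \qquad
  H = \operatorname{diag}(H_1, \dots, H_L) \succ 0 .
\]
% (S)GD: the rate is governed by the condition number of the full Hessian,
\[
  L(w_t) \;\lesssim\; \Bigl(1 - \tfrac{1}{\kappa(H)}\Bigr)^{t} L(w_0),
  \qquad \kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)} .
\]
% Adam-like blockwise preconditioning: the rate instead depends on a quantity
% built from the per-block condition numbers, e.g. their maximum,
\[
  L(w_t) \;\lesssim\; \Bigl(1 - \tfrac{1}{\max_{\ell} \kappa(H_\ell)}\Bigr)^{t} L(w_0),
  \qquad \kappa(H_\ell) = \frac{\lambda_{\max}(H_\ell)}{\lambda_{\min}(H_\ell)} ,
\]
% which can be much smaller than the global condition number when the blocks
% live on very different scales ("block heterogeneity").
```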
This paper gives an interesting theoretical explanation for an empirical phenomenon and offers practitioners a tool for choosing an optimizer based on the architecture of their model.