
NeurIPS Paper Reviews 2023 #5

23 January 2024
  • Quantitative Research

Our team of quantitative researchers has shared the most interesting research presented during workshops and seminars at NeurIPS 2023.

Discover the perspectives of machine learning engineer Laurynas, as he discusses his most compelling findings from the conference.


Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré

A number of great papers have come out of Christopher Ré's lab over the past few years, bringing ideas from database design and classical signal processing to neural sequence modelling: leveraging the GPU memory hierarchy to derive a hardware-aware implementation of the Transformer attention mechanism (FlashAttention), and adapting state-space models over continuous signals to discrete language modelling (S4, H3). In both cases, these contributions enabled model training with long-range context on reduced hardware resources.

In this paper, the authors tackle the quadratic runtime scaling of attention architectures (O(N²) in the sequence length N) by building on the ideas above. As demonstrated in prior work, architectures based on long convolutions have proved to be powerful and promising replacements for attention modules. They have a much lower O(N log N) asymptotic runtime (via an FFT implementation) but suffer from poor GPU utilisation. By leveraging the previously introduced, expressive, structured (block-diagonal) Monarch matrices, the authors propose the Monarch Mixer architecture, which exhibits sub-quadratic runtime and much higher GPU utilisation, thus allowing training on longer sequences.
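To make the structure concrete, here is a minimal NumPy sketch (my own simplification, not the authors' code) of a Monarch-style matrix-vector product: the dense N × N matrix is replaced by two block-diagonal factors interleaved with a permutation (here a plain reshape/transpose), so the whole product reduces to batched GEMMs with roughly O(N^1.5) work.

```python
# A minimal sketch of a Monarch-style matvec, assuming N = b * b.
# The two block-diagonal factors are applied as batched matmuls
# (GEMMs), which is where the high GPU utilisation comes from;
# the total cost is ~2 * b^3 = 2 * N^1.5 multiply-adds instead of
# N^2 for a dense matrix.
import numpy as np

def monarch_matvec(x, L, R):
    """x: (N,) input with N = b * b.
    L, R: (b, b, b) stacks of b dense b x b blocks (the two
    block-diagonal factors)."""
    b = L.shape[0]
    X = x.reshape(b, b)                  # split the sequence into b chunks of b
    X = np.einsum('ijk,ik->ij', L, X)    # block-diagonal matmul #1
    X = X.T                              # permutation (transpose) between factors
    X = np.einsum('ijk,ik->ij', R, X)    # block-diagonal matmul #2
    return X.T.reshape(-1)               # undo the permutation and flatten

N, b = 16, 4
x = np.random.randn(N)
L = np.random.randn(b, b, b)
R = np.random.randn(b, b, b)
y = monarch_matvec(x, L, R)
```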

In the particular case of sequence prediction, enforcing a causal relationship between input and output tokens is essential, and this causality is lost in a naive FFT implementation. The authors derive a novel interpretation of Monarch matrix multiplication as multivariate polynomial evaluation and interpolation, which I found particularly surprising and interesting.
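As a reminder of where the causality issue comes from (this is the standard zero-padding trick, not the paper's contribution), a plain FFT product computes a circular convolution, so long-convolution layers normally pad before transforming; the authors instead need causality to hold for the Monarch factorisation itself, which is what motivates the polynomial interpretation.

```python
# Why causality needs care with FFT-based convolutions: without
# zero-padding, the FFT computes a *circular* convolution and future
# tokens wrap around into past positions.
import numpy as np

def causal_fft_conv(u, k):
    n = len(u)
    L = 2 * n                             # pad to avoid circular wrap-around
    U = np.fft.rfft(u, L)
    K = np.fft.rfft(k, L)
    return np.fft.irfft(U * K, L)[:n]     # keep only the causal prefix

u = np.random.randn(8)                    # input sequence
k = np.random.randn(8)                    # learned long-convolution filter
y = causal_fft_conv(u, k)
# matches the direct causal convolution y[t] = sum_{s<=t} k[s] * u[t-s]
assert np.allclose(y, np.convolve(u, k)[:8])
```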


QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

Large language models (LLMs) are notoriously hardware intensive to train and to run inference on at 16-/32-bit precision.

Quantizing model weights can be an efficient way to run inference; however, it often breaks down at training time. On the other hand, fine-tuning all model weights for downstream tasks results in a distinct task-specific copy of the parameters for every task. Previous work demonstrated that Low-Rank Adaptation (LoRA) can be an efficient way to fine-tune LLMs (in fact, any model): the original parameter matrix W₀ is frozen, and only a pair of low-rank weight matrices A and B is learned, giving the new fine-tuned weights as W = W₀ + BA.
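A minimal PyTorch sketch of this idea (my own illustration with assumed shapes and hyper-parameters, not the paper's code): the frozen base weight W₀ is combined with a trainable low-rank update BA, so only r·(d_in + d_out) parameters are trained instead of d_in·d_out.

```python
# LoRA-style linear layer: frozen W0 plus a trainable low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out, bias=False)
        self.W0.weight.requires_grad_(False)                 # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))         # low-rank factor B, zero init
        self.scale = alpha / r

    def forward(self, x):
        # y = x W0^T + scale * x (B A)^T  ==  x (W0 + scale * B A)^T
        return self.W0(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=1024, d_out=1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 1024 trainable parameters vs 1024 * 1024 frozen ones
```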

In this paper the authors, building on the LoRA methodology, demonstrate that storing the frozen weights W₀ in a quantized 4-bit representation while learning 16-bit LoRA parameters does not degrade performance compared to the fine-tuned full-precision counterparts. Moreover, fine-tuning a 65B-parameter LLaMA model becomes feasible on a single 48GB GPU. This is achieved through three main contributions: (i) a 4-bit NormalFloat storage data type, (ii) Double Quantization to reduce the memory overhead of quantization, and (iii) paged optimizers that leverage NVIDIA unified memory to manage memory spikes.

In particular, (i) relies on the observation that trained model weights tend to follow a normal distribution, so quantizing with NormalFloat yields more uniformly filled quantization buckets and a reduced quantization error. Furthermore, since the weight quantization error grows with the quantization block size, a sufficiently small block size is required to maintain model quality; this, however, incurs a significant overhead for storing the per-block scaling factors. For (ii), the authors employ a second level of quantization, of the first-level scaling factors themselves, reducing the total memory requirement. The surprising thing is that this does not degrade overall performance on many empirical language benchmarks.
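The numerical recipe is easy to sketch (below is my own simplified illustration, using plain symmetric 4-bit quantization rather than the NF4 data type, with assumed block and group sizes): weights are quantized block by block with one scaling factor per block, and those scaling factors are then quantized a second time to shrink the per-block overhead.

```python
# Blockwise quantization plus double quantization of the scaling factors.
import numpy as np

def quantize_blockwise(w, block_size=64, n_bits=4):
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)       # one scale per block
    q_max = 2 ** (n_bits - 1) - 1
    q = np.round(blocks / scales * q_max).astype(np.int8)    # 4-bit codes (held in int8 here)
    return q, scales.ravel()

def double_quantize_scales(scales, group_size=256):
    # Second-level quantization: the fp32 per-block scales are themselves
    # quantized to 8 bits, with one fp32 "super-scale" per group.
    groups = scales.reshape(-1, group_size)
    super_scales = groups.max(axis=1, keepdims=True)
    q_scales = np.round(groups / super_scales * 127).astype(np.int8)
    return q_scales, super_scales.ravel()

w = np.random.randn(4096 * 4096).astype(np.float32)          # toy weight matrix
q, scales = quantize_blockwise(w)                            # 4-bit weights + fp32 scales
q_scales, super_scales = double_quantize_scales(scales)      # 8-bit scales + fp32 super-scales
```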


Read more of our quantitative researchers' thoughts

NeurIPS Paper Reviews 2023 #1

Discover the perspectives of Danny, one of our machine learning engineers, on the following papers:

  • A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
  • Normalization Layers Are All That Sharpness-Aware Minimization Needs
NeurIPS Paper Reviews 2023 #2

Discover the perspectives of Paul, one of our quantitative researchers, on the following papers:

  • Sharpness-Aware Minimization Leads to Low-Rank Features
  • When Do Neural Nets Outperform Boosted Trees on Tabular Data?
NeurIPS Paper Reviews 2023 #3

Discover the perspectives of Szymon, one of our quantitative researchers, on the following papers:

  • Convolutional State Space Models for Long-Range Spatiotemporal Modeling
  • How to Scale Your EMA
NeurIPS Paper Reviews 2023 #4

Discover the perspectives of Dustin, our scientific director, on the following papers:

  • Abide by the law and follow the flow: conservation laws for gradient flows
  • The Tunnel Effect: Building Data Representations in Deep Neural Networks
NeurIPS Paper Reviews 2023 #6

Discover the perspectives of Rui, one of our quantitative analysts, on the following papers:

  • Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting
  • Conformal Prediction for Time Series with Modern Hopfield Networks
