NeurIPs Paper Reviews 2023 #5

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré

A number of great papers came out of Christopher Re’s lab over the past few years, bringing ideas from database design and classical signal processing to neural sequence modelling. In particular, leveraging GPU memory architecture to derive hardware-aware implementation (FlashAttention) of Transformer mechanism, and adopting state-space models on continuous signals for discrete language modelling (S4, H3), respectively. In both cases, these contributions enabled model training with long-range context and reduced hardware resources.

In this paper, authors tackle the quadratic runtime scaling problem ( in the sequence length) of attention architectures by building on the ideas above. As demonstrated in prior works, long-convolution based architectures proved to be powerful and promising replacements for attention modules. They possess much lower asymptotic runtime (via FFT implementation), however, suffer from poor GPU utilisation (only ). By leveraging previously introduced expressive structured (block-diagonal) Monarch matrices authors propose a Monarch Mixer architecture, which exhibits sub-quadratic runtime and much higher GPU utilization of , thus allowing training on increased sequence lengths.

In the particular case of sequence prediction, enforcing causal relationship between input and output tokens is essential, which is lost in the FFT implementation. Authors derive a novel interpretation of Monarch matrix multiplication as a multivariate polynomial evaluation and interpolation, which I found particularly surprising and interesting.

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

NeurIPS 2022 Paper Reviews

Read paper reviews from NeurIPS 2022 from a number of our quantitative researchers and machine learning practitioners.

Read now

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

Large language models (LLMs) are notoriously hardware intensive to train and run inference at 16-/32-bit precision.

Quantizing model weights can be an efficient way to run inference, however, that often breaks down at training time. On the other hand, fine-tuning entire model weights for downstream tasks would result in distinct task-specific parameter copies. It was demonstrated in the previous work, that Low-rank Adaptors (LoRA) could be an efficient way to fine-tune LLMs (in-fact, any models), where the original parameter copy is frozen, and only a low-rank/low-cardinality weight matrices are learned giving the new fine-tuned weights as .

In this paper, authors, building on the LoRA methodology, demonstrates that efficiently storing weights in a quantized 4-bit representation, and learning 16-bit parameters does not degrade performance compared to the fine-tuned full-precision counterparts. Moreover, fine-tuning LLaMA 56B parameter model becomes feasible on a single 48GB GPU. This is achieved by three main contributions: (i) using 4-bit NormalFloat storage data-type, (ii) Double Quantization for reducing memory overhead of quantization, and (iii) leveraging Nvidia unified memory paging for managing memory spikes.

In particular, (i) relies on the observation that trained model weights tend to follow normal distribution, thus quantizing with NormalFloats results in more uniform quantization buckets and reduced quantization error. Furthermore, weight quantization error being proportional to quantization block-size requires a sufficiently small block-size for maintaining model’s quality. However, this can cause significant overhead for storing block scaling factors. Authors, (ii) employ a second level of quantization (of the first quantization scaling factors), thus reducing total memory requirement. The surprising thing being that such operation does not degrade overall performance on many empirical language benchmarks.

QLORA: Efficient Finetuning of Quantized LLMs

Quantitative Research and Machine Learning

Want to learn more about life as a researcher at G-Research?

Learn more

Latest News

G-Research March 2025 Grant Winners

22 Apr 2025

Each month, we provide up to £2,000 in grant money to early career researchers in quantitative disciplines. Hear from our March grant winners.

Read article

Invisible Work of OpenStack: Eventlet Migration

25 Mar 2025

Hear from Jay, an Open Source Software Engineer, on tackling technical debt in OpenStack. As technology evolves, outdated code becomes inefficient and harder to maintain. Jay highlights the importance of refactoring legacy systems to keep open-source projects sustainable and future-proof.

Read article

SXSW 2025: Key takeaways from our Engineers

24 Mar 2025

At G-Research we stay at the cutting edge by prioritising learning and development. That’s why we encourage our people to attend events like SXSW, where they can engage with industry experts and explore new ideas. Hear from two Dallas-based Engineers, as they share their key takeaways from SXSW 2025.

Read article

Latest Events

Quantitative Engineering
Quantitative Research

SIAM Conference on Financial Mathematics and Engineering

15 Jul 2025 - 18 Jul 2025 Hyatt Regency Miami, 400 SE 2nd St, Miami, FL 33131, United States

More info

Quantitative Engineering
Quantitative Research

MPP/MPQ Career Day

30 Apr 2025 Max Planck Institute for Physics, Boltzmannstraße 8, 85748 Garching bei München, Germany

More info

Quantitative Engineering
Quantitative Research

Imperial PhD Careers Fair

10 Jun 2025 Queen's Tower Rooms, Sherfield Building, South Kensington Campus, Imperial College London, London, SW7 2AZ

More info

NeurIPs Paper Reviews 2023 #5

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

QLoRA: Efficient Finetuning of Quantized LLMs

Quantitative Research and Machine Learning

Read more of our quantitative researchers thoughts

Latest News

Latest Events

SIAM Conference on Financial Mathematics and Engineering

MPP/MPQ Career Day

Imperial PhD Careers Fair

Stay up to date with
G-Research

NeurIPs Paper Reviews 2023 #5

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

QLoRA: Efficient Finetuning of Quantized LLMs

Quantitative Research and Machine Learning

Read more of our quantitative researchers thoughts

Latest News

Latest Events

SIAM Conference on Financial Mathematics and Engineering

MPP/MPQ Career Day

Imperial PhD Careers Fair

Stay up to date with G-Research

Stay up to date with
G-Research