The best of ICML 2022 – Paper reviews (part 6)


8 October 2022
  • Quantitative Research

This article is one of a series of paper reviews from our researchers and machine learning engineers – view more

G-Research booth at ICML 2022

Last month, G-Research were Diamond sponsors at ICML, hosted this year in Baltimore, US.

As well as having a stand and team in attendance, a number of our quantitative researchers and machine learning engineers attended as part of their ongoing learning and development.

We asked our quants and machine learning practitioners to write about some of the papers and research that they found most interesting.

Here, Jonathan L, Quantitative Researcher at G-Research, discusses three papers.

Domain Adaptation for Time Series Forecasting via Attention Sharing

Xiaoyong Jin, Youngsuk Park, Danielle C. Maddix, Hao Wang, Yuyang Wang

The authors address time series forecasting with deep neural networks given a source dataset with abundant data samples and a target dataset with only a limited number of samples, where the series in the two domains may have different representations.

As the authors argue, this setting is challenging because domain-specific predicted values are not drawn from a fixed vocabulary, and many domain-specific confounding factors cannot be encoded in a single pre-trained model. They propose a novel domain adaptation framework (DAF) whose main innovation is to train two distinct models jointly with a shared attention module. More precisely, their solution employs one sequence generator per domain to process that domain's time series.

Each sequence generator consists of an encoder, an attention module and a decoder. Since each domain provides data with distinct patterns from different spaces, the encoders and decoders are kept private to their respective domains, while the core attention module is shared by both domains for adaptation.
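As a rough sketch of this layout (module names, dimensions and the use of a standard multi-head attention layer are my own simplifications, not code from the paper), each domain owns a private encoder and decoder while a single attention module instance is shared:

```python
import torch
import torch.nn as nn

class SequenceGenerator(nn.Module):
    """One forecaster per domain: private encoder/decoder, shared attention."""
    def __init__(self, input_dim, hidden_dim, shared_attention):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)   # domain-private
        self.decoder = nn.Linear(hidden_dim, 1)           # domain-private
        self.attention = shared_attention                 # shared across domains

    def forward(self, x):                  # x: (batch, seq_len, input_dim)
        h = torch.relu(self.encoder(x))    # domain-specific encoding
        h, _ = self.attention(h, h, h)     # shared attention over the sequence
        return self.decoder(h)             # domain-specific forecast

# The same attention instance is passed to both generators
shared_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
source_model = SequenceGenerator(input_dim=8, hidden_dim=64, shared_attention=shared_attn)
target_model = SequenceGenerator(input_dim=5, hidden_dim=64, shared_attention=shared_attn)
```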

A second key ingredient, alongside the shared attention module, is a shared discriminator that induces the keys and queries of the shared attention module to be domain-invariant. The discriminator aims to classify the domain of each key-query pair, and it is trained in an adversarial manner against the generators, which aim to confuse it. The authors show the benefits of this shared attention design through extensive numerical results.
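The adversarial part might look roughly like the following (again a schematic with hypothetical shapes, assuming the key/query features have been pooled to one 64-dimensional vector per series): the discriminator is trained to classify which domain a representation came from, while the generators receive the negated loss so that the shared representations become domain-invariant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Discriminator over pooled key/query features from the shared attention module
discriminator = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

def adversarial_losses(source_kq, target_kq):
    """source_kq, target_kq: (batch, 64) key/query features from each domain."""
    logits = torch.cat([discriminator(source_kq), discriminator(target_kq)])
    labels = torch.cat([torch.zeros(len(source_kq), 1), torch.ones(len(target_kq), 1)])
    d_loss = F.binary_cross_entropy_with_logits(logits, labels)  # discriminator: identify the domain
    g_loss = -d_loss  # generators: confuse the discriminator (minimax objective)
    return d_loss, g_loss
```

In practice the discriminator and the two sequence generators would be updated in alternating steps, with the forecasting losses of both domains added to the generator objective.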

This architecture is conceptually simple and versatile enough to be applied to many real-world problems.

Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions

Aaron Mishkin, Arda Sahiner, Mert Pilanci

In this paper, the authors propose a new fast algorithm for optimising two-layer neural networks with ReLU activation functions and weight decay.

The key starting point is to observe that the ReLU activation function partitions the variable space into a number P of linear cones (activation patterns): for an n x d data matrix X, each pattern is the binary vector 1[Xu >= 0] recording which rows of Xu are non-negative, as u ranges over the space of d-dimensional real vectors.
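As a toy illustration of these patterns (my own example, not from the paper), one can count the distinct patterns reachable on a small dataset by sampling random directions u:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2
X = rng.standard_normal((n, d))                    # n x d data matrix

patterns = set()
for _ in range(10_000):                            # sample random directions u
    u = rng.standard_normal(d)
    patterns.add(tuple((X @ u >= 0).astype(int)))  # which rows of Xu are non-negative

print(f"{len(patterns)} distinct activation patterns for n={n}, d={d}")
```

Random sampling of this kind is essentially the subsampling heuristic mentioned below.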

By considering all such possible patterns, they reformulate the non-convex optimisation problem as a constrained convex optimisation problem, with one block of optimisation variables (playing the role of a neuron) and one set of linear constraints per pattern. The computational challenge lies in the prohibitively large number P of such patterns.
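For reference, here is a schematic form of this convex program, following my reading of the convex-reformulation literature (the paper's notation may differ), with D_i the diagonal 0/1 matrix encoding pattern i:

```latex
\min_{\{v_i, w_i\}_{i=1}^{P}}\;
\frac{1}{2}\Big\| \sum_{i=1}^{P} D_i X (v_i - w_i) - y \Big\|_2^2
\;+\; \lambda \sum_{i=1}^{P} \big( \|v_i\|_2 + \|w_i\|_2 \big)
\quad \text{s.t.}\quad (2D_i - I) X v_i \ge 0,\;\; (2D_i - I) X w_i \ge 0 .
```

The second term is the group-l2 regularisation discussed next, and the linear constraints confine each block of variables to its cone.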

However, the convex problem involves a group-l2 regularisation that enforces sparsity over the neurons, so only a small number of patterns are expected to be ‘active’ at an optimal solution. Indeed, a heuristic approach in the literature so far has been to subsample such patterns randomly. The authors develop a more principled approach that safely ignores irrelevant patterns. Having reduced the dimensionality of the problem in this way, they solve it with an accelerated proximal gradient method and an augmented Lagrangian solver.
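The workhorse of a proximal gradient method for such a group-sparse objective is the proximal operator of the group-l2 penalty, which is block soft-thresholding: it shrinks every neuron block and zeroes out those below the threshold. A minimal illustrative sketch (my own, not the authors' implementation):

```python
import numpy as np

def group_soft_threshold(V, lam):
    """Prox of lam * sum_i ||V[i]||_2 (block soft-thresholding).

    V: (P, d) array with one row per activation pattern / neuron block.
    Rows with norm below lam are set exactly to zero, enforcing neuron sparsity.
    """
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * V
```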

Their algorithms are shown to be faster than standard training heuristics for the non-convex problem, such as SGD, and to outperform interior-point solvers. Remarkably, the resulting models generalise as well as models trained with Adam or SGD on standard machine learning benchmarks. Although this work only addresses two-layer ReLU networks, it lays solid ground for tackling deeper networks with more intricate architectures, and for GPU acceleration. I am looking forward to their follow-up work.

Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers

Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza Mardani, Mert Pilanci

In a vein similar to the aforementioned work that addresses two-layer ReLU networks, this paper aims to understand attention modules through the lens of convexity.

The authors formulate an equivalent convex program that can be solved to global optimality for an architecture with m attention heads followed by a channel-mixing layer (a.k.a. a classification head).

An appealing result is the clear interpretability of the attention weights that the convex formulation provides. More precisely, they first demonstrate that a self-attention model with linear activation (as opposed to the standard softmax activation) can be reformulated as a convex objective involving a linear model weighted by a feature correlation matrix, with a nuclear-norm penalty that ties the individual models together.
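To fix ideas, ‘linear activation’ here simply means dropping the softmax from the usual attention map. A toy single-head sketch (my own notation, omitting output projections and the multi-head structure):

```python
import torch

def self_attention(X, Wq, Wk, Wv, linear=True):
    """Single-head self-attention on X of shape (seq_len, dim).

    With linear=True the softmax is replaced by the identity, which is the
    variant whose convex reformulation the paper establishes first.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
    if not linear:
        scores = torch.softmax(scores, dim=-1)   # standard softmax attention
    return scores @ V                            # linear-activation attention when linear=True
```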

In particular, this convex formulation implicitly clusters correlated features and assigns an importance weight to each cluster. Conversely, they show that an attention module trained in the (standard) non-convex parameterisation can be mapped to the convex space. This is practical even for real-world models, and appealing because it lets one inspect these clusters of features for trained attention modules.

Finally, these results extend to non-linear activations such as ReLU and to several variants of channel-mixing layers, with similar interpretability.

View more ICML 2022 paper reviews

