Convolutional State Space Models for Long-Range Spatiotemporal Modeling
Jimmy T.H. Smith, Shalini De Mello, Jan Kautz, Scott W. Linderman, Wonmin Byeon
Linear state space models (SSMs) are a classical tool in signal processing and control theory. In recent years they have also attracted interest from machine learning researchers, after it was found that composing properly structured and initialized models of this type with nonlinear activations yields a new kind of scalable and efficient sequence architecture. Such architectures, which can be thought of as sophisticated linear RNNs, have shown impressive performance on a number of long-sequence classification tasks and on audio processing, and more recently have achieved state-of-the-art results in language modelling at a scale of a few billion parameters. The model proposed in this paper is a state space model architecture designed for processing spatiotemporal data.
The key idea is to use convolution to define the state transition, input, and output operations of the state space model. This is analogous to how ConvRNNs such as ConvLSTM and ConvGRU, which are commonly used in spatiotemporal modelling, use convolution to define their recurrent step. Since convolution is a linear operation, this definition yields a linear state space model with a distinct structure. The authors restrict the state transition convolution to be pointwise and integrate this design into the previously proposed S5 layer. Furthermore, they leverage initialization schemes from earlier SSMs and implement an efficient parallel scan for the proposed layer. The resulting architecture, named ConvS5, combines the fast stateful inference and linear-in-sequence-length training cost of RNNs with the parallelizability of transformers. The proposed model matches or outperforms state-of-the-art models on a number of spatiotemporal benchmarks, while also comparing favourably to them in terms of required computational resources.
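To make the parallel-scan point concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes the pointwise state transition has been reduced to a single scalar multiplier `a` acting on one state element (one pixel of one channel), that `bs[k]` stands for the already-convolved input contribution at step k, and that the initial state is zero. The names `combine`, `sequential_states`, and `scan_states` are illustrative only.

```python
def combine(left, right):
    """Compose two affine maps x -> a*x + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def sequential_states(a, bs):
    """Reference RNN-style loop: x_k = a * x_{k-1} + b_k, with x_0 = 0."""
    states, x = [], 0.0
    for b in bs:
        x = a * x + b
        states.append(x)
    return states


def scan_states(a, bs):
    """Same states via an inclusive scan with the associative `combine`.

    Each round only reads the previous round's values, so all positions
    in a round could be updated in parallel; the number of rounds grows
    logarithmically with the sequence length.
    """
    elems = [(a, b) for b in bs]
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With a zero initial state, the offset of each composed map is the state.
    return [b for _, b in elems]


if __name__ == "__main__":
    inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(sequential_states(0.5, inputs))
    print(scan_states(0.5, inputs))
```

Because the recurrence is linear, consecutive steps compose into a single affine map, and that composition is associative; this is what lets the states be computed by a scan rather than a strictly sequential loop. A nonlinear step such as the ConvLSTM update does not compose this way.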
Given that video is the next frontier of generative AI, I am looking forward to seeing how well generative video models with a ConvSSM backbone are going to perform.