Preference Alignment with Flow Matching
Minu Kim, Yongsik Lee, Sehyeok Kang, Jihwan Oh, Song Chong, Se-Young Yun
A big trend at NeurIPS this year was generative modelling, in particular diffusion and flow matching methods.
This paper applies some of the advances in flow matching to reinforcement learning from human feedback, a form of preference alignment whose aim is to align the behaviour of a given model with human (or AI-proxy) preferences.
Whereas some of the previous techniques in that field either require access to the model weights (and possibly significant computing power) for fine-tuning, or learn a reward model that can be prone to overfitting, Preference Flow Matching (PFM) only requires black-box access to the inference model and a way of determining which of two model samples is preferred for a given conditioning input, without learning any reward model.
Given these and a distribution of inputs, one can define the distributions of less preferred data and of more preferred data in the sample space by comparing outputs pairwise. PFM then learns a time-dependent flow from the former to the latter. At inference time, given a sample from the base model, one can simply flow it towards the more preferred distribution to obtain a better sample.
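To make this concrete, here is a minimal sketch of the idea rather than the authors' implementation: it assumes a hypothetical velocity network (`VelocityField`), toy Gaussians standing in for the less- and more-preferred samples, and the standard conditional flow matching recipe (straight-line interpolation paths with a squared-error velocity target), with new samples improved at inference via simple Euler integration.

```python
import torch
import torch.nn as nn

# Hypothetical velocity network v_theta(x, t); architecture and sizes are
# illustrative, not the paper's.
class VelocityField(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t], dim=-1))

def cfm_loss(v_theta: nn.Module, x_neg: torch.Tensor, x_pos: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching loss for a flow that carries a less-preferred
    sample x_neg (time 0) to a more-preferred sample x_pos (time 1)."""
    t = torch.rand(x_neg.size(0), 1)
    x_t = (1 - t) * x_neg + t * x_pos   # point on the straight-line path
    target = x_pos - x_neg              # constant velocity of that path
    return ((v_theta(x_t, t) - target) ** 2).mean()

@torch.no_grad()
def improve(v_theta: nn.Module, x: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """Push a base-model sample towards the preferred distribution by
    integrating the learned flow with a simple Euler scheme."""
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0), 1), i * dt)
        x = x + dt * v_theta(x, t)
    return x

# Toy training loop: the shifted Gaussians stand in for less-/more-preferred
# samples, which in PFM would come from pairwise preference comparisons of the
# black-box model's outputs for the same conditioning input.
dim = 8
v_theta = VelocityField(dim)
opt = torch.optim.Adam(v_theta.parameters(), lr=1e-3)
for _ in range(1000):
    x_neg = torch.randn(64, dim)
    x_pos = torch.randn(64, dim) + 2.0
    loss = cfm_loss(v_theta, x_neg, x_pos)
    opt.zero_grad()
    loss.backward()
    opt.step()

better = improve(v_theta, torch.randn(64, dim))  # flow new base samples forward
```

Note that only the velocity network is trained; the base model is never touched, which is what makes the approach applicable to black-box models.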
The authors apply their technique to several datasets, notably MNIST and IMDB, where the preference is given by the logits of a CNN or a sentiment classifier respectively, as well as various offline reinforcement learning tasks from D4RL, demonstrating that the preference objective is attained.
They also include several theoretical results, showing that PFM indeed “narrows” the base model distribution towards the points where the preference is increasing. Finally, they note that an iterative application of PFM is possible and can be beneficial.
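The iterative variant could look something like the following sketch, which reuses the `VelocityField`, `cfm_loss`, and `improve` pieces above and assumes two hypothetical callables: `base_samples(n)` for drawing base-model outputs and `prefers(a, b)` returning a boolean mask from the preference oracle. This is an illustration of the loop, not the authors' code.

```python
def iterate_pfm(base_samples, prefers, dim: int, rounds: int = 3, iters: int = 1000):
    """Hypothetical sketch of iterative PFM: each round draws samples from the
    current (already flowed) distribution, compares them pairwise with the
    preference oracle, and fits a new flow on top of the previous ones."""
    flows = []

    def current(n: int) -> torch.Tensor:
        x = base_samples(n)
        for v in flows:              # apply earlier flows in training order
            x = improve(v, x)
        return x

    for _ in range(rounds):
        v_theta = VelocityField(dim)
        opt = torch.optim.Adam(v_theta.parameters(), lr=1e-3)
        for _ in range(iters):
            a, b = current(64), current(64)
            a_wins = prefers(a, b)   # boolean mask: is row of `a` preferred over `b`?
            x_pos = torch.where(a_wins.unsqueeze(-1), a, b)
            x_neg = torch.where(a_wins.unsqueeze(-1), b, a)
            loss = cfm_loss(v_theta, x_neg, x_pos)
            opt.zero_grad()
            loss.backward()
            opt.step()
        flows.append(v_theta)
    return flows
```

Each round trains its flow on samples that have already been pushed by the earlier flows, which is why repeated application can move the distribution further towards the preferred region.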