Optimal Parallelization of Boosting
Arthur da Cunha, Mikael Møller Høgsgaard, Kasper Green Larsen
The main contribution of this paper is an algorithm that parallelizes boosting in a somewhat efficient way.
The scenario used by the authors is quite restrictive but the general idea should be easily extendable:
Consider a binary classification problem (i.e. all labels are either +1 or -1) on a sample set X of size m. A γ-weak-learner W is a learning algorithm (e.g. fitting a decision tree) that, for any distribution D over X, when given at least some constant number of samples drawn from D, produces with constant probability a hypothesis h whose error under D is at most 1/2 - γ.
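To make this contract concrete, here is a minimal sketch of a γ-weak-learner as a randomized decision stump. This is my own illustration, not code from the paper; the function name, the choice of stump, and the parameter n_samples are assumptions.

```python
import numpy as np

def decision_stump_weak_learner(X, y, D, n_samples=100, rng=None):
    """Hypothetical gamma-weak-learner: fit a decision stump on points
    drawn i.i.d. from the distribution D over the rows of X. With constant
    probability its error under D should be at most 1/2 - gamma
    (for a suitable gamma depending on the data)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=n_samples, p=D)   # sample from D
    Xs, ys = X[idx], y[idx]

    best, best_err = None, np.inf
    for j in range(X.shape[1]):               # feature to split on
        for thr in np.unique(Xs[:, j]):       # candidate threshold
            for sign in (+1, -1):
                pred = sign * np.where(Xs[:, j] >= thr, 1, -1)
                err = np.mean(pred != ys)
                if err < best_err:
                    best_err, best = err, (j, thr, sign)

    j, thr, sign = best
    # Return the hypothesis h: a function mapping rows of X to {-1, +1}.
    return lambda Z: sign * np.where(Z[:, j] >= thr, 1, -1)
```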
In many cases, computing W itself is not parallelizable, and classical boosting frameworks (e.g. AdaBoost) must call W sequentially, so no parallel computation is possible.
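The sequential bottleneck is visible in a standard AdaBoost loop: the distribution used in round k+1 depends on the hypothesis returned in round k, so the calls to W form a strict chain. The sketch below is the textbook procedure, shown only to illustrate this dependence.

```python
import numpy as np

def adaboost(X, y, weak_learner, rounds):
    m = len(X)
    D = np.full(m, 1.0 / m)             # uniform initial distribution over X
    hypotheses, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(X, y, D)       # must wait for the previous round
        pred = h(X)
        err = np.sum(D[pred != y])      # weighted error under D
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                    # next call to W depends on this h
        hypotheses.append(h)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hypotheses)))
```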
The main point of the paper is an algorithm that makes its calls to W in parallel in an optimal way. The total amount of work increases exponentially in this regime, but wall-clock time may still decrease if enough computational capacity is available.
The main idea of the algorithm is as follows:
For p rounds: subsample t sample sets from the current distribution in parallel and train a weak learner on each. Group the resulting t weak learners into a number of groups and (sequentially) apply a standard update step with the best (or the negation of the worst) weak learner in each group, but only if it improves the overall loss.
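The following is a rough sketch of one such round as I read the description above, not the authors' exact procedure. The function name, the loss callback, and the use of a thread pool are my own assumptions; the subsampling from D happens inside each call to the weak learner.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_boosting_round(X, y, D, weak_learner, t, n_groups, ensemble, loss):
    """One hypothetical round: t parallel calls to W, then a sequential
    pass over the groups, accepting the best learner per group only if it
    lowers the overall loss of the ensemble."""
    with ThreadPoolExecutor() as pool:
        # t independent (randomized) calls to W on subsamples drawn from D
        learners = list(pool.map(weak_learner, [X] * t, [y] * t, [D] * t))

    groups = np.array_split(np.arange(t), n_groups)
    for group in groups:
        # pick the best learner in this group w.r.t. weighted error under D
        # (the negated worst learner could be used instead; omitted here)
        errs = [np.sum(D[learners[i](X) != y]) for i in group]
        h = learners[group[int(np.argmin(errs))]]
        candidate = ensemble + [h]
        if loss(candidate, X, y) < loss(ensemble, X, y):   # accept only if it helps
            ensemble = candidate
        # a standard AdaBoost-style reweighting of D would follow each
        # accepted learner; omitted here for brevity
    return ensemble
```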
The paper then suggests how to choose the number of rounds p optimally given a desired degree of parallelism t. Here t = 1 roughly recovers AdaBoost, and along the optimal trade-off curve p · log t stays approximately constant.
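A back-of-the-envelope illustration of this trade-off (the constant C below is an arbitrary placeholder, not a value from the paper): holding p · log t fixed, increasing the parallelism t shrinks the number of sequential rounds only logarithmically, while the total number of calls to W grows rapidly.

```python
import math

C = 64.0                        # placeholder for the problem-dependent constant
for t in (2, 16, 256, 65536):
    p = C / math.log2(t)        # rounds needed at parallelism t, per p*log t ~ C
    print(f"t = {t:>6}  ->  p ~ {p:.1f} rounds, total work ~ {p * t:,.0f} calls to W")
```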
The paper also proves that this bound is optimal up to log factors by constructing a slightly pathological weak learner for the biased-coin problem. I am not completely convinced that such weak learners arise often in practice, so further improvements may be possible in many practical instances.