Plenty of material on incremental learning with LightGBM can be found, but most of it only covers how to use the interfaces at the code level (and a fair amount of it is shoddy content copied back and forth between sites), while material on the underlying principles and implementation is much harder to find.

On StackExchange, I found the following answer:

  1. LightGBM will add more trees if we update it through continued training (e.g. through BoosterUpdateOneIter). Assuming we use refit we will be using existing tree structures to update the output of the leaves based on the new data. It is faster than re-training from scratch, since we do not have to re-discover the optimal tree structures. Nevertheless, please note that almost certainly it will have worse performance (on the combined old and new data) than doing a full retrain from scratch on them.
  2. Any online learning algorithm will be designed to adapt to changes. That said, LightGBM's performance will depend on the training parameters we will use and how we will validate our predictions (e.g. how much we care to disregard previous data points). Assuming we properly train our booster, without having a relevant baseline (e.g. a ridge regression trained in an incremental manner) it does not make sense to say "LightGBM is good (or bad)" for dealing with concept drift.

In fact, as far as model updating is concerned, LightGBM provides two approaches:

  1. Fit the residuals on the incremental data with new trees.
  2. Refit the existing trees (refit).

The two approaches are described separately below.

Method 1

Looking at this spot in the source code, we can see that when the predictor is created, the status of init_model is checked: if an initial model exists, the Booster is loaded from it; otherwise a newly created one is used.

The subsequent training process is identical to normal training.

Method 2

Looking at this spot in the source code, refit updates the model by using the new data to update the existing Booster. The refit logic in the source calls into the C API; the details of the refit process can be found here.

It is worth noting that refit preserves the existing tree structures and only refits the tree parameters (the leaf outputs). This saves training time, but does not guarantee better results.



Last modification: April 21, 2022