9

What are the best practices for training xgboost (eXtreme Gradient Boosting) models on data that is too big to hold in memory at once? Splitting the data and training multiple models? Are there more elegant solutions?

Soerendip

3 Answers

6

You can train an xgboost model, compute its output (margin), and then continue training on top of that output; see the "boost from prediction" example in the xgboost demos.

I've not tried this myself, but you could train on the first subset of your data (say 10%), then continue training on the next subset, and so on.

Update

Step-by-step procedure

  1. Split the data into N manageable subsets; set n = 1
  2. Train xgboost on the n-th subset
  3. Calculate the prediction (margin) for the (n+1)-th subset using the model obtained in the previous step
  4. Attach that margin to the (n+1)-th subset as its base margin via setinfo, so the next round of boosting continues from it
  5. Increment n

Steps 2-5 are repeated until all N subsets have been used; a minimal sketch of the loop is shown below.
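
A minimal sketch in R of one pass through steps 2-5, following the "boost from prediction" demo. X1/y1 and X2/y2 stand for two chunks of the data loaded one at a time, and the parameter values are placeholders, not part of the original answer:

    library(xgboost)

    # Illustrative parameters only
    params <- list(objective = "binary:logistic", max_depth = 4, eta = 0.1)

    # Step 2: train on the first chunk
    dtrain1 <- xgb.DMatrix(data = X1, label = y1)
    bst <- xgb.train(params, dtrain1, nrounds = 50)

    # Step 3: compute the raw margin for the next chunk with the current model
    dtrain2 <- xgb.DMatrix(data = X2, label = y2)
    margin <- predict(bst, dtrain2, outputmargin = TRUE)

    # Step 4: attach the margin as base_margin so boosting continues from it
    setinfo(dtrain2, "base_margin", margin)

    # Back to step 2: further boosting rounds start from the supplied margin
    bst <- xgb.train(params, dtrain2, nrounds = 50)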

aivanov
2

I don't think what you are asking for is possible. See this issue.

I understand that you want to train the model on one part of the data and then continue training on another part, and so on, so @aivanov's answer will not help in this regard.

xiaodai
1

If you are using R, have you considered the bigmemory and ff packages?

I don't have much experience using these myself, but I would be interested to see if they help with the issue at hand.
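
For what it's worth, here is a hedged sketch of how bigmemory could feed xgboost one chunk at a time. The file name, the assumption that column 1 is the label, and the chunk size are all made up for illustration, and each chunk still has to be materialised as an ordinary matrix before xgb.DMatrix can use it:

    library(bigmemory)
    library(xgboost)

    # Memory-map a large CSV so the full file never has to sit in RAM at once
    big <- read.big.matrix("big_data.csv", header = TRUE, type = "double")

    # Pull one manageable slice of rows into an ordinary matrix and train on it;
    # repeat for further slices, e.g. combined with the base_margin trick above
    rows <- 1:100000
    chunk <- big[rows, ]
    dtrain <- xgb.DMatrix(data = chunk[, -1], label = chunk[, 1])
    bst <- xgb.train(list(objective = "binary:logistic"), dtrain, nrounds = 50)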

bradS