9

What are the best practices for training xgboost (eXtreme Gradient Boosting) models on data that is too big to hold in memory at once? Splitting the data and training multiple models? Are there more elegant solutions?

Soerendip

3 Answers

6

You can train an xgboost model, compute its output (margin), and then continue training on top of that output; see the "boost from prediction" example in the xgboost demos.

I've not tried this myself, but you could train on the first subset of your data (say 10%), then continue training on the next subset, and so on.

Update

Step-by-step procedure

  1. Split the data into N manageable subsets; set n = 1
  2. Train xgboost on the n-th subset
  3. Calculate the prediction (margin) for the (n+1)-th subset using the model obtained in the previous step
  4. Attach that margin to the (n+1)-th subset as its base margin via setinfo, so the next round of boosting continues from it
  5. Increment n

Steps 2-5 are repeated until all N subsets have been used; a minimal sketch of the loop is shown below.
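
A minimal sketch in R of one pass through steps 2-5, following the "boost from prediction" demo. X1/y1 and X2/y2 stand for two chunks of the data loaded one at a time, and the parameter values are placeholders, not part of the original answer:

    library(xgboost)

    # Illustrative parameters only
    params <- list(objective = "binary:logistic", max_depth = 4, eta = 0.1)

    # Step 2: train on the first chunk
    dtrain1 <- xgb.DMatrix(data = X1, label = y1)
    bst <- xgb.train(params, dtrain1, nrounds = 50)

    # Step 3: compute the raw margin for the next chunk with the current model
    dtrain2 <- xgb.DMatrix(data = X2, label = y2)
    margin <- predict(bst, dtrain2, outputmargin = TRUE)

    # Step 4: attach the margin as base_margin so boosting continues from it
    setinfo(dtrain2, "base_margin", margin)

    # Back to step 2: further boosting rounds start from the supplied margin
    bst <- xgb.train(params, dtrain2, nrounds = 50)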

aivanov
2

I don't think what you are asking for is possible. See this issue.

I understand that you want to train the model on one part of the data and then continue training on another part, and so on, so @aivanov's answer will not help in this regard.

xiaodai
1

If you are using R, have you considered the bigmemory and ff packages?

I don't have much experience using these myself, but I would be interested to see if they help with the issue at hand.
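
For what it's worth, here is a hedged sketch of how bigmemory could feed xgboost one chunk at a time. The file name, the assumption that column 1 is the label, and the chunk size are all made up for illustration, and each chunk still has to be materialised as an ordinary matrix before xgb.DMatrix can use it:

    library(bigmemory)
    library(xgboost)

    # Memory-map a large CSV so the full file never has to sit in RAM at once
    big <- read.big.matrix("big_data.csv", header = TRUE, type = "double")

    # Pull one manageable slice of rows into an ordinary matrix and train on it;
    # repeat for further slices, e.g. combined with the base_margin trick above
    rows <- 1:100000
    chunk <- big[rows, ]
    dtrain <- xgb.DMatrix(data = chunk[, -1], label = chunk[, 1])
    bst <- xgb.train(list(objective = "binary:logistic"), dtrain, nrounds = 50)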

bradS