
When I use XGBRegressor to construct a boosted tree model from 8194 or fewer data points (i.e., n_train $\leq$ 8194, where n_train is defined in the code below) and randomly shuffle the data points before training, the fit method is order independent, meaning that it generates the same predictive model each time that it is called. However, when I do the same for 8195 data points, fit is order dependent -- it generates a different predictive model for each call. Why is this?

I have read this paper on XGBoost and nearly all of the XGBoost documentation, and the non-subsampling algorithms described in both appear to be order independent for all n_train. So the source of the order dependence for large-n_train datasets is the mysterious part.

Below is a minimal Python script that illustrates the issue.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

M = 2                   # number of models to compare
tree_method = 'approx'  # tree_method of XGBRegressor. Also try 'hist' and 'exact'.
n_disp = 5              # number of elements of y_test_pred[m] to display
np.set_printoptions(precision=5, linewidth=1000, suppress=True)

# ------------------------------------------------------------------------------------------

def main_func():

    for n_samples in [10243, 10244]:

        # Construct X and y
        X, y = make_regression(n_samples=n_samples)

        # Split X and y for training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        n_train = y_train.shape[0]

        # Train the models and use them to predict y_test
        model = M * [None]
        y_test_pred = M * [None]
        for m in range(M):
            model[m] = train_model(n_train, X_train, y_train, X_test, y_test, m)
            y_test_pred[m] = model[m].predict(X_test)
            print('---')
            print(f'n_train = {n_train}')
            print(f'y_test_pred[m][:{n_disp}] for m = {m}:')
            print(y_test_pred[m][:n_disp])

# ------------------------------------------------------------------------------------------

def train_model(n_train, X_train, y_train, X_test, y_test, m):

    # Permute X_train and y_train
    p = np.random.permutation(n_train)
    X_train = X_train[p]
    y_train = y_train[p]

    # Construct and train the model
    model = XGBRegressor(tree_method=tree_method, random_state=42)
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)

    return model

# ------------------------------------------------------------------------------------------

main_func()

One run of this code yields:

---
n_train = 8194
y_test_pred[m][:5] for m = 0:
[ 138.66483  -20.09365   62.82829 -136.29303 -120.78113]
---
n_train = 8194
y_test_pred[m][:5] for m = 1:
[ 138.66483  -20.09365   62.82829 -136.29303 -120.78113]
---
n_train = 8195
y_test_pred[m][:5] for m = 0:
[  20.70109 -125.59986 -140.2009    84.15887  -39.79109]
---
n_train = 8195
y_test_pred[m][:5] for m = 1:
[ -26.50723 -159.95743  -79.36356  108.11007  -38.723  ]

Note that for n_train = 8194, y_test_pred[m][:n_disp] is the same for all m, but for n_train = 8195 it is not.

Within the script, note that I permute the rows of X_train and y_train before each fit. I would expect this to have no effect on the resulting model, since, to my understanding, the feature values are sorted and binned near the start of the algorithm. However, if I comment out this permutation, the order dependence at large n_train disappears. Note also that, within the XGBRegressor call, tree_method can be set to 'approx', 'hist', or 'auto' and random_state can be fixed without eliminating the order dependence at large n_train.
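
To rule out display precision as the culprit, one can also compare the fitted trees themselves rather than a handful of predictions. The following self-contained sketch does this by comparing the text dumps of the two boosters via get_booster().get_dump(); the helper name trees_identical and the fixed seeds are my own additions, not part of the script above.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def trees_identical(n_samples, tree_method='approx'):
    # Fit two models on row-shuffled copies of the same training data and
    # compare the text dumps of their trees.
    X, y = make_regression(n_samples=n_samples, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    dumps = []
    for _ in range(2):
        p = np.random.permutation(y_train.shape[0])
        model = XGBRegressor(tree_method=tree_method, random_state=42)
        model.fit(X_train[p], y_train[p])
        dumps.append(model.get_booster().get_dump())
    return dumps[0] == dumps[1]

print(trees_identical(10243))  # n_train = 8194: expected to match, given the behaviour above
print(trees_identical(10244))  # n_train = 8195: expected to differ, given the behaviour above

If the dumps differ, the order dependence lies in the tree structures themselves, not merely in floating-point summation at prediction time.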

Finally, there are several comments in the XGBoost documentation that might initially seem relevant:

  • The online FAQ for XGBoost states that "Slightly different result between runs ... could happen, due to non-determinism in floating point summation order and multi-threading. Also, data partitioning changes by distributed framework can be an issue as well. Though the general accuracy will usually remain the same."
  • And the Python API Reference states that "Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm."

For various reasons, however, I suspect that these notes are either unrelated to or inadequate to explain the abrupt transition to order dependence that I have just described.

SapereAude

2 Answers


Partial answer.

Setting tree_method="exact", I get the same results across iterations, so there must be a source of randomness inside the weighted quantile sketches used by "approx" and "hist". Without specifying tree_method, the default "auto" selects "exact" for small datasets, and I suppose you've found the threshold (although that may depend on other things, such as the number of features).

But I'm not sure why the binning would have a random component?? And setting random_state doesn't even fix that.
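
As a quick check, a sweep over the three tree_method values along the lines of the asker's script (the seeds and variable names below are my own arbitrary choices) should reproduce the pattern described above: reshuffling the training rows between fits changes the predictions for "approx" and "hist", but not for "exact".

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=10244, random_state=0)   # n_train = 8195 after the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for tree_method in ('exact', 'approx', 'hist'):
    preds = []
    for _ in range(2):
        p = np.random.permutation(y_train.shape[0])        # reshuffle the training rows
        model = XGBRegressor(tree_method=tree_method, random_state=42)
        model.fit(X_train[p], y_train[p])
        preds.append(model.predict(X_test))
    print(tree_method, np.array_equal(preds[0], preds[1]))  # True only for 'exact' in my runs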

Ben Reiniger

An answer to this question was offered on the GitHub page for XGBoost:

"The quantile sketching works on stream of data and prunes the summary as more input comes in. In such case, prune results can be dependent on the arrival order of the data."

The specifics of the pruning algorithm are covered in the Supplementary Material of the XGBoost paper presented at KDD '16.
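
To get an intuition for why a streamed, pruned summary can depend on arrival order, here is a toy illustration. It is deliberately not the weighted quantile sketch from the paper, just a crude analogue with the same overall structure: buffer the incoming values and prune whenever the buffer overflows.

import numpy as np

def toy_streaming_summary(values, max_size=8):
    # Keep a sorted summary of at most max_size points. Whenever the buffer
    # overflows, prune it by keeping every other point -- a crude stand-in
    # for the rank-based pruning of a real quantile sketch.
    summary = []
    for v in values:
        summary.append(v)
        summary.sort()
        if len(summary) > max_size:
            summary = summary[::2]
    return np.array(summary)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

s1 = toy_streaming_summary(data)
s2 = toy_streaming_summary(rng.permutation(data))
print(np.array_equal(s1, s2))  # almost always False: the surviving points depend on arrival order

Because the prune is applied at data-dependent moments, which values survive depends on when they arrive relative to each prune, so two orderings of the same data generally leave different summaries and hence different split candidates.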

Additionally, a critical observation was made by maxaehle on the above GitHub page: "When only the first 8194 rows are permuted and the last row stays last, both outputs for 8195 seem to be the same."
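
A sketch of that experiment, in terms of the script in the question (the helper below is my own reconstruction of what maxaehle describes, not their code):

import numpy as np

def permute_keep_last(n_train):
    # Shuffle the first n_train - 1 rows; the last row stays last.
    p = np.random.permutation(n_train - 1)
    return np.concatenate([p, [n_train - 1]])

# Replacing
#     p = np.random.permutation(n_train)
# with
#     p = permute_keep_last(n_train)
# in train_model reportedly makes the two n_train = 8195 models agree again.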

Moreover, the threshold value $8194$ is quite close to $2^{13} = 8192$.

Thus, tentatively, it appears that there is a hard-coded threshold at n_train $= 8195$, at which the quantile sketching, or its pruning step, either becomes active or becomes strongly order dependent. The precise details remain unclear.

SapereAude