25

I am trying to run xgboost with scikit-learn, and I am only using pandas to load the data into a DataFrame. How am I supposed to use a pandas DataFrame with xgboost? I am confused by the DMatrix routine required to run the xgboost algorithm.

Ethan
Ghostintheshell

3 Answers

28

You can use the DataFrame's .values attribute to access the raw data as a NumPy array once you have manipulated the columns as you need them.

E.g.

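import pandas as pd
import xgboost as xgb

# Load the training data and separate the target from the features
train = pd.read_csv("train.csv")
target = train['target']
train = train.drop(['ID', 'target'], axis=1)

# Load the test data (it has no target column)
test = pd.read_csv("test.csv")
test = test.drop(['ID'], axis=1)

# Wrap the underlying NumPy arrays in DMatrix objects
xgtrain = xgb.DMatrix(train.values, label=target.values)
xgtest = xgb.DMatrix(test.values)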

Obviously you may need to change which columns you drop or use as the training target. The above was for a Kaggle competition, so there was no target data for xgtest (it is held back by the organisers).

Neil Slater
12

You can now use pandas DataFrames directly with XGBoost. This definitely works with xgboost 0.81.

For example, where X_train, X_val, y_train, and y_val are DataFrames:

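from math import sqrt

import xgboost as xgb
from sklearn.metrics import mean_squared_error

mod = xgb.XGBRegressor(
    gamma=1,
    learning_rate=0.01,
    max_depth=3,
    n_estimators=10000,
    subsample=0.8,
    random_state=34
)

# fit and predict accept DataFrames directly
mod.fit(X_train, y_train)
predictions = mod.predict(X_val)

# Root mean squared error on the validation set
rmse = sqrt(mean_squared_error(y_val, predictions))
print("score: {0:,.0f}".format(rmse))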

jeffhale
8

Good news: there is a library, pandas_ml, which supports XGBoost. It will probably streamline the workflow.

Ethan
user4959