
I'm using Kaggle's Titanic dataset with scikit-learn pipelines. I want to prune my decision tree, and for that I need cost_complexity_pruning_path. The last line of the code below raises: ValueError: could not convert string to float: 'male'. What am I doing wrong? I have looked at "Sklearn: applying cost complexity pruning along with pipeline", but that doesn't seem to help in my case.

cat_vars = ['Sex','Embarked']
num_vars = ['Age']

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('std_scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder())])

col_trans = ColumnTransformer([('numerical', num_pipe, num_vars), ('categorical', cat_pipe, cat_vars)], remainder='passthrough')

final_pipe = Pipeline([('column_trans', col_trans), ('tree', DecisionTreeClassifier(random_state=42))])
final_pipe.fit(X_train, y_train)

path = final_pipe.steps[1][1].cost_complexity_pruning_path(X_train, y_train)

user5744148

1 Answer


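cost_complexity_pruning_path fits the tree model on the data you pass to it before computing the pruning path (see the scikit-learn source). Calling it on the tree step directly bypasses the ColumnTransformer, so the raw strings in X_train (such as 'male') reach the tree and trigger the ValueError. You need to preprocess the data first, using every pipeline step except the final estimator. This should do it: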

X_preproc = final_pipe[:-1].transform(X_train)
path = final_pipe.steps[-1][1].cost_complexity_pruning_path(X_preproc, y_train)
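
For reference, here is a minimal end-to-end sketch of the same idea. The data frame is a small synthetic stand-in for the Titanic data (an assumption made so the snippet runs on its own), and picking the middle alpha is an arbitrary choice for illustration, not a recommended selection rule.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the Titanic training data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Embarked": rng.choice(["S", "C", "Q"], size=n),
    "Age": rng.uniform(1, 80, size=n),
})
X_train.loc[::10, "Age"] = np.nan  # simulate missing ages
# Noisy label loosely tied to Sex, purely for illustration.
y_train = ((X_train["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

cat_vars = ["Sex", "Embarked"]
num_vars = ["Age"]

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                     ("std_scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("ohe", OneHotEncoder())])
col_trans = ColumnTransformer([("numerical", num_pipe, num_vars),
                               ("categorical", cat_pipe, cat_vars)],
                              remainder="passthrough")
final_pipe = Pipeline([("column_trans", col_trans),
                       ("tree", DecisionTreeClassifier(random_state=42))])
final_pipe.fit(X_train, y_train)

# Apply every step except the final estimator, then compute the pruning path
# on the already-numeric features.
X_preproc = final_pipe[:-1].transform(X_train)
tree = final_pipe.named_steps["tree"]
path = tree.cost_complexity_pruning_path(X_preproc, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Refit a copy of the whole pipeline with one candidate alpha
# (the middle one here, chosen arbitrarily).
alpha = ccp_alphas[len(ccp_alphas) // 2]
pruned_pipe = clone(final_pipe).set_params(tree__ccp_alpha=alpha)
pruned_pipe.fit(X_train, y_train)

print("full tree nodes:  ", tree.tree_.node_count)
print("pruned tree nodes:", pruned_pipe.named_steps["tree"].tree_.node_count)
```

Once you have ccp_alphas, setting tree__ccp_alpha on the pipeline itself (for example inside a grid search) keeps the preprocessing and the pruned tree together, so you never have to pass raw data to the tree step directly.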
Ben Reiniger