cost-complexity-pruning-path with pipeline

Question

I'm using Kaggle's Titanic dataset with pipelines, and I'm trying to prune my decision tree, for which I need cost_complexity_pruning_path. The last line of the code below raises the error: ValueError: could not convert string to float: 'male'. Do you know what I'm doing wrong? I have looked at "Sklearn: applying cost complexity pruning along with pipeline", but that doesn't seem to help in my case.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

cat_vars = ['Sex', 'Embarked']
num_vars = ['Age']
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('std_scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder())])
col_trans = ColumnTransformer([('numerical', num_pipe, num_vars), ('categorical', cat_pipe, cat_vars)], remainder='passthrough')
final_pipe = Pipeline([('column_trans', col_trans), ('tree', DecisionTreeClassifier(random_state=42))])
final_pipe.fit(X_train, y_train)
# This line raises the ValueError:
path = final_pipe.steps[1][1].cost_complexity_pruning_path(X_train, y_train)

score 1 · Accepted Answer · answered Feb 23 '21 at 14:32

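cost_complexity_pruning_path does not reuse the tree that was already fitted inside the pipeline: it refits the tree model on the data you provide before doing the pruning (source). Passing the raw X_train therefore hands the untransformed string columns such as 'Sex' straight to the tree, which is where the ValueError comes from. You need to preprocess the data first, so this should do it: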

X_preproc = final_pipe[:-1].transform(X_train)
path = final_pipe.steps[-1][1].cost_complexity_pruning_path(X_preproc, y_train)
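
For reference, here is a minimal end-to-end sketch of the same idea, including one possible way to use the resulting alphas afterwards. It does not use the real Titanic data: the small random DataFrame is a hypothetical stand-in with the same three columns, and handle_unknown="ignore", the clipping of the alphas, and the GridSearchCV step are my own additions rather than part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the Titanic columns used in the question.
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Embarked": rng.choice(["S", "C", "Q"], size=n),
    "Age": rng.uniform(1, 80, size=n),
})
X_train.loc[::10, "Age"] = np.nan  # some missing ages for the imputer to fill
# Synthetic target: mostly determined by Sex, with about 20% of labels flipped.
y_train = ((X_train["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                     ("std_scaler", StandardScaler())])
# handle_unknown="ignore" is an assumption added so that cross-validation
# does not fail if a fold lacks one of the categories.
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))])
col_trans = ColumnTransformer([("numerical", num_pipe, ["Age"]),
                               ("categorical", cat_pipe, ["Sex", "Embarked"])])
final_pipe = Pipeline([("column_trans", col_trans),
                       ("tree", DecisionTreeClassifier(random_state=42))])
final_pipe.fit(X_train, y_train)

# Transform with every step except the tree, then ask the tree for the path.
X_preproc = final_pipe[:-1].transform(X_train)
path = final_pipe.steps[-1][1].cost_complexity_pruning_path(X_preproc, y_train)

# Clip at zero as a guard against tiny negative alphas from round-off.
ccp_alphas = np.clip(path.ccp_alphas, 0, None)

# Tune ccp_alpha through the whole pipeline, so the preprocessing is
# refit on each training fold instead of on all of X_train.
search = GridSearchCV(final_pipe, {"tree__ccp_alpha": ccp_alphas}, cv=5)
search.fit(X_train, y_train)
best_alpha = search.best_params_["tree__ccp_alpha"]
```

The candidate alphas here still come from a path computed on the full training set; only the final choice among them is cross-validated, which is usually acceptable but worth knowing.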
