I'm hoping someone can help me think through this. I've come across a lot of different resources on nested CV, but I'm still confused about how to handle model selection and how to construct appropriate confidence intervals for the training process.
I'm trying to train a binary classifier. I have a small dataset of just over 220 samples, 58 of which have the outcome of interest. Ideally, I'd also like to report a confidence interval on my training performance to get a more reliable estimate of model performance before evaluating on my held-out test set.
I've currently split my data into training and test sets (80/20) and have been running nested CV on the training set, with 5 outer folds and 3 inner folds. My rationale for nested CV is to avoid the optimistic bias that a standard 5-fold CV approach can introduce when tuning/selecting a model, especially given my small sample size, which I suspect is heavily sensitive to the random splits of the data (the 58 cases are heterogeneous among themselves; it's a hard prediction task).
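For context, the initial 80/20 hold-out is a stratified split along these lines (X_full/y_full and the seed are just placeholders, and the stratify argument reflects my assumption that the class imbalance should be preserved in the hold-out):

from sklearn.model_selection import train_test_split

# X_full / y_full stand in for the complete dataset (~220 samples, 58 positives)
X_mod, X_holdout, y, y_holdout = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=1839
)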
I'm wondering if anyone could comment on whether I'm constructing my nested CV pipeline correctly, and whether my approach of bootstrapping confidence intervals to estimate model performance is sound. Additionally, my pipeline performs feature selection, which may cause each outer fold's 'best' model to use different features (see the snippet at the end for how I'd inspect this). Is it still appropriate to average the results and bootstrap confidence intervals?
Currently, my code looks something like this:
# Imports
import numpy as np
from scipy.stats import randint, uniform

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, f1_score, log_loss, make_scorer,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline  # so SMOTE can sit inside the pipeline

# Setting up some helper functions
cv_tune = StratifiedKFold(n_splits=5, shuffle=True, random_state=1839)
scorer = {'AUC': 'roc_auc',
          'Precision': make_scorer(precision_score, zero_division=0),
          'Recall': 'recall',
          'Accuracy': 'accuracy',
          'log-loss': 'neg_log_loss',
          'F1': make_scorer(f1_score, average='binary')}
# Wrapper around mutual_info_classif so feature selection uses a fixed seed,
# used below as SelectKBest(score_func=mutual_info_seed)
def mutual_info_seed(X, y):
    return mutual_info_classif(X, y, random_state=0)
# SMOTE for oversampling the minority class
smt = SMOTE(random_state=42)
# Create transformers for each type of feature
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Combine transformers into a ColumnTransformer
step_impute_scale = ('scaler', ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, predictors_to_scale)
    ],
    remainder='passthrough'  # this leaves the other columns unchanged
))
# Initiate imputer for missing values
simple_imputer = SimpleImputer(strategy='median')
def bootstrap_ci(scores, n_bootstrap=1000, ci=95):
    """Compute a percentile bootstrap confidence interval for the mean of `scores`."""
    bootstrapped_scores = []
    n = len(scores)
    for _ in range(n_bootstrap):
        resample = np.random.choice(scores, size=n, replace=True)
        bootstrapped_scores.append(np.mean(resample))
    lower = np.percentile(bootstrapped_scores, (100 - ci) / 2)
    upper = np.percentile(bootstrapped_scores, 100 - (100 - ci) / 2)
    return np.mean(scores), lower, upper
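Just to be concrete about what that helper returns, here's a toy call (the scores are made-up values for illustration, not my results):

# Toy example only; these are not real fold scores
example_fold_aucs = np.array([0.68, 0.74, 0.71, 0.65, 0.77])
mean_auc, auc_lo, auc_hi = bootstrap_ci(example_fold_aucs)
print(f"AUC: {mean_auc:.3f} (95% CI {auc_lo:.3f}-{auc_hi:.3f})")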
# For running the model: instantiate the random forest
rf = RandomForestClassifier(random_state=1725)

# Set up pipeline for data pre-processing, feature selection, and SMOTE.
# imblearn's Pipeline is used so the SMOTE step is applied only to the training folds.
pipeline = ImbPipeline(steps=[
    ('transform_columns', ColumnTransformer([('imputer', simple_imputer, predictors_to_scale)],
                                            remainder='passthrough')),
    ('variance_selection', VarianceThreshold()),
    ('selectk', SelectKBest(score_func=mutual_info_seed)),
    ('smote', smt),
    ('classifier', rf)
])
# Define the parameter distributions
param_distributions = {
    'classifier__n_estimators': randint(100, 1001),        # n_estimators between 100 and 1000
    'classifier__max_depth': randint(2, 10),               # max_depth between 2 and 9
    'classifier__min_samples_split': randint(2, 6),         # min_samples_split between 2 and 5
    'classifier__min_samples_leaf': randint(2, 6),          # min_samples_leaf between 2 and 5
    'classifier__criterion': ['gini', 'entropy'],           # gini or entropy
    'smote__k_neighbors': randint(1, 10),                   # k_neighbors for SMOTE between 1 and 9
    'selectk__k': randint(5, 15),                           # k for feature selection between 5 and 14
    'variance_selection__threshold': uniform(loc=0, scale=0.3),
}
# Outer cross-validation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1839)

# To store scores for each metric
outer_scores = {metric: [] for metric in scorer.keys()}
# Perform nested cross-validation
for train_idx, test_idx in outer_cv.split(X_mod, y):
    X_train, X_test = X_mod.iloc[train_idx], X_mod.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Inner loop: 3-fold CV inside the random search, refit on F1
    search = RandomizedSearchCV(pipeline, param_distributions, scoring=scorer, cv=3,
                                n_iter=250, n_jobs=-1, error_score='raise',
                                refit='F1', random_state=0)
    search.fit(X_train, y_train)

    # Evaluate on the outer test set for all metrics
    y_pred = search.predict(X_test)
    y_prob = search.predict_proba(X_test)[:, 1] if hasattr(search, "predict_proba") else None
    for metric, scorer_fn in scorer.items():
        if metric == "AUC" and y_prob is not None:
            score = roc_auc_score(y_test, y_prob)
        elif metric == "log-loss" and y_prob is not None:
            score = -log_loss(y_test, y_prob)  # negated to match 'neg_log_loss'
        elif metric == "Recall":
            score = recall_score(y_test, y_pred)
        elif callable(scorer_fn):
            # Precision / F1 via the scorer's underlying metric function
            score = scorer_fn._score_func(y_test, y_pred)
        else:
            score = accuracy_score(y_test, y_pred)  # 'Accuracy'
        outer_scores[metric].append(score)
# Compute mean and bootstrap confidence intervals for each metric
results_summary = {}
for metric, scores in outer_scores.items():
    mean_score, ci_lower, ci_upper = bootstrap_ci(scores)
    results_summary[metric] = {
        'mean': mean_score,
        '95% CI': (ci_lower, ci_upper)
    }
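Regarding the feature-selection question above, my plan was to log which features each outer fold's tuned model keeps, roughly like this (collected inside the outer loop right after search.fit; get_feature_names_out() support depends on the sklearn version, so treat this as a sketch):

# Sketch: inspect which features each outer fold's best pipeline selected
best_pipe = search.best_estimator_
# Names of the columns entering SelectKBest (after the ColumnTransformer and VarianceThreshold)
feature_names = best_pipe[:2].get_feature_names_out()
mask = best_pipe.named_steps['selectk'].get_support()
selected_features = list(feature_names[mask])
print(selected_features)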