
I'm hoping someone can help me think through this. I've come across a lot of different resources on nested CV, but I'm still confused about how to go about model selection and how to construct appropriate confidence intervals for the training process.

I'm trying to train a binary machine learning classifier. I have a small dataset of just over 220 samples, 58 of which have the outcome of interest. Ideally, I'd also like to report a confidence interval around my cross-validated training results, to get a more reliable estimate of model performance before evaluating on my held-out test set.

I've currently split my data into training and test sets (80/20) and have been running nested CV on the training set. The outer and inner loops have 5 and 3 folds, respectively. My rationale for a nested CV approach is to avoid optimistically biasing my training results, as can happen with the standard 5-fold CV approach when tuning/selecting a model, especially given my small sample size, which I suspect makes results heavily sensitive to the random splits of my data (the 58 cases are heterogeneous among themselves; it's a hard prediction task).

I'm wondering if anyone could comment on whether I'm constructing my nested CV pipeline correctly and whether my approach of bootstrapping confidence intervals to estimate the performance of my models is sound. Additionally, my pipeline performs feature selection, which may cause each of the resulting 'best' models to use different features. Is it still appropriate to average results and bootstrap confidence intervals?

Currently, my code looks something like this:

Setting up some helper functions

    # imports for the snippets below; SMOTE requires the imbalanced-learn Pipeline
    # so that oversampling is applied only to the training folds
    import numpy as np
    from scipy.stats import randint, uniform
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
    from sklearn.impute import SimpleImputer
    from sklearn.metrics import (accuracy_score, f1_score, get_scorer, log_loss,
                                 make_scorer, precision_score, roc_auc_score)
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    cv_tune = StratifiedKFold(n_splits=5, shuffle=True, random_state=1839)

    scorer = {
        'AUC': 'roc_auc',
        'Precision': make_scorer(precision_score, zero_division=0),
        'Recall': 'recall',
        'Accuracy': 'accuracy',
        'log-loss': 'neg_log_loss',
        'F1': make_scorer(f1_score, average='binary'),
    }

    # wrapping mutual_info_classif lets us set a seed; used below as
    # SelectKBest(score_func=mutual_info_seed)
    def mutual_info_seed(X, y):
        return mutual_info_classif(X, y, random_state=0)

    # SMOTE for oversampling the minority class
    smt = SMOTE(random_state=42)

    # transformer for the numeric features
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])

    # combine transformers into a ColumnTransformer
    step_impute_scale = ('scaler', ColumnTransformer(
        transformers=[('num', numeric_transformer, predictors_to_scale)],
        remainder='passthrough',   # this leaves the other columns unchanged
    ))

    # imputer for missing values
    simple_imputer = SimpleImputer(strategy='median')

    def bootstrap_ci(scores, n_bootstrap=1000, ci=95):
        """Compute a bootstrap confidence interval for the mean of the fold scores."""
        bootstrapped_scores = []
        n = len(scores)
        for _ in range(n_bootstrap):
            resample = np.random.choice(scores, size=n, replace=True)
            bootstrapped_scores.append(np.mean(resample))
        lower = np.percentile(bootstrapped_scores, (100 - ci) / 2)
        upper = np.percentile(bootstrapped_scores, 100 - (100 - ci) / 2)
        return np.mean(scores), lower, upper

For running the model

    # instantiate the random forest
    rf = RandomForestClassifier(random_state=1725)

    # pipeline for data pre-processing, feature selection, SMOTE, and the classifier
    pipeline = Pipeline(steps=[
        ('transform_columns', ColumnTransformer(
            [('imputer', simple_imputer, predictors_to_scale)],
            remainder='passthrough')),
        ('variance_selection', VarianceThreshold()),
        ('selectk', SelectKBest(score_func=mutual_info_seed)),
        ('smote', smt),
        ('classifier', rf),
    ])

    # define the parameter distributions
    param_distributions = {
        'classifier__n_estimators': randint(100, 1001),        # n_estimators between 100 and 1000
        'classifier__max_depth': randint(2, 10),               # max_depth between 2 and 9
        'classifier__min_samples_split': randint(2, 6),        # min_samples_split between 2 and 5
        'classifier__min_samples_leaf': randint(2, 6),         # min_samples_leaf between 2 and 5
        'classifier__criterion': ['gini', 'entropy'],          # split criterion
        'smote__k_neighbors': randint(1, 10),                  # k_neighbors for SMOTE between 1 and 9
        'selectk__k': randint(5, 15),                          # k for feature selection between 5 and 14
        'variance_selection__threshold': uniform(loc=0, scale=0.3),
    }

Outer cross-validation

    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1839)

    # to store scores for each metric
    outer_scores = {metric: [] for metric in scorer.keys()}

    # perform nested cross-validation
    for train_idx, test_idx in outer_cv.split(X_mod, y):
        X_train, X_test = X_mod.iloc[train_idx], X_mod.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # inner loop: 3-fold randomised search, refit on the best F1 configuration
        search = RandomizedSearchCV(pipeline, param_distributions, scoring=scorer,
                                    cv=3, n_iter=250, n_jobs=-1, error_score='raise',
                                    refit='F1', random_state=0)
        search.fit(X_train, y_train)

        # evaluate on the outer test set for all metrics
        y_pred = search.predict(X_test)
        y_prob = search.predict_proba(X_test)[:, 1] if hasattr(search, "predict_proba") else None

        for metric, scorer_fn in scorer.items():
            if metric == "AUC" and y_prob is not None:
                score = roc_auc_score(y_test, y_prob)
            elif metric == "log-loss" and y_prob is not None:
                score = -log_loss(y_test, y_prob)
            elif callable(scorer_fn):
                # make_scorer objects are called as scorer(estimator, X, y)
                score = scorer_fn(search, X_test, y_test)
            else:
                # resolve string scorers ('recall', 'accuracy') rather than
                # silently falling back to accuracy for every string metric
                score = get_scorer(scorer_fn)(search, X_test, y_test)
            outer_scores[metric].append(score)

Compute mean and bootstrap confidence intervals for each metric

    results_summary = {}
    for metric, scores in outer_scores.items():
        mean_score, ci_lower, ci_upper = bootstrap_ci(scores)
        results_summary[metric] = {'mean': mean_score, '95% CI': (ci_lower, ci_upper)}

1 Answer


Introduction

Training a binary classifier on a small and imbalanced dataset (220 samples, 58 positives) poses some challenges in ensuring robust model evaluation and generalisation. This response will address:

  1. Correctness of your nested CV pipeline.
  2. Appropriateness of bootstrap confidence intervals for performance metrics.
  3. Implications of feature selection within folds.
  4. Additional recommendations for improvement.

1. Correctness of Nested CV Pipeline

Your nested CV setup involves:

  • Outer CV: 5-fold stratified CV to estimate generalisation.
  • Inner CV: 3-fold stratified CV for hyperparameter tuning and feature selection.

Why Nested CV?

Nested CV mitigates over-optimism in performance estimation by separating:

  1. Model selection (inner loop) from model evaluation (outer loop).
  2. Feature selection and oversampling (e.g., SMOTE) from test-set predictions, though see the links further below regarding class imbalance.

This ensures (a compact sketch of the structure follows this list):

  • No data leakage.
  • Evaluation reflects the model's performance on truly unseen data.
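
For concreteness, the same outer/inner structure can be written very compactly by passing the tuned search object to cross_val_score. This is only a sketch of the idea, reusing the pipeline, param_distributions, X_mod, and y names from your code:

    # minimal nested-CV sketch: the inner 3-fold RandomizedSearchCV is itself evaluated
    # by the outer 5-fold loop, so hyperparameter tuning never sees the outer test folds
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

    inner_search = RandomizedSearchCV(pipeline, param_distributions, cv=3,
                                      scoring='f1', n_iter=250, random_state=0)
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1839)
    nested_f1 = cross_val_score(inner_search, X_mod, y, cv=outer_cv, scoring='f1', n_jobs=-1)
    print(nested_f1.mean(), nested_f1.std(ddof=1))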

Mathematical Framework

Nested CV performance can be formalised as: $$ \text{Performance} = \frac{1}{K} \sum_{k=1}^{K} \text{Metric}(y_{\text{test}, k}, f_{\text{train}, k}(x_{\text{test}, k})) $$ where:

  • $K$: Number of outer folds.
  • $y_{\text{test}, k}$: True outcomes for fold $k$.
  • $f_{\text{train}, k}$: Model trained on the $k$th training set.

Assessment

Your nested CV pipeline is well-structured:

  • Stratified Splits: Maintains class balance in folds.
  • Pipeline Integration: Includes preprocessing, feature selection, and SMOTE.
  • Reproducibility: Fixed seeds ensure replicability.

2. Confidence Intervals via Bootstrapping

Bootstrapping is appropriate for estimating variability of metrics like $AUC$, $F1$, etc., by resampling outer CV scores. Your implementation:

  • Resamples with replacement from the outer fold scores.
  • Computes means and percentile-based confidence intervals (CIs); a cross-check with SciPy's built-in bootstrap is sketched after this list.
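
If you want to sanity-check the hand-rolled percentile interval, SciPy's bootstrap routine computes the same kind of interval. The snippet below is a sketch applied to the outer-fold AUC scores from your outer_scores dict (method='percentile' mirrors your function, while the default BCa correction can behave better for skewed scores):

    # cross-check of the percentile bootstrap using scipy.stats.bootstrap
    import numpy as np
    from scipy.stats import bootstrap

    scores = np.asarray(outer_scores['AUC'])        # the five outer-fold scores
    res = bootstrap((scores,), np.mean, n_resamples=1000,
                    confidence_level=0.95, method='percentile', random_state=0)
    print(scores.mean(), res.confidence_interval.low, res.confidence_interval.high)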

Limitations

  1. Small Sample Bias: With only five outer-fold scores to resample from, the bootstrap distribution can take on only a handful of distinct values and will tend to understate the true variability.
  2. Distributional Assumptions: The fold scores need not be symmetrically or normally distributed, which affects the reliability of percentile intervals as well as the t-based intervals suggested below.

Recommendations

  • T-distribution CIs: For small samples, compute (a short sketch follows this list): $$ CI = \bar{x} \pm t_{n-1} \cdot \frac{s}{\sqrt{n}} $$ where:

    • $\bar{x}$: Mean score.
    • $t_{n-1}$: Critical value from the t-distribution.
    • $s$: Standard deviation of scores.
    • $n$: Number of folds.
  • Repeat Nested CV: Run the entire pipeline with different random seeds to assess variability.
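
A minimal sketch of the t-based interval, again assuming outer_scores holds the five outer-fold scores per metric as in your code:

    # t-based 95% CI on the outer-fold scores
    import numpy as np
    from scipy import stats

    scores = np.asarray(outer_scores['AUC'])
    n = len(scores)
    t_crit = stats.t.ppf(0.975, df=n - 1)                    # two-sided 95% critical value
    half_width = t_crit * scores.std(ddof=1) / np.sqrt(n)
    ci = (scores.mean() - half_width, scores.mean() + half_width)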


3. Implications of Feature Selection Across Folds

Feature selection in the inner loop ensures no data leakage but may result in different features across folds. This can:

  • Reduce interpretability if the "best" model changes features from fold to fold.
  • Raise questions about the stability of the selected feature set when generalising to new data.

Recommendations

  • Feature Stability Analysis: Report selection frequencies across folds (a sketch follows this list). Features consistently selected (e.g., >70% frequency) are more reliable.
  • Alternative Methods: Use permutation importance or SHAP values for feature relevance independent of splits.
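
One way to get those frequencies is to record, inside your existing outer loop, which features survive SelectKBest in each fold. The snippet below is illustrative: selected_per_fold is a new name, and it relies on Pipeline slicing plus get_feature_names_out, which needs a reasonably recent scikit-learn:

    # selected_per_fold = [] should be initialised before the outer loop
    # inside the outer loop, after search.fit(X_train, y_train):
    best = search.best_estimator_
    # the first three steps end at SelectKBest, so their output names are the selected features
    selected_per_fold.append(set(best[:3].get_feature_names_out()))

    # after the outer loop: selection frequency per feature
    from collections import Counter
    counts = Counter(f for fold in selected_per_fold for f in fold)
    n_folds = len(selected_per_fold)
    stable_features = {f: c / n_folds for f, c in counts.items() if c / n_folds > 0.7}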

Averaging Results:

  • It is appropriate to average performance metrics across folds, as the focus is on predictive performance rather than fixed feature sets. However, provide context on feature variability when interpreting results.

4. Recommendations for Pipeline Improvement

a. Class Imbalance

SMOTE mitigates class imbalance, but alternative methods include:

  • Balanced Random Forest: Draws class-balanced bootstrap samples for each tree.
  • Cost-sensitive Loss Functions: Penalise misclassification of the minority class more heavily (a brief sketch of both options follows the links below). However, I would like to point out that the "class imbalance problem" is not at all the big problem that it is sometimes made out to be. See the following two threads over at Cross Validated:

What is the root cause of the class imbalance problem?
When is unbalanced data really a problem in Machine Learning?
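
For completeness, a brief sketch of the two alternatives named above, reusing your random seed; BalancedRandomForestClassifier comes from imbalanced-learn, and class_weight is scikit-learn's built-in cost-sensitive option. Whether either helps in practice is exactly the question discussed in the threads linked above:

    # option 1: cost-sensitive weighting instead of SMOTE (drop the 'smote' step if used)
    from sklearn.ensemble import RandomForestClassifier
    rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=1725)

    # option 2: balanced bootstrap samples per tree, from imbalanced-learn
    from imblearn.ensemble import BalancedRandomForestClassifier
    rf_balanced = BalancedRandomForestClassifier(random_state=1725)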

b. Hyperparameter Tuning

Your use of RandomizedSearchCV is efficient. For small datasets, consider:

  • Bayesian Optimisation: Smarter hyperparameter search (e.g., Optuna); a minimal sketch follows.
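
The sketch below shows what an Optuna objective could look like for your pipeline; the search space is a hypothetical subset of your param_distributions, and the F1-based inner 3-fold score mirrors refit='F1':

    # minimal Optuna sketch (assumes pipeline, X_train, y_train as in the question)
    import optuna
    from sklearn.base import clone
    from sklearn.model_selection import cross_val_score

    def objective(trial):
        params = {
            'classifier__n_estimators': trial.suggest_int('n_estimators', 100, 1000),
            'classifier__max_depth': trial.suggest_int('max_depth', 2, 9),
            'selectk__k': trial.suggest_int('k', 5, 14),
        }
        model = clone(pipeline).set_params(**params)
        return cross_val_score(model, X_train, y_train, cv=3, scoring='f1').mean()

    study = optuna.create_study(direction='maximize',
                                sampler=optuna.samplers.TPESampler(seed=0))
    study.optimize(objective, n_trials=100)
    print(study.best_params)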

c. Robustness Testing

Repeat the nested CV process with different random seeds to assess performance consistency. Compare metrics across iterations to validate robustness.
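
A sketch of this repetition, using the compact nested-CV form from Section 1 (n_iter is reduced here only to keep the repetitions affordable; pipeline, param_distributions, X_mod, and y are as in your code):

    # repeat nested CV under different seeds to gauge the stability of the estimate
    import numpy as np
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

    repeat_means = []
    for seed in range(5):
        inner = RandomizedSearchCV(pipeline, param_distributions, cv=3,
                                   scoring='f1', n_iter=50, random_state=seed)
        outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        scores = cross_val_score(inner, X_mod, y, cv=outer, scoring='f1', n_jobs=-1)
        repeat_means.append(scores.mean())

    print(np.mean(repeat_means), np.std(repeat_means, ddof=1))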

d. Metric Selection

Focus on metrics suited for imbalanced datasets:

  • Precision-Recall AUC: Often more informative than ROC AUC when positives are rare; it can be added directly to your scorer dict (see below).
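
For example, average_precision_score is scikit-learn's standard PR-AUC summary, and 'average_precision' is the corresponding scoring string:

    # PR AUC from the outer-fold predicted probabilities
    from sklearn.metrics import average_precision_score
    pr_auc = average_precision_score(y_test, y_prob)

    # or as an extra entry in the scorer dict passed to RandomizedSearchCV
    scorer['PR-AUC'] = 'average_precision'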

5. Results Reporting

Example Table

    Metric       Mean Score   95% CI (Bootstrap)
    AUC          0.85         (0.81, 0.89)
    Precision    0.75         (0.71, 0.79)
    Recall       0.68         (0.64, 0.72)
    F1 Score     0.71         (0.68, 0.74)
    Accuracy     0.80         (0.77, 0.83)
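
This table can be generated directly from the results_summary dict your code already builds (the numbers above are purely illustrative):

    # turn results_summary into a reporting table
    import pandas as pd

    report = pd.DataFrame({
        metric: {'Mean Score': round(v['mean'], 2),
                 '95% CI (Bootstrap)': tuple(round(x, 2) for x in v['95% CI'])}
        for metric, v in results_summary.items()
    }).T
    print(report)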

Interpretation:

  • Include the number of outer folds ($n=5$) and repetitions for context.
  • Link performance metrics back to business objectives or research questions.

Conclusion

Your implementation demonstrates a good approach to model evaluation. Addressing feature stability, exploring alternative class imbalance methods (though please do read the two links I gave above), and repeating the process for robustness will enhance the reliability of the results. Confidence intervals provide valuable context, but be cautious of their limitations with small datasets.

Robert Long