Below I outline informal definitions of bias and variance used for assessing ML models (based on this book).
Compute the error on both the training set and the validation set. The training-set error can be interpreted as the model's bias. The variance is how much higher the validation error is than the training error.
$\text{bias} := \text{train set error}$
$\text{variance} := \text{val set error} - \text{bias}$
We don't expect a model to perform better on the validation set than on the training set, thus $\text{variance} \geq 0$ (a validation error curve will not cross or go below the training error curve).
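As a minimal sketch, the two definitions in code (the error values below are hypothetical, just to show the arithmetic):
```
def bias_variance(train_error, val_error):
    # bias: the training-set error itself
    # variance: how much worse the validation error is than the training error
    bias = train_error
    variance = val_error - bias
    return bias, variance

# Hypothetical error rates, for illustration only
print(bias_variance(train_error=0.05, val_error=0.12))  # -> (0.05, 0.07)
```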
Using these definitions, the various scenarios are:
Overfitting: low bias, high variance
- Often because the model is simply memorising the training set (low bias), and consequently failing to generalise to the validation set (high variance).
Underfitting: high bias
- Often this is because the model is too simple (high bias, low variance). It could also be a complex model of the wrong type (high bias, high variance).
Adequately fitted (neither underfitting nor overfitting): low bias, low variance
- This is the desired operating point. The model is complex enough to model the data well (low bias), and in a manner that generalises to new data (low variance).
By "low" and "high", I mean relative to your target error rate. Having $variance>>bias$ might seem like overfitting because of the large train-validation gap, but I wouldn't call it overfitting if the validation score is nonetheless good and within spec. In other words, I am judging error rates relative to the desired error rate, rather than on simply the gap between the train and validation scores.
Example 1: using a single trained model (no CV)
Fitting a logistic regression model on a binary classification problem.
Results:
train loss: 0.173 | val loss: 0.176
bias: 0.173 | variance: 0.003 | variance:bias ratio is 0.017
The model is high-bias (train error rate of 17%) and low-variance (0.3%), characteristic of an underfitting model.
```
import numpy as np

# Data for testing
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10_000, random_state=0)

# Split data. Just a train-validation set for this demo.
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0, test_size=0.25, stratify=y)

# Fit model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=np.random.RandomState(0)).fit(X_train, y_train)

# Compute losses/error rates and report
use_brier_score = False
if use_brier_score:
    from sklearn.metrics import brier_score_loss
    train_proba = model.predict_proba(X_train)[:, 1]
    val_proba = model.predict_proba(X_val)[:, 1]
    train_loss = brier_score_loss(y_train, train_proba)
    val_loss = brier_score_loss(y_val, val_proba)
else:
    # Use accuracy, and calculate the error rate
    train_loss = 1 - model.score(X_train, y_train)
    val_loss = 1 - model.score(X_val, y_val)

# To bias and variance
bias = train_loss
variance = val_loss - bias

print('train loss: %.3f' % train_loss, '| val loss: %.3f' % val_loss)
print('bias: %.3f' % bias, '| variance: %.3f' % variance, end=' ')
print('| variance:bias ratio is %.3f' % (variance / bias))
```
Example 2: CV
Running 5-fold stratified CV: rather than evaluating a single fitted model, we fit a model on each of 5 different splits and average the results.
Accuracy (%)
trn: 82.7
val: 82.6
Error rate (%)
trn: 17.28
val: 17.40
bias: 17.28 | variance: 0.12 | variance:bias ratio=0.007
```
import numpy as np

# Data for testing
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10_000, random_state=0)

# Split off a test set (not used here)
from sklearn.model_selection import train_test_split
X_cv, X_test, y_cv, y_test = train_test_split(X, y, random_state=0, test_size=0.25, stratify=y)

# Run cross-validation on the remaining data
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Uses 5-fold stratified CV and "accuracy" (the defaults for binary/multiclass y)
np.random.seed(0)
cv_results = cross_validate(
    LogisticRegression(), X_cv, y_cv,
    return_train_score=True,
    # scoring='accuracy',
    # cv=5,
)
train_acc = cv_results['train_score'].mean() * 100
val_acc = cv_results['test_score'].mean() * 100
# Could also extract std, confidence intervals, median/IQR, etc.

# Compute error rates and report
train_error = 100 - train_acc
val_error = 100 - val_acc

# To bias and variance
bias = train_error
variance = val_error - bias

print('Accuracy (%)')
print(' trn: %.1f' % train_acc, '\n val: %.1f' % val_acc)
print('\nError rate (%)')
print(' trn: %.2f' % train_error, '\n val: %.2f' % val_error)
print(
    '\nbias: %.2f' % bias, '| variance: %.2f' % variance,
    '| variance:bias ratio=%.3f' % (variance / bias)
)
```
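Since cross_validate returns the per-fold scores (as noted in the comment above), reporting the fold-to-fold spread is a one-liner, e.g. continuing from the snippet above:
```
print('val accuracy per fold (%):', np.round(cv_results['test_score'] * 100, 2))
print('val accuracy std (%): %.2f' % (cv_results['test_score'].std() * 100))
```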