Machine Learning #5: Model Evaluation
Model Diagnostics
Improving a model with high errors on both the training and test sets:
- Try a more flexible or complex model
- Construct new features
Improving a model with low training error but high test error (overfitting):
- Feature selection
- Bootstrap aggregation
- Try a simpler model
- Regularization
- Gather more data
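As a minimal sketch of one remedy from the list above (regularization), assuming the data has already been split into training and test arrays, ridge regression adds an L2 penalty that shrinks the coefficients to reduce variance:
from sklearn.linear_model import LinearRegression, Ridge
# X_train, X_test, y_train, y_test are assumed to exist already
plain = LinearRegression().fit(X_train, y_train)
regularized = Ridge(alpha=1.0).fit(X_train, y_train)   # alpha is an illustrative penalty strength
# Compare test-set R^2; the regularized model should generalize better if the plain model overfits
print(plain.score(X_test, y_test), regularized.score(X_test, y_test))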
Learning Curves
Learning curves plot error against training set size. They’re used to diagnose underfitting or overfitting.
An underfit model has a high training error and a high test error

An overfit model has a low training error but a high test error

A perfect model will have low training and test error.
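A minimal sketch of how a learning curve could be produced with scikit-learn's learning_curve helper; the estimator and the data (X, y) are placeholders:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
# X, y are assumed to be an existing feature matrix and target vector
sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring='neg_mean_squared_error')
fig, ax = plt.subplots()
ax.plot(sizes, -train_scores.mean(axis=1), label='training error')
ax.plot(sizes, -test_scores.mean(axis=1), label='test error')
ax.set_xlabel('Training set size')
ax.set_ylabel('MSE')
ax.legend()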
Evaluating Regression Models
Evaluating regression models usually involves a form of loss function.
Error | Equation | Description |
---|---|---|
Mean absolute error | $\frac 1N \sum _{i=1}^N |y^{(i)} - \hat y^{(i)}|$ | Pure magnitude of error |
Mean Squared | $\frac 1N \sum _{i=1}^N (y^{(i)} - \hat y^{(i)})^2$ | Measure of error magnitude with extra emphasis on large errors |
Root mean squared | $\sqrt{\frac 1N \sum _{i=1}^N (y^{(i)} - \hat y^{(i)})^2}$ | Error magnitude in the same units as the target, with extra emphasis on large errors |
$R^2$ | $1 - \frac{RSS}{TSS}$ | How well the model performs compared to a constant model |
RSS/TSS | $RSS = \sum _{i=1}^N (y^{(i)} - \hat y^{(i)})^2$, $TSS = \sum _{i=1}^N (y^{(i)} - \bar y)^2$ where $\bar y$ is the mean of the observed values | Residual sum of squares and total sum of squares |
MAPE | $\frac{100}{N} \sum ^N_{i=1}\left|\frac{y^{(i)} - \hat y^{(i)}}{y^{(i)}}\right|$ | Mean absolute percentage error |
SMAPE | $\frac{100}{N} \sum ^N_{i=1}\frac{2|y^{(i)} - \hat y^{(i)}|}{|y^{(i)}| + |\hat y^{(i)}|}$ | Symmetric mean absolute percentage error |
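A sketch of how these metrics might be computed, assuming numpy arrays y_true and y_pred of actual and predicted values:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))   # undefined if y_true contains zeros
smape = 100 * np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))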
Evaluating Classifiers
Classification accuracy gives a measure of performance but does not reveal the types of errors that are being made.
Confusion Matrix
Shows whether misclassifications are false positives or false negatives.
Predicted | Actual 0 | Actual 1 |
---|---|---|
0 | TN | FN |
1 | FP | TP |
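A sketch using scikit-learn's confusion_matrix, assuming arrays y_true and y_pred of actual and predicted class labels; note that scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the table above:
from sklearn.metrics import confusion_matrix
# For binary labels (0/1), ravel() unpacks the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()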
Depending on the application it may be more important to put emphasis on a particular type of error. For example, false negatives are especially important to avoid when diagnosing a serious illness.
Given a model that predicts the probability of an observation being positive, we can choose a cutoff c such that we predict positive if $\hat f(x) \geq c$ and negative otherwise.
A lower c value will increase the TRUE POSITIVE rate but FALSE POSITIVES will also increase.
A higher c value will increase the TRUE NEGATIVE rate but FALSE NEGATIVES will also increase.
True Positive Rate (TPR)
\[TPR = \frac{TP}{P} = \frac{TP}{TP + FN}\]
If c = 0, every observation is predicted positive, so TPR = 1 and TNR = 0.
True Negative Rate (TNR)
\[TNR = \frac{TN}{N} = \frac{TN}{TN + FP}\]
If c = 1, (almost) every observation is predicted negative, so TNR = 1 and TPR = 0.
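A small sketch of computing TPR and TNR at a chosen cutoff, assuming numpy arrays y_true (actual labels, 0/1) and y_prob (predicted probabilities of the positive class); the cutoff value is illustrative:
import numpy as np
c = 0.5
y_pred = (y_prob >= c).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)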
ROC Curve
ROC (Receiver Operating Characteristic) curves show how well a classifier separates the positive and negative classes and the trade-off between the TPR and the TNR as the cutoff c is varied. The x-axis is $1-TNR$ (the false positive rate) and the y-axis is $TPR$; a curve that hugs the upper-left corner represents a better classifier.
The area under the curve is called the ROC AUC; it measures how well the model ranks positive observations above negative ones. An ROC AUC of 1 means a perfect classifier: some cutoff c exists that perfectly separates the positive and negative classes. An ROC AUC of 0.5 is equivalent to random guessing.
Code
The ROC curve and ROC AUC can be computed with scikit-learn's roc_curve and roc_auc_score functions respectively.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
Assuming a fitted logistic regression model
# Predicted probability of the positive class for each test observation
y_prob = model.predict_proba(test[['inputs']])[:,1]
# False positive rates, true positive rates and the cutoffs they correspond to
fpr, tpr, thresholds = roc_curve(test['y'], y_prob)
roc_score = roc_auc_score(test['y'], y_prob)
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.set_xlabel('False positive rate (1 - TNR)')
ax.set_ylabel('True positive rate (TPR)')
print(roc_score)
0.9925

This is a good model!
Brier Score
The Brier score is the same as the mean squared error but is applied to models that output probabilities: \(\frac 1N \sum _{i=1}^N (y^{(i)} - \hat y^{(i)})^2\), where $\hat y^{(i)}$ is the predicted probability that observation $i$ is positive.
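A one-line sketch using scikit-learn's brier_score_loss, assuming arrays y_true (actual labels) and y_prob (predicted probabilities) as above:
from sklearn.metrics import brier_score_loss
brier = brier_score_loss(y_true, y_prob)   # lower is better; 0 is a perfect probabilistic prediction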
Calibration Curve
The calibration curve is a graph where the x-axis has ranges of predictions and the y-axis has the proportion of positive samples for each range. This shows how well calibrated your model is by plotting predictions against actual positives.
A straight line with a gradient of 1 represents an ideal, well-calibrated model.
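A sketch using scikit-learn's calibration_curve, again assuming arrays y_true and y_prob; the number of bins is illustrative:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
fig, ax = plt.subplots()
ax.plot(prob_pred, prob_true, marker='o')   # model calibration
ax.plot([0, 1], [0, 1], linestyle='--')     # perfectly calibrated reference line
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Proportion of positives')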
Estimating Expected Loss
Train/Test Split
As the test set size $\to \infty$, the test error converges to the true error (law of large numbers).
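A sketch of a hold-out split with scikit-learn's train_test_split; X, y and the 20% test fraction are placeholders:
from sklearn.model_selection import train_test_split
# Hold out 20% of the data to estimate the expected loss on unseen observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)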
Cross Validation
Cross validation involves splitting a dataset into three subsets:
- Training set - Train model
- Validation set - Choose hyperparameters
- Test set - Estimate risk
K-Fold Cross-Validation
The training set is split into K folds; the model is iteratively trained on K-1 folds and tested on the remaining fold, so that every observation is used for testing exactly once and the estimate does not depend on a single train/test split. This helps to estimate the generalization error, which is the risk of the function. It is also useful for evaluating how many features to include in the model to minimize both approximation and estimation error. When K = N this is called ‘leave-one-out’ cross validation.
- Divide the data into K partitions (folds) of roughly equal size (typically K is 4-10)
- For each fold k = 1, …, K:
  - Train the model on the dataset excluding fold k ($D_{-k}$)
  - Test the model on fold k ($D_k$)
- The output is a sequence of K error measures; take their mean and standard deviation
Example:
Iteration 1: test = $D_1$, train = $D_2, D_3, D_4, … D_K$
Iteration 2: test = $D_2$, train = $D_1, D_3, D_4, … D_K$
Iteration K: test = $D_K$, train = $D_1, D_2, D_3, … D_{K-1}$
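A sketch of K-fold cross-validation using scikit-learn's cross_val_score, with a placeholder linear regression model and data (X, y):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
errors = -scores                       # one error estimate per fold
print(errors.mean(), errors.std())     # mean and standard deviation of the K errors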
Stratified Sampling
This method is used to ensure representative test and train groups. It involves dividing the dataset into homogeneous subgroups and sampling the training set from each subgroup. Combining this with K-Fold cross validation is called Stratified K-Fold.
Example:
11111100 -> train = [1110], test = [1110]; both splits preserve the dataset’s 3:1 ratio of positives to negatives.
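A sketch using scikit-learn's StratifiedKFold, assuming numpy arrays X (features) and y (class labels):
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the class proportions of the full dataset
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]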