Machine Learning #1: Basics

2 minute read

Machine Learning Overview

Machine learning is the way computers can use statistical theories and models on real data to gain insight, make predictions, categorize and more.

Workflow

Generally the objective of machine learning is to find a model with coefficients that matches the input data as closely as possible. This is done through recursive training with the aim of minimizing some kind of loss or risk.

Study the problem
Train the algorithm on input data
Evaluate the solution
- Bad -> Analyse -> #1
- Good -> #4
Deploy

Challenges

Insufficient quantity of data
Non-representative training data
Low quality data
Bad features/predictors
Non-stationarity (correct answer for same input changes over time)
Covariant drift (testing data drifts since time of training)
Overfitting (fits too well to training data)
Underfitting (doesn’t fit well enough to training data)

Scikit-Learn

Scikit-Learn is a Python module for machine learning that I’ll be using to implement some of the models and techniques discussed. In scikit-learn there are three object types.

Estimators: Objects that estimate parameters [.fit()]
Transformers: Estimators that can transform datasets [.transform() & .fit-transform()]
Predictors: Estimators that make predictions [.predict()]

Gradient Descent

Gradient descent is a method of finding the local minima of a function by varying some coefficient. It can converge to the global minima given a convex (bowl shaped) function and small step size.
Since the aim of machine learning is to minimize some loss, we can use gradient descent on the loss function to optimize our algorithm.

To minimise $f(x)$ with respect to $x$:

Initialise $x$ (to any number)
Repeat until convergence:
- $x \Leftarrow x - \alpha \nabla _x f(x)$

Where $\alpha$ is the step size and $\nabla _x f(x)$ is the gradient / the differential of $f(x)$.
When $x$ reaches a minima the gradient will equal 0 and $x$ will no longer change, this is convergence.

Vector Gradients

For a p-dimensional vector x the gradient is defined as a vector of $\frac {\delta}{\delta x_i}$ for every $x_i$ in the vector. The gradient will always be the same dimensions as the vector.

Step Size

A large step size can ‘overshoot’ the minima whereas a small step size can take too long to converge. Therefore there is a trade-off between stability and speed.

Stopping Criteria

We define some difference value $\epsilon$ then if $|f(x_t) - f(x_{t-1})| \leq \epsilon$ then STOP. Alternatively, a maximum number of iterations can be defined.

Minibatch Gradient Descent

When the number of points (N) is large, computing the gradient can be expensive. In this case we can choose n random points at each step as a sub-set, using this sub-set is minibatch gradient descent.

Stochastic Gradient Descent

Stochastic (random) gradient descent is the same as minibatch where n=1. This means at each step taking 1 point randomly, finding the gradient and moving from there.