Machine Learning #1: Basics
Machine Learning Overview
Machine learning is the practice of applying statistical theory and models to real data so that computers can gain insight, make predictions, categorise data, and more.
Workflow
Generally, the objective of machine learning is to find a model whose parameters fit the input data as closely as possible. This is done through iterative training with the aim of minimising some kind of loss or risk.
1. Study the problem
2. Train the algorithm on input data
3. Evaluate the solution
	- Bad -> analyse and return to step 1
	- Good -> continue to step 4
4. Deploy
Challenges
- Insufficient quantity of data
- Non-representative training data
- Low quality data
- Bad features/predictors
- Non-stationarity (correct answer for same input changes over time)
- Covariate shift (the distribution of the test data drifts away from that of the training data)
- Overfitting (fits too well to training data)
- Underfitting (doesn’t fit well enough to training data)
Scikit-Learn
Scikit-Learn is a Python module for machine learning that I’ll be using to implement some of the models and techniques discussed. Scikit-learn’s API is built around three object types:
- Estimators: Objects that estimate parameters [.fit()]
- Transformers: Estimators that can transform datasets [.transform() & .fit_transform()]
- Predictors: Estimators that make predictions [.predict()]
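Here is a minimal sketch of the three object types on a toy dataset (the data and numbers are my own, purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Toy dataset: 5 samples, 1 feature, y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Transformer: estimates mean/std with .fit(), rescales with .transform()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # shorthand for .fit() then .transform()

# Estimator + predictor: estimates coefficients with .fit(), predicts with .predict()
model = LinearRegression()
model.fit(X_scaled, y)
print(model.predict(scaler.transform([[6.0]])))  # ~13.0
```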
Categories
Supervised
Algorithms learn a function mapping inputs to outputs from labelled data
Examples: Regression, Classification
Unsupervised
Algorithms learn patterns from unlabelled data
Examples: Clustering, Dimensionality Reduction, Association Rules
Reinforcement
Algorithms learn actions to take to maximise some reward
Example: Game playing (e.g. AlphaGo)
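To make the first two categories concrete, here is a minimal sketch using scikit-learn on toy data (the dataset and model choices are my own):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

# Supervised: labels are provided, a mapping from inputs to labels is learnt
y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]

# Unsupervised: no labels, structure is discovered from the data alone
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # two groups (cluster numbering is arbitrary)
```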
Gradient Descent
Gradient descent is a method for finding a local minimum of a function by iteratively adjusting its parameters. It converges to the global minimum given a convex (bowl-shaped) function and a sufficiently small step size.
Since the aim of machine learning is to minimise some loss, we can apply gradient descent to the loss function to optimise our model.
To minimise $f(x)$ with respect to $x$:
- Initialise $x$ (to any number)
- Repeat until convergence:
- $x \Leftarrow x - \alpha \nabla _x f(x)$
Where $\alpha$ is the step size and $\nabla _x f(x)$ is the gradient (derivative) of $f(x)$ with respect to $x$.
When $x$ reaches a minimum the gradient equals 0, so $x$ no longer changes; this is convergence.
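As a sketch of the update rule on a simple convex function (my own example, $f(x) = (x - 3)^2$ with gradient $2(x - 3)$, minimum at $x = 3$):

```python
def grad_descent(grad, x0, alpha=0.1, n_iters=100):
    """Minimise a function given its gradient via x <- x - alpha * grad(x)."""
    x = x0
    for _ in range(n_iters):
        x = x - alpha * grad(x)
    return x

# f(x) = (x - 3)^2 is convex with gradient 2(x - 3)
x_min = grad_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # ~3.0
```

For this particular function any $\alpha$ above 1 makes the updates overshoot and diverge, which is the trade-off discussed under Step Size below.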
Vector Gradients
For a $p$-dimensional vector $x$ the gradient is defined as the vector of partial derivatives $\frac {\partial f}{\partial x_i}$ for every $x_i$ in the vector: $\nabla _x f(x) = \left( \frac {\partial f}{\partial x_1}, \dots, \frac {\partial f}{\partial x_p} \right)$. The gradient always has the same dimensions as the vector.
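A one-step sketch with NumPy (my own example) showing that the gradient of $f(x) = \|x\|^2$, which is $2x$, has the same dimensions as $x$:

```python
import numpy as np

# f(x) = ||x||^2 has gradient 2x, a vector of the same dimension p as x
x = np.array([1.0, -2.0, 0.5])   # p = 3
grad = 2 * x                     # shape (3,), matching x
x = x - 0.1 * grad               # one gradient descent step
print(x)                         # every component moves towards 0
```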
Step Size
A large step size can ‘overshoot’ the minimum, whereas a small step size can take too long to converge. Therefore there is a trade-off between stability and speed.
Stopping Criteria
We define some tolerance $\epsilon$ and stop when $|f(x_t) - f(x_{t-1})| \leq \epsilon$. Alternatively, a maximum number of iterations can be defined.
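A sketch of both stopping criteria combined (function and variable names are my own):

```python
def grad_descent(f, grad, x0, alpha=0.1, eps=1e-8, max_iters=10_000):
    """Descend until |f(x_t) - f(x_{t-1})| <= eps or max_iters is hit."""
    x = x0
    f_prev = f(x)
    for _ in range(max_iters):
        x = x - alpha * grad(x)
        f_curr = f(x)
        if abs(f_curr - f_prev) <= eps:
            break  # converged: the loss has stopped changing
        f_prev = f_curr
    return x

x_min = grad_descent(lambda x: (x - 3) ** 2, lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # ~3.0
```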
Minibatch Gradient Descent
When the number of points $N$ is large, computing the full gradient can be expensive. In this case we can choose $n$ random points at each step as a subset; gradient descent using this subset is minibatch gradient descent.
Stochastic Gradient Descent
Stochastic (random) gradient descent is minibatch gradient descent with $n = 1$: at each step we pick one point at random, compute the gradient there, and update.
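A sketch of minibatch gradient descent for least-squares linear regression on synthetic data (all names and numbers are my own); setting `batch_size = 1` recovers stochastic gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus a little noise
N = 1000
X = rng.uniform(-1, 1, size=(N, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.standard_normal(N)

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 32  # batch_size = 1 gives stochastic gradient descent

for step in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)  # n random points
    Xb, yb = X[idx, 0], y[idx]
    err = (w * Xb + b) - yb
    # Gradients of the mean squared error on the minibatch
    w -= alpha * 2 * np.mean(err * Xb)
    b -= alpha * 2 * np.mean(err)

print(w, b)  # ~2.0, ~1.0
```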