CS342 Machine Learning
Lecture 1 – January 8th, 2018
Ask questions on the Machine Learning forum, if you can, instead of emailing Theo directly.
Assessment
- Final Exam (60%), based on lecture material, PPs & worksheets
- Coursework (40%):
- First assignment (15%).
- Second assignment (25%), on a real-world ML competition platform.
Syllabus
- A First Course in Machine Learning, 2nd Edition, S. Rogers & M. Girolami
- Machine Learning, T. Mitchell
How does one learn?
- Supervised learning
- Direct feedback: explicit labels, e.g. car is blue, car is red, etc.
- Indirect reward feedback: better, worse, hot/cold, etc. (e.g. dopamine)
- Unsupervised learning
- Creating clusters or groups, finding patterns in data with no direct goal for what constitutes the right pattern to spot.
The common goal of ML systems
- They learn from data (“experience”).
- A computer program is said to learn from experience E (data) with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Main subfields of ML
- Supervised
- Regression
- Classification
- Ranking
- Structured Prediction
- Unsupervised
- Clustering
- Dimensionality Reduction (100x100 image is a vector in a 10000 dimensional space, which needs to be reduced to work with)
- Manifold Learning
- Reinforcement
- Markov Decision Process (e.g. Robotics)
- Multi-Agent Systems
- TD/Q-Learning (State-Action-Reward)
Regression
Lecture 2 – January 11th, 2018
Conventionally, in computer science, the program and the input go into the computer, and the output comes out. With supervised learning, the inputs and outputs go in, and the program (the model) comes out.
Hypothesis: the relationship between x and t is linear, i.e. t = w_0 + w_1 x.
What are the best parameters? To choose among candidate lines (e.g. the black and red lines on the slide) we need a measure of goodness of fit that we try to maximise.
To generalize:
Data: input/output pairs (x_n, t_n), n = 1, …, N.
Parameters: w_0 (intercept) and w_1 (slope).
Hypothesis space: t = f(x; w_0, w_1) = w_0 + w_1 x.
“Goodness” metric: e.g. the distance between t_n and f(x_n; w_0, w_1).
Note: you don’t have to run your learning algorithms on the raw input. By intelligently preprocessing raw data so that you end up with a smaller feature set, you can end up with a more effective result. The concept of creating features from inputs is called feature engineering.
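A rough illustration of feature engineering (the helper name engineer_features and the particular transforms are hypothetical, not from the lecture): instead of handing the learner every raw input value, derive a few informative summary features.

import numpy as np

def engineer_features(raw):
    # raw: (N, D) array of raw measurements (made-up example).
    # Derive a smaller, hopefully more informative feature set:
    # per-example mean, standard deviation and range.
    mean = raw.mean(axis=1)
    std = raw.std(axis=1)
    rng = raw.max(axis=1) - raw.min(axis=1)
    return np.column_stack([mean, std, rng])

raw = np.random.rand(5, 100)          # 5 examples, 100 raw inputs each
features = engineer_features(raw)     # 5 examples, 3 engineered features
print(features.shape)                 # (5, 3)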
Goodness metrics
Squared loss: L_n = (t_n - f(x_n; w_0, w_1))^2
Absolute difference: L_n = |t_n - f(x_n; w_0, w_1)|
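A minimal sketch of the two loss metrics evaluated for one candidate line t = w_0 + w_1 x (the data values and the candidate parameters below are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([2.1, 3.9, 6.2, 8.1])
w0, w1 = 0.0, 2.0                          # a candidate parameter setting

pred = w0 + w1 * x
squared_loss = np.mean((t - pred) ** 2)    # average squared loss
absolute_loss = np.mean(np.abs(t - pred))  # average absolute difference
print(squared_loss, absolute_loss)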
So, what we’re looking for: the parameters (w_0, w_1) that minimise the average loss L = (1/N) Σ_n L_n.
Approach:
- Replace the generic expression with a specific loss function (e.g. squared loss).
- Set the first derivative of the loss to zero and solve to get ŵ (the candidate minima); see the sketch after this list.
- Examine the second derivative of the loss at ŵ to determine whether it is a unique minimum.
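A sketch of this approach for the squared loss, where setting the two derivatives to zero gives the usual closed-form estimates (the data here is made up; notation follows the w_0/w_1 convention above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([2.1, 3.9, 6.2, 8.1])

# Setting d(loss)/dw0 = 0 and d(loss)/dw1 = 0 for the average squared
# loss gives the familiar closed-form solution:
w1_hat = (np.mean(x * t) - np.mean(x) * np.mean(t)) / (np.mean(x ** 2) - np.mean(x) ** 2)
w0_hat = np.mean(t) - w1_hat * np.mean(x)
print(w0_hat, w1_hat)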
Lecture 3 – January 12th, 2018
Missed.
Lecture 4 – January 15th, 2018
Missed.
Lecture 5 – January 18th, 2018
Euclidean vs. Manhattan distance.
Manhattan: you can’t move diagonally, unlike Euclidean (“as the crow flies”).
With Manhattan distance, when deciding on boundaries based on distances to points, you may end up with boundaries that are equidistant from two points.
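A small sketch comparing the two distances on made-up points:

import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight line, "as the crow flies" -> 5.0
manhattan = np.sum(np.abs(a - b))          # grid walk, no diagonals -> 7.0
print(euclidean, manhattan)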
Distance-weighted k-NN
- Weight the vote of each neighbour by its distance to the observation: the further away it is, the less its vote counts (see the sketch after this list).
- Can help break ties when k is even.
- Can help deal with noisy data and outliers.
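A minimal sketch of distance-weighted k-NN classification, assuming inverse-distance weights (one common choice, not necessarily the exact weighting used in the lecture; the helper name and data are hypothetical):

import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-9):
    # Euclidean distances from the query point x to every training point.
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # Each neighbour votes with weight 1/distance: closer points count more.
    weights = 1.0 / (dists[nearest] + eps)
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(weighted_knn_predict(X_train, y_train, np.array([4.5, 4.5])))  # -> 1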
Lecture 6 – January 19th, 2018
Missed lecture.
Lecture 7 – January 22nd, 2018
Classification: Binary and Multi-class (Multinomial)
For binary classification, True Positives and True Negatives are correct classifications; False Positives and False Negatives are incorrect classifications.
You might care more about false negatives than false positives. This must be stated in how you measure loss (weighting false negatives more heavily).
The ROC Curve: Receiver Operating Characteristic curve
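The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies. A rough sketch computing the confusion counts and a few ROC points by hand (the labels, scores and thresholds below are made up):

import numpy as np

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])                     # true classes
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])    # classifier scores

for thresh in [0.25, 0.5, 0.75]:
    pred = (scores >= thresh).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tn = np.sum((pred == 0) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    tpr = tp / (tp + fn)   # true positive rate (recall)
    fpr = fp / (fp + tn)   # false positive rate
    print(thresh, tpr, fpr)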
Ridge regression
To reduce the size of the parameter space you can penalise large parameters by including them in your loss function. Picking different norms gives different results: ridge regression uses the L2 ball, while the Lasso uses the L1 ball, which works well for sparse, high-dimensional data.
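A minimal sketch of ridge regression in its closed form, where an L2 penalty λ‖w‖^2 is added to the squared loss (the data, λ value and design matrix below are made up; for simplicity the bias column is regularised too):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # first column = bias term
t = np.array([2.1, 3.9, 6.2, 8.1])
lam = 0.1   # regularisation strength

# Closed form: w = (X^T X + lambda * I)^(-1) X^T t
# The penalty shrinks large parameter values towards zero.
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
print(w)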
Lecture 8 – January 25th, 2018
No notes.
Lecture 9 – January 26th, 2018
Missed lecture.
Lecture 10 – January 29th, 2018
Improved k-NN
A weighted distance function:
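A sketch of one such weighted distance, assuming per-feature weights that emphasise informative dimensions (the helper name, weights and points below are hypothetical, not from the lecture):

import numpy as np

def weighted_distance(x, y, alpha):
    # Per-feature weights alpha scale each dimension's contribution.
    return np.sqrt(np.sum(alpha * (x - y) ** 2))

alpha = np.array([1.0, 0.1])   # the second feature matters much less
x = np.array([0.0, 0.0])
y = np.array([1.0, 10.0])
print(weighted_distance(x, y, alpha))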
Bias-Variance decomposition
Trivial…
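For reference, the standard decomposition of the expected squared error of an estimated model f̂ at a point x, with true function f and noise variance σ^2 (generic notation, not necessarily the lecture's):
E[(t - f̂(x))^2] = (E[f̂(x)] - f(x))^2 + E[(f̂(x) - E[f̂(x)])^2] + σ^2
i.e. bias squared + variance + irreducible noise.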
Lecture 11 – February 1st, 2018
By only keeping the point estimate ŵ and throwing away all the other potential solutions, you don’t know about the stability of the solution.
Bayesian Inference
Idea: treat w as a random variable, e.g. with a Gaussian prior over its values.
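A sketch of the idea in the linear-regression setting (the Gaussian choices below are the textbook's standard ones, assumed here rather than taken from the lecture): with prior p(w) = N(0, σ_0^2 I) and likelihood p(t | X, w) = N(Xw, σ^2 I), Bayes' rule gives a full posterior over parameters rather than a single point estimate:
p(w | X, t) ∝ p(t | X, w) p(w)
For these Gaussian choices the posterior is itself Gaussian, so you retain information about how stable ŵ is (its covariance), not just its value.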
Lecture 12 – February 2nd, 2018
Missed lecture.