Why?
What changes when we move from Statistics to Machine Learning? There is a subtle difference in the end goal: statistics and its tools focus on finding knowledge about the world, whereas machine learning is more about making predictions.
Linear Regression
Linear regression is simple enough to illustrate the basic notions of machine learning. For a given parametric function of the data, the basic idea is to find the parameters (slope and offset) that give the best fit on part of the data ('learning'); from then on, the function is expected to also fit other related data with some error ('prediction'). This particular task is simple enough to be solved analytically, but we look at it through the machine learning paradigm, which lets us build up to more complicated tasks.
Given data points \(x_1, x_2, \ldots, x_N \in \mathbb{R}^D\) with known outputs \(y_1, y_2, \ldots, y_N\), we want to find parameters \(\theta_{1 \times D}\) and offset \(b\) s.t. the Mean-Squared Error between the model output \(f(x_i) = \theta x_i + b\) for \(i=1,2,...,N\) and the actual output \(y_i\) is lowest:
\[\textrm{min} \quad L(\theta) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i))^2\]
Again, we could solve this analytically, but we want to define a numerical procedure that also works for complex problems in higher dimensions.
Gradient Descent:
The gradient of a function at a given point gives the direction of maximum increase of that function. Gradients can be computed efficiently, at least numerically, for convex quadratic functions (which is the case for the MSE). We can therefore start from an initial guess and iteratively update the parameters \(\theta_{1 \times D}\) in the negative gradient direction, which decreases the MSE loss until we arrive at an acceptable minimum of the loss function.
Absorbing the offset \(b\) into \(\theta\) (by appending a constant feature of 1 to each \(x_i\)), the gradient is
\[\nabla L(\theta) = -\frac{2}{N} \sum_{i=1}^N (y_i - \theta x_i)x_i\]
We can thus use the following update rule:
\[\theta \leftarrow \theta - \eta \nabla L(\theta)\]
to arrive at a minimum of the MSE function.
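As a minimal sketch of this procedure (NumPy, on made-up one-dimensional data, ignoring the offset \(b\) for brevity; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

# Toy 1-D data: y ~ 3*x + noise (the "true" slope 3 is made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

theta = 0.0   # initial guess
eta = 0.1     # learning rate (arbitrary choice)

for _ in range(200):
    grad = -2.0 / len(x) * np.sum((y - theta * x) * x)  # gradient of the MSE
    theta = theta - eta * grad                          # update rule

print(theta)  # converges close to the true slope of 3
```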
Note that analytically,
\[\theta \rightarrow \theta_{min}\]
for
\[\nabla L(\theta) = 0\]
\[\implies -\frac{2}{N} \sum_{i=1}^N (y_i - \theta x_i)x_i = 0\]
\[\implies \sum_{i=1}^N (y_i x_i - \theta {x_i}^2) = 0\]
\[\implies \theta = \frac{\sum_{i=1}^N y_i x_i}{\sum_{i=1}^N {x_i}^2}\]
or, in matrix form for the general \(D\)-dimensional case,
\[\theta = (X^T X)^{-1} X^T y\]
where \(X\) is the \(N \times D\) data matrix (the scalar expression above is the one-dimensional special case).
Note that if \(X^T X = I\), which is the case when \(X\) has orthonormal columns (the features \(x_i\) are orthogonal, with \(x^T_i x_j = 0\) for \(i \neq j\) and \(x^T_i x_i = 1\), and thus uncorrelated), then \(\theta = X^T y\), i.e. the regression coefficient of each predictor can be computed independently, which is a great simplification. This is one of the motivations for PCA and whitening.
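For comparison, a sketch of the closed-form solution on synthetic multivariate data (NumPy; the 'true' coefficients are made up, and `np.linalg.solve` is used instead of explicitly inverting \(X^T X\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # N x D data matrix
true_theta = np.array([1.0, -2.0, 0.5])   # made-up ground truth
y = X @ true_theta + 0.01 * rng.normal(size=100)

# Normal equations: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [1.0, -2.0, 0.5]

# If the columns of X were orthonormal (X^T X = I), this would reduce to theta = X^T y
```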
Linear Model for classification
Linear classification means dividing the input space into regions via decision boundaries/surfaces which are linear functions of the input vector \(x\).
There are two distinct approaches to classification:
The perceptron algorithm
The perceptron criterion is defined as follows,
\[L_p(\theta) = -\sum_{n \in M} \theta^T \phi(x_n) t_n\]
where \(\phi(.)\) can be a non-linear transformation, \(M\) is the subset of points that are incorrectly classified and \(t_n \in \{-1, +1\} \). Therefore, \(L_p(\theta)\) associates a zero error with any pattern that is correctly classified, while for a misclassified pattern it seeks to minimize \(-\theta^T \phi(x_n) t_n\), i.e. the contribution to the error of a misclassified pattern is a linear function of \(\theta\), and zero elsewhere. Being a piecewise linear function, (stochastic) gradient descent can be applied.
\[\theta_{t+1} = \theta_{t} - \eta \nabla L_p(\theta) \]
\[\implies \theta_{t+1} = \theta_{t} + \eta \phi(x_n) t_n \]
applied per misclassified pattern \(x_n\).
So, for each pattern \(x_n\), if it is correctly classified, \(\theta\) remains unchanged. If it is incorrectly classified then,
for class \(K_1\), i.e. \(t_n = +1\), \(\theta_{t+1}\) increases by \(\eta \phi(x_n)\), whereas
for class \(K_2\), i.e. \(t_n = -1\), \(\theta_{t+1}\) decreases by \(\eta \phi(x_n)\).
Note that as \(\theta\) changes, \(M\) changes.
The perceptron algorithm is guaranteed to converge if the data is linearly separable; otherwise it will never converge. Also, it does not generalize to \(K > 2\) classes.
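A minimal sketch of these updates (NumPy, using \(\phi(x) = (x, 1)\) so the bias is absorbed into \(\theta\); the toy data is made linearly separable by construction, so the loop is guaranteed to terminate):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data labelled by the sign of a made-up "true" hyperplane => linearly separable
X = rng.normal(size=(200, 2))
t = np.sign(X @ np.array([2.0, -1.0]) + 0.5)

Phi = np.hstack([X, np.ones((len(X), 1))])   # phi(x) = (x, 1)
theta = np.zeros(3)
eta = 1.0

converged = False
while not converged:
    converged = True
    for phi_n, t_n in zip(Phi, t):
        if t_n * (theta @ phi_n) <= 0:            # misclassified pattern
            theta = theta + eta * t_n * phi_n     # theta <- theta + eta * phi(x_n) * t_n
            converged = False
```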
Principal Component Analysis
The essence of PCA lies in the fact that across many data sets, the data points lie close to a manifold of much lower dimensionality than that of the original data space. We want to exploit this structure for dimensionality reduction, lossy data compression, feature extraction and data visualization. PCA seeks an orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized (equivalently, the mean squared reconstruction error is minimized).
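A minimal sketch of this projection via an eigendecomposition of the covariance matrix (NumPy; the synthetic data and the number of retained components are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # synthetic N x D data

Xc = X - X.mean(axis=0)                   # center the data
cov = Xc.T @ Xc / len(Xc)                 # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order

k = 2                                     # target dimensionality (arbitrary)
W = eigvecs[:, -k:]                       # top-k principal directions
Z = Xc @ W                                # projected, lower-dimensional data
```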
The Kernel Trick
Many linear models use the data \(x\) only through inner products \(x^T x'\). Because of this, such models can be made non-linear in a very general way: we can map our data \(x\) using some non-linear, higher-dimensional function \(\phi(.)\). If our model uses the data only through a dot product, then with the mapping we have \(<\phi(x),\phi(x')>\). Now, Mercer's theorem states that a symmetric function \(k(x,x')\), if it is positive semi-definite, can be expressed as an inner product \(k(x,x') = <\phi(x),\phi(x')>\) for some \(\phi\). Such a function is called a kernel.
Note that \(\phi(.)\) is a mapping into a higher-dimensional space, whereas the function \(k(.)\) as described is a function of the original data in its original dimensions. This is powerful because it means we can replace \(<\phi(x),\phi(x')>\) in algorithms with \(k(x,x')\), which gives us an implicit feature mapping while performing computations only on the original data.
Say you have 300 features and you decide to use a feature map with a 3rd-order polynomial; you would have on the order of \(300^3\) dimensions after the mapping. That's difficult to deal with computationally. With a standard kernel, you get the same effect but only ever deal with a function in the original dimensions, \(k(x,x')\), because Mercer's condition ensures that some \(\phi\) exists with \(\phi(x) \cdot \phi(x') = k(x,x')\).
So we exploit two facts: certain dot products in the higher-dimensional space are equal to this special function of the data in its original dimensions, and most ML models use the data through a dot product somewhere. Computing \(\phi(x)\) explicitly is expensive, and so would be \(\phi(x) \cdot \phi(x')\), but the latter becomes cheap simply because \(\phi(x) \cdot \phi(x')\) can be replaced by \(k(x,x')\).
Also, we can just use standard kernels or their combinations (which are guaranteed to satisfy Mercer's condition) without having to think about how to come up with a feature map \(\phi(.)\).
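As a small sanity check of this equivalence (NumPy; the quadratic kernel \(k(x,x') = (x^T x')^2\) and the 2-D inputs are chosen purely for illustration), the kernel value matches the dot product of the explicit feature map \(\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\):

```python
import numpy as np

def phi(x):
    # Explicit feature map corresponding to the quadratic kernel on 2-D inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, xp):
    # Same quantity computed directly in the original 2-D space
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))   # 1.0
print(k(x, xp))           # 1.0 -- identical, with no explicit mapping needed
```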
To put it concretely, consider a linear regression model,
\[ L(\theta) = \frac{1}{2} \sum (\theta^T \phi(x_i) - y_i)^2 + \frac{\lambda}{2} \theta^T \theta \]
Setting \(\nabla L(\theta) = 0\),
\[ \theta = -\frac{1}{\lambda} \sum (\theta^T \phi(x_i) - y_i) \phi(x_i) \]
Let \( \alpha_i = -\frac{1}{\lambda} (\theta^T \phi(x_i) - y_i) \)
\[ \implies \theta = \sum_i \alpha_i \phi(x_i) = \Phi^T a \]
where \(a = (\alpha_1, \ldots, \alpha_N)^T\) and \( \Phi \) is the design matrix whose \(i\)-th row is \(\phi(x_i)^T\), i.e. the data matrix from before with the mapping \(\phi(.)\) applied.
Thus, we have a dual representation using \(\theta = \Phi^T a\),
\[ L(a) = \frac{1}{2} ( \Phi \Phi^T a - y )^T ( \Phi \Phi^T a - y ) + \frac{\lambda}{2} a^T \Phi \Phi^T a \]
\[ \qquad \qquad \qquad \qquad \qquad = \frac{1}{2} a^T \Phi \Phi^T \Phi \Phi^T a - a^T \Phi \Phi^T y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T \Phi \Phi^T a \]
Define \(K = \Phi \Phi^T \)
\[\therefore L(a) = \frac{1}{2} a^T K K a - a^T K y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T K a \]
Again setting the gradient of the loss function (now with respect to \(a\)) to zero, we get
\[ K K a - K y + \lambda K a = 0 \]
\[ \implies K (K + \lambda I) a = K y \]
\[ \implies (K + \lambda I) a = y \]
\[ \implies a = (K + \lambda I)^{-1} y \]
Substituting back, the prediction for a new input \(x\) is
\[ y(x) = \theta^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I)^{-1} y \]
where \(k(x)\) is the vector with elements \(k_n(x) = k(x_n, x)\).
Thus we see that the dual representation allows the solution to the least-squares problem to be expressed entirely in terms of the kernel. We can therefore work directly in terms of the kernel and avoid the explicit introduction of the feature vector \(\phi(.)\), which allows us to implicitly use feature spaces of high dimensionality. This dual representation based on the kernel is a property of many linear models (with their quadratic objectives having only global optima and exhibiting fast convergence), including the perceptron, SVMs and more.
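A minimal sketch of this dual solution, i.e. kernel ridge regression (NumPy; the RBF kernel, its width, \(\lambda\) and the toy data are all arbitrary choices):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), evaluated for all pairs of rows
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 0.1
K = rbf_kernel(X, X)                                 # K = Phi Phi^T, implicitly
a = np.linalg.solve(K + lam * np.eye(len(X)), y)     # a = (K + lambda I)^{-1} y

X_new = np.array([[0.5]])
k_new = rbf_kernel(X, X_new)[:, 0]                   # vector k(x) with entries k(x_n, x)
print(k_new @ a)                                     # y(x) = k(x)^T a, close to sin(0.5)
```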
To sum up, the kernel trick is powerful for the following reasons: it gives an implicit, possibly very high-dimensional feature mapping while all computations stay in the original data space; standard kernels and their combinations can be used without having to design \(\phi(.)\) by hand; and it applies to any linear model that uses the data only through inner products.
Support Vector Machines
Recall that the perceptron algorithm is guaranteed to find a decision boundary for linearly separable data, but the result depends heavily on the initialization (and the order in which the points are presented). SVM is a more systematic and optimal approach to finding the decision boundary: it chooses the one that maximizes the buffer (margin) around it.
Given training data \( (x_1, t_1), (x_2, t_2), \ldots, (x_n, t_n) \), a two-class problem with a linear model is:
\[ y(x) = \theta^T \phi(x) + b \]
Assume the problem to be linearly separable in the feature space. Then \( \exists\, \theta \in \mathbb{R}^d, b \in \mathbb{R}\) such that \( t_n y(x_n) > 0 \) for all training points (just think of the sign of this product when the class is correctly or incorrectly identified).
Now, we are interested in \(\theta, b\) s.t. the margin \(M\) (the perpendicular distance between the decision boundary and the closest of the data points) is maximized.
Again, as previously discussed, any data point \(x_i\) can be decomposed into its projection \(x_{i\perp}\) onto the decision surface plus a component of (signed) length \(r_i\) along the normal direction \(\theta/||\theta||\),
\[x_i = x_{i\perp} + r_i \frac{\theta}{||\theta||}\]
\[\implies (\theta^T x_i + b) = (\theta^T x_{i\perp} + b) + r_i \frac{\theta^T \theta}{||\theta||} = 0 + r_i ||\theta|| \]
\[\implies r_i = \frac{\theta^T x_i + b}{||\theta||} = \frac{y(x_i)}{||\theta||} \]
Since we are only interested in correctly classified data points, i.e. \( t_i y(x_i) > 0 \;\forall i \in 1,2,...,n \), the distance of \(x_i\) from the boundary is \(r_i = \frac{t_i y(x_i)}{||\theta||}\). The goal of SVM is to find the hyperplane that has the maximum margin \(M\) from the closest points, i.e. we maximize the minimum distance, a max-min problem:
\[ \underset{\theta, b}{\operatorname{max}} \bigg( \underset{i}{\operatorname{min}} \frac{t_i y(x_i)}{||\theta||} \bigg) \textrm{ s.t. } t_i y(x_i) > 0\]
We can simplify this problem by noting that the distances \(r_i\) are unchanged if we rescale \(\theta \rightarrow \kappa\theta\) and \(b \rightarrow \kappa b\), so we are free to fix the scale. Rather than fixing the norm \(||\theta|| = 1\), we fix the product \(M||\theta|| = 1\), so that for the point(s) closest to the boundary,
\[ M = \frac{t_i y(x_i)}{||\theta||} \implies M||\theta|| = t_i y(x_i) = 1 \]
Hence, our problem can be rewritten as,
\[ \underset{\theta, b}{\operatorname{max}} \bigg( \frac{1}{||\theta||} \underset{i}{\operatorname{min}}\; t_i y(x_i) \bigg) \textrm{ s.t. } \underset{i}{\operatorname{min}}\; t_i y(x_i) = 1\]
\[ \implies \underset{\theta, b}{\operatorname{max}} \bigg( \frac{1}{||\theta||}\bigg) \textrm{ s.t. } \underset{i}{\operatorname{min}}\; t_i y(x_i) = 1\]
\[ \implies \underset{\theta, b}{\operatorname{min}} ||\theta||^2 \textrm{ s.t. } \underset{i}{\operatorname{min}}\; t_i y(x_i) = 1\]
or, relaxing the equality to an inequality constraint that is tight at the optimum,
\[ \underset{\theta, b}{\operatorname{min}} ||\theta||^2 \textrm{ s.t. } t_i y(x_i) \ge 1 \forall i \in 1,2,...,n\]
This can be solved as a constrained optimization problem using the Lagrangian formulation.
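As a sketch of how this constrained problem is usually handed to an off-the-shelf solver (here scikit-learn's `SVC` with a linear kernel on made-up, well-separated data; a large `C` roughly approximates the hard-margin formulation above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs => (almost surely) linearly separable toy data
X = np.vstack([rng.normal(loc=-3, size=(50, 2)), rng.normal(loc=+3, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1e6).fit(X, t)    # large C ~ hard margin

theta = clf.coef_[0]
margin = 1.0 / np.linalg.norm(theta)           # M = 1/||theta|| once min_i t_i y(x_i) = 1
print(margin)
print(clf.support_vectors_)                    # the closest points, which define the margin
```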