Why?
What changes when we move from Statistics to Machine Learning? There is a subtle difference in the end goal: statistics and its tools focus on finding knowledge about the world, whereas machine learning is more about making predictions.
Linear Regression
Linear regression is simple enough to illustrate the basic notions of machine learning. For a given parametric function of the data, the basic idea is to find the parameters (slope and offset) that give the best fit on part of the data ('learning'); from then on, the function is expected to also fit other related data with some error ('prediction'). This particular task is simple enough to be solved analytically, but we look at it through the machine learning paradigm, which lets us build up to more complicated tasks.
Given data points \(x_1, x_2, \ldots, x_N \in \mathbb{R}^D\) with known outputs \(y_1, y_2, \ldots, y_N\), we want to find parameters \(\theta_{1 \times D}\) and offset \(b\) s.t. the Mean-Squared Error between the model output \(f(x_i) = \theta x_i + b\) for \(i=1,2,...,N\) and the actual output \(y_i\) is lowest:
\[\textrm{min} \quad L(\theta) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i))^2\]
Again, we could solve this analytically, but we want to define a numerical procedure that also works for complex problems in higher dimensions.
Gradient Descent:
The gradient of a function at a given point gives the direction of maximum increase of that function. Gradients can be computed efficiently, at least numerically, for convex quadratic functions (which is the case for the MSE). We can therefore start from an initial guess and iteratively update the parameters \(\theta_{1 \times D}\) in the negative gradient direction, which decreases the MSE loss until we arrive at an acceptable minimum of the loss function.
Absorbing the offset \(b\) into \(\theta\) (by appending a constant feature of 1 to each \(x_i\)), the gradient is
\[\nabla L(\theta) = -\frac{2}{N} \sum_{i=1}^N (y_i - \theta x_i)x_i\]
We can thus use the following update rule:
\[\theta \leftarrow \theta - \eta \nabla L(\theta)\]
to arrive at a minimum of the MSE function.
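As a minimal sketch of this procedure (NumPy, on made-up one-dimensional data, ignoring the offset \(b\) for brevity; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

# Toy 1-D data: y ~ 3*x + noise (the "true" slope 3 is made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

theta = 0.0   # initial guess
eta = 0.1     # learning rate (arbitrary choice)

for _ in range(200):
    grad = -2.0 / len(x) * np.sum((y - theta * x) * x)  # gradient of the MSE
    theta = theta - eta * grad                          # update rule

print(theta)  # converges close to the true slope of 3
```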
Note that analytically,
\[\theta \rightarrow \theta_{min}\]
for
\[\nabla L(\theta) = 0\]
\[\implies -\frac{2}{N} \sum_{i=1}^N (y_i - \theta x_i)x_i = 0\]
\[\implies \sum_{i=1}^N (y_i x_i - \theta {x_i}^2) = 0\]
\[\implies \theta = \frac{\sum_{i=1}^N y_i x_i}{\sum_{i=1}^N {x_i}^2}\]
or, in matrix form for the general \(D\)-dimensional case,
\[\theta = (X^T X)^{-1} X^T y\]
where \(X\) is the \(N \times D\) data matrix (the scalar expression above is the one-dimensional special case).
Note that if \(X^T X = I\), which is the case when \(X\) has orthonormal columns (the features \(x_i\) are orthogonal, with \(x^T_i x_j = 0\) for \(i \neq j\) and \(x^T_i x_i = 1\), and thus uncorrelated), then \(\theta = X^T y\), i.e. the regression coefficient of each predictor can be computed independently, which is a great simplification. This is one of the motivations for PCA and whitening.
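For comparison, a sketch of the closed-form solution on synthetic multivariate data (NumPy; the 'true' coefficients are made up, and `np.linalg.solve` is used instead of explicitly inverting \(X^T X\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # N x D data matrix
true_theta = np.array([1.0, -2.0, 0.5])   # made-up ground truth
y = X @ true_theta + 0.01 * rng.normal(size=100)

# Normal equations: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [1.0, -2.0, 0.5]

# If the columns of X were orthonormal (X^T X = I), this would reduce to theta = X^T y
```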
Linear Model for classification
Linear classification means dividing the input space into regions via decision boundaries/surfaces which are linear functions of the input vector \(x\).
There are two distinct approaches to classification:
The perceptron algorithm
The perceptron criterion is defined as follows,
\[L_p(\theta) = -\sum_{n \in M} \theta^T \phi(x_n) t_n\]
where \(\phi(.)\) can be a non-linear transformation, \(M\) is the subset of points that are incorrectly classified and \(t_n \in \{-1, +1\} \). Therefore, \(L_p(\theta)\) associates a zero error with any pattern that is correctly classified, while for a misclassified pattern it seeks to minimize \(-\theta^T \phi(x_n) t_n\), i.e. the contribution to the error of a misclassified pattern is a linear function of \(\theta\), and zero elsewhere. Being a piecewise linear function, (stochastic) gradient descent can be applied.
\[\theta_{t+1} = \theta_{t} - \eta \nabla L_p(\theta) \]
\[\implies \theta_{t+1} = \theta_{t} + \eta \phi(x_n) t_n \]
applied per misclassified pattern \(x_n\).
So, for each pattern \(x_n\), if it is correctly classified, \(\theta\) remains unchanged. If it is incorrectly classified then,
for class \(K_1\), i.e. \(t_n = +1\), \(\theta_{t+1}\) increases by \(\eta \phi(x_n)\), whereas
for class \(K_2\), i.e. \(t_n = -1\), \(\theta_{t+1}\) decreases by \(\eta \phi(x_n)\).
Note that as \(\theta\) changes, \(M\) changes.
The perceptron algorithm is guaranteed to converge if the data is linearly separable; otherwise it will never converge. Also, it does not generalize to \(K > 2\) classes.
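A minimal sketch of these updates (NumPy, using \(\phi(x) = (x, 1)\) so the bias is absorbed into \(\theta\); the toy data is made linearly separable by construction, so the loop is guaranteed to terminate):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data labelled by the sign of a made-up "true" hyperplane => linearly separable
X = rng.normal(size=(200, 2))
t = np.sign(X @ np.array([2.0, -1.0]) + 0.5)

Phi = np.hstack([X, np.ones((len(X), 1))])   # phi(x) = (x, 1)
theta = np.zeros(3)
eta = 1.0

converged = False
while not converged:
    converged = True
    for phi_n, t_n in zip(Phi, t):
        if t_n * (theta @ phi_n) <= 0:            # misclassified pattern
            theta = theta + eta * t_n * phi_n     # theta <- theta + eta * phi(x_n) * t_n
            converged = False
```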
Principal Component Analysis
The essence of PCA lies in the fact that across many data sets, the data points lie close to a manifold of much lower dimensionality than that of the original data space. We want to exploit this structure for dimensionality reduction, lossy data compression, feature extraction and data visualization. PCA seeks an orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized (equivalently, the mean squared reconstruction error is minimized).
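A minimal sketch of this projection via an eigendecomposition of the covariance matrix (NumPy; the synthetic data and the number of retained components are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # synthetic N x D data

Xc = X - X.mean(axis=0)                   # center the data
cov = Xc.T @ Xc / len(Xc)                 # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order

k = 2                                     # target dimensionality (arbitrary)
W = eigvecs[:, -k:]                       # top-k principal directions
Z = Xc @ W                                # projected, lower-dimensional data
```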
The Kernel Trick
Many linear models use the data \(x\) only through inner products \(x^T x'\). Because of this, such models can be made non-linear in a very general way: we can map our data \(x\) using some non-linear, higher-dimensional function \(\phi(.)\). If our model uses the data only through a dot product, then with the mapping we have \(<\phi(x),\phi(x')>\). Now, Mercer's theorem states that a symmetric function \(k(x,x')\), if it is positive semi-definite, can be expressed as an inner product \(k(x,x') = <\phi(x),\phi(x')>\) for some \(\phi\). Such a function is called a kernel.
Note that \(\phi(.)\) is a mapping into a higher-dimensional space, whereas the function \(k(.)\) as described is a function of the original data in its original dimensions. This is powerful because it means we can replace \(<\phi(x),\phi(x')>\) in algorithms with \(k(x,x')\), which gives us an implicit feature mapping while performing computations only on the original data.
Say you have 300 features and you decide to use a feature map with a 3rd-order polynomial; you would have on the order of \(300^3\) dimensions after the mapping. That's difficult to deal with computationally. With a standard kernel, you get the same effect but only ever deal with a function in the original dimensions, \(k(x,x')\), because Mercer's condition ensures that some \(\phi\) exists with \(\phi(x) \cdot \phi(x') = k(x,x')\).
So we exploit two facts: certain dot products in the higher-dimensional space are equal to this special function of the data in its original dimensions, and most ML models use the data through a dot product somewhere. Computing \(\phi(x)\) explicitly is expensive, and so would be \(\phi(x) \cdot \phi(x')\), but the latter becomes cheap simply because \(\phi(x) \cdot \phi(x')\) can be replaced by \(k(x,x')\).
Also, we can just use standard kernels or their combinations (which are guaranteed to satisfy Mercer's condition) without having to think about how to come up with a feature map \(\phi(.)\).
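As a small sanity check of this equivalence (NumPy; the quadratic kernel \(k(x,x') = (x^T x')^2\) and the 2-D inputs are chosen purely for illustration), the kernel value matches the dot product of the explicit feature map \(\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\):

```python
import numpy as np

def phi(x):
    # Explicit feature map corresponding to the quadratic kernel on 2-D inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, xp):
    # Same quantity computed directly in the original 2-D space
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))   # 1.0
print(k(x, xp))           # 1.0 -- identical, with no explicit mapping needed
```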
To put it concretely, consider a linear regression model,
\[ L(\theta) = \frac{1}{2} \sum (\theta^T \phi(x_i) - y_i)^2 + \frac{\lambda}{2} \theta^T \theta \]
Setting \(\nabla L(\theta) = 0\),
\[ \theta = -\frac{1}{\lambda} \sum (\theta^T \phi(x_i) - y_i) \phi(x_i) \]
Let \( \alpha_i = -\frac{1}{\lambda} (\theta^T \phi(x_i) - y_i) \)
\[ \implies \theta = \sum_i \alpha_i \phi(x_i) = \Phi^T a \]
where \(a = (\alpha_1, \ldots, \alpha_N)^T\) and \( \Phi \) is the design matrix whose \(i\)-th row is \(\phi(x_i)^T\), i.e. the data matrix from before with the mapping \(\phi(.)\) applied.
Thus, we have a dual representation using \(\theta = \Phi^T a\),
\[ L(a) = \frac{1}{2} ( \Phi \Phi^T a - y )^T ( \Phi \Phi^T a - y ) + \frac{\lambda}{2} a^T \Phi \Phi^T a \]
\[ \qquad \qquad \qquad \qquad \qquad = \frac{1}{2} a^T \Phi \Phi^T \Phi \Phi^T a - a^T \Phi \Phi^T y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T \Phi \Phi^T a \]
Define \(K = \Phi \Phi^T \)
\[\therefore L(a) = \frac{1}{2} a^T K K a - a^T K y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T K a \]
Again setting the gradient of the loss function (now with respect to \(a\)) to zero, we get
\[ K K a - K y + \lambda K a = 0 \]
\[ \implies K (K + \lambda I) a = K y \]
\[ \implies (K + \lambda I) a = y \]
\[ \implies a = (K + \lambda I)^{-1} y \]
Substituting back, the prediction for a new input \(x\) is
\[ y(x) = \theta^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I)^{-1} y \]
where \(k(x)\) is the vector with elements \(k_n(x) = k(x_n, x)\).
Thus we see that the dual representation allows the solution to the least-squares problem to be expressed entirely in terms of the kernel. We can therefore work directly in terms of the kernel and avoid the explicit introduction of the feature vector \(\phi(.)\), which allows us to implicitly use feature spaces of high dimensionality. This dual representation based on the kernel is a property of many linear models (with their quadratic objectives having only global optima and exhibiting fast convergence), including the perceptron, SVMs and more.
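A minimal sketch of this dual solution, i.e. kernel ridge regression (NumPy; the RBF kernel, its width, \(\lambda\) and the toy data are all arbitrary choices):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), evaluated for all pairs of rows
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 0.1
K = rbf_kernel(X, X)                                 # K = Phi Phi^T, implicitly
a = np.linalg.solve(K + lam * np.eye(len(X)), y)     # a = (K + lambda I)^{-1} y

X_new = np.array([[0.5]])
k_new = rbf_kernel(X, X_new)[:, 0]                   # vector k(x) with entries k(x_n, x)
print(k_new @ a)                                     # y(x) = k(x)^T a, close to sin(0.5)
```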
To sum up, the kernel trick is powerful for the following reasons: it gives an implicit, possibly very high-dimensional feature mapping while all computations stay in the original data space; standard kernels and their combinations can be used without having to design \(\phi(.)\) by hand; and it applies to any linear model that uses the data only through inner products.
Support Vector Machines
Recall that the perceptron algorithm is guaranteed to find a decision boundary for linearly separable data, but the result depends heavily on the initialization (and the order in which the points are presented). SVM is a more systematic and optimal approach to finding the decision boundary: it chooses the one that maximizes the buffer (margin) around it.
Given training data \( (x_1, t_1), (x_2, t_2), \ldots, (x_n, t_n) \), a two-class problem with a linear model is:
\[ y(x) = \theta^T \phi(x) + b \]
Assume the problem to be linearly separable in the feature space. Then \( \exists\, \theta \in \mathbb{R}^d, b \in \mathbb{R}\) such that \( t_n y(x_n) > 0 \) for all training points (just think of the sign of this product when the class is correctly or incorrectly identified).
Now, we are interested in \(\theta, b\) s.t. the margin \(M\) (the perpendicular distance between the decision boundary and the closest of the data points) is maximized.
Again, as previously discussed, any data point \(x_i\) can be decomposed into its projection \(x_{i\perp}\) onto the decision surface plus a component of (signed) length \(r_i\) along the normal direction \(\theta/||\theta||\),
\[x_i = x_{i\perp} + r_i \frac{\theta}{||\theta||}\]
\[\implies (\theta^T x_i + b) = (\theta^T x_{i\perp} + b) + r_i \frac{\theta^T \theta}{||\theta||} = 0 + r_i ||\theta|| \]
\[\implies r_i = \frac{\theta^T x_i + b}{||\theta||} = \frac{y(x_i)}{||\theta||} \]
Since we are only interested in correctly classified data points, i.e. \( t_i y(x_i) > 0 \;\forall i \in 1,2,...,n \), the distance of \(x_i\) from the boundary is \(r_i = \frac{t_i y(x_i)}{||\theta||}\). The goal of SVM is to find the hyperplane that has the maximum margin \(M\) from the closest points, i.e. we maximize the minimum distance, a max-min problem:
\[ \underset{\theta, b}{\operatorname{max}} \bigg( \underset{i}{\operatorname{min}} \frac{t_i y(x_i)}{||\theta||} \bigg) \textrm{ s.t. } t_i y(x_i) > 0\]
We can simplify this problem by noting that the distances \(r_i\) are unchanged if we rescale \(\theta \rightarrow \kappa\theta\) and \(b \rightarrow \kappa b\), so we are free to fix the scale. Rather than fixing the norm \(||\theta|| = 1\), we fix the product \(M||\theta|| = 1\), so that for the point(s) closest to the boundary,
\[ M = \frac{t_i y(x_i)}{||\theta||} \implies M||\theta|| = t_i y(x_i) = 1 \]
Hence, our problem can be rewritten as,
\[ \underset{\theta, b}{\operatorname{max}} \bigg( \frac{1}{||\theta||} \underset{i}{\operatorname{min}}\; t_i y(x_i) \bigg) \textrm{ s.t. } \underset{i}{\operatorname{min}}\; t_i y(x_i) = 1\]
\[ \implies \underset{\theta, b}{\operatorname{max}} \bigg( \frac{1}{||\theta||}\bigg) \textrm{ s.t. } \underset{i}{\operatorname{min}}\; t_i y(x_i) = 1\]
\[ \implies \underset{\theta, b}{\operatorname{min}} ||\theta||^2 \textrm{ s.t. } \underset{i}{\operatorname{min}}\; t_i y(x_i) = 1\]
or, relaxing the equality to an inequality constraint that is tight at the optimum,
\[ \underset{\theta, b}{\operatorname{min}} ||\theta||^2 \textrm{ s.t. } t_i y(x_i) \ge 1 \forall i \in 1,2,...,n\]
This can be solved as a constrained optimization problem using the Lagrangian formulation.
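As a sketch of how this constrained problem is usually handed to an off-the-shelf solver (here scikit-learn's `SVC` with a linear kernel on made-up, well-separated data; a large `C` roughly approximates the hard-margin formulation above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs => (almost surely) linearly separable toy data
X = np.vstack([rng.normal(loc=-3, size=(50, 2)), rng.normal(loc=+3, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1e6).fit(X, t)    # large C ~ hard margin

theta = clf.coef_[0]
margin = 1.0 / np.linalg.norm(theta)           # M = 1/||theta|| once min_i t_i y(x_i) = 1
print(margin)
print(clf.support_vectors_)                    # the closest points, which define the margin
```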