What is it?
Given a search space (the set of all possible solutions) and an objective function (a quality criterion), find the best solution according to that criterion. Formally,
\[\textrm{min/max } f : \Omega \to \mathbb{R}\]
\[x \mapsto f(x)\]
subject to,
\[g_i(x) \le 0\]
\[h_i(x) = 0\]
where,
\( \Omega: \textrm{search space} \)
\(x: \textrm{a candidate solution, } x \in \Omega\)
\(g_i(x): \textrm{inequality constraint}\)
\(h_i(x): \textrm{equality constraint}\)
Examples:
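A minimal one (of our own choosing): minimize the squared distance to 2 over the nonnegative reals,
\[\textrm{min } f(x) = (x-2)^2\]
subject to
\[g_1(x) = -x \le 0\]
The minimum is attained at \(x^* = 2\), where the constraint is satisfied and hence inactive.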
1st order derivability or differentiability
Assuming \(n=1\), let \(f: \mathbb{R} \to \mathbb{R}\).
We say that \(f\) is differentiable at \(x\) if
\[ \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \textrm{ exists.} \]
This limit is denoted by \(f'(x)\) and is called the derivative of \(f\) at \(x\).
Taylor's 1st order approximation
If \(f\) is differentiable at \(x\), then the first order Taylor expansion is given by
\[f(x+h)=f(x)+f'(x)h+o(|h|)\]
For \(h\) small enough, \(h \mapsto f(x+h)\) is approximated by,
\[h \mapsto f(x) + f'(x)h \]
giving the first order approximation of \(f\).
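As a quick sanity check (a minimal sketch; the choice \(f = \exp\) is ours, picked because its derivative is known exactly), the approximation error should vanish faster than \(h\):
```python
import math

# First-order Taylor approximation of f(x) = exp(x) around x = 1:
# f(x + h) ~ f(x) + f'(x) * h, with f'(x) = exp(x).
x = 1.0
f, df = math.exp(x), math.exp(x)

for h in (1e-1, 1e-2, 1e-3):
    exact = math.exp(x + h)
    approx = f + df * h
    # The error shrinks like h^2, i.e. faster than h -- this is the o(|h|) term.
    print(f"h={h:.0e}  error={abs(exact - approx):.2e}")
```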
Gradient
We can generalize the notion of derivative to higher dimensions. Given \(f: \mathbb{R}^n \to \mathbb{R}\),
\[\nabla f(x) = \left(\begin{array}{c} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{array}\right) \]
The gradient of a differentiable function is orthogonal to the level sets.
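In practice, a hand-derived gradient can be checked against finite differences. A minimal sketch, assuming a simple quadratic test function of our own choosing:
```python
import numpy as np

def f(x):
    # Illustrative test function f(x) = x0^2 + 3*x1^2.
    return x[0]**2 + 3.0 * x[1]**2

def grad_f(x):
    # Hand-derived gradient of f.
    return np.array([2.0 * x[0], 6.0 * x[1]])

def numerical_gradient(f, x, eps=1e-6):
    # Central finite differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

x = np.array([1.0, -2.0])
print(grad_f(x))                 # analytic: [ 2. -12.]
print(numerical_gradient(f, x))  # should agree to ~1e-6
```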
2nd order derivability or differentiability
Let \(f: \mathbb{R} \to \mathbb{R}\) be differentiable on \(\mathbb{R}\) and let
\[f': x \mapsto f'(x) \]
be its derivative function.
Now if \(f'\) is itself differentiable, then we denote by \(f''(x)\) the second order derivative of \(f\).
Taylor's 2nd order approximation
If the second order derivative of \(f\) exists, then the second order Taylor expansion is given by
\[f(x+h)=f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 + o(|h|^2)\]
For \(h\) small enough, we get the quadratic approximation of \(f\),
\[h \mapsto f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 \]
Hessian
Again, we can generalize the second order derivative to functions \(f: \mathbb{R}^n \mapsto \mathbb{R}\) with the notion of Hessian matrix,
\[ \mathbb{H}(x) = \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}\]
If \(f(x) = \frac{1}{2}x^TAx\) with \(A\) symmetric, then
\[ \mathbb{H}(x) = \nabla^2 f(x) = A \]
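One way to see this concretely is to recover \(A\) by finite differences. A sketch, with an arbitrary symmetric \(A\) chosen for illustration:
```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # symmetric by construction

def f(x):
    return 0.5 * x @ A @ x          # f(x) = (1/2) x^T A x

def numerical_hessian(f, x, eps=1e-4):
    # eps balances truncation error against floating-point round-off.
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            # Second-order central difference for d^2 f / (dx_i dx_j).
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

x = np.array([0.3, -1.2])
print(numerical_hessian(f, x))  # ~A, independent of x
```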
Local minima
\[x^*: \exists \textrm{ a neighborhood } V \textrm{ of } x^* \textrm{ s.t. }\]
\[\forall x \in V: f(x) \ge f(x^*)\]
If \(f: \mathbb{R} \rightarrow \mathbb{R}\) is differentiable, then \(f'(x^*) = 0\) at any such local optimum (though not every critical point is a local optimum, e.g. \(f(x)=x^3\) at \(x=0\)).
Global minima
\[x^*: \forall x \in \Omega: f(x) \ge f(x^*)\]
Since a global minimum is in particular a local one, \(f'(x^*) = 0\) again holds when \(f: \mathbb{R} \rightarrow \mathbb{R}\) is differentiable.
Optimality conditions
Assume that \(f\) is twice continuously differentiable.
Necessary conditions:
If \(x^*\) is a local minimum, then \(\nabla f(x^*) = 0 \textrm{ and } \nabla^2 f(x^*) \) is positive semi-definite.
Sufficient conditions:
If \(x^*\) satisfies \(\nabla f(x^*) = 0\) and \(\nabla^2 f(x^*)\) is positive definite, then \(x^*\) is a strict local minimum.
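These conditions can be tested numerically at a candidate point by checking the gradient and the eigenvalues of the Hessian. A minimal sketch, using an illustrative function whose minimum is at the origin:
```python
import numpy as np

def grad(x):
    # Gradient of f(x) = x0^2 + 2*x1^2 (illustrative).
    return np.array([2 * x[0], 4 * x[1]])

def hess(x):
    # Hessian of the same f; constant here.
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x_star = np.array([0.0, 0.0])
g, H = grad(x_star), hess(x_star)
eigvals = np.linalg.eigvalsh(H)     # eigvalsh: eigenvalues of a symmetric matrix

# Sufficient conditions: zero gradient and strictly positive eigenvalues.
print(np.allclose(g, 0), np.all(eigvals > 0))  # True True -> strict local min
```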
Convex functions
Definition:
Let \(f: U \subset \mathbb{R}^n \to \mathbb{R}\) with \(U\) convex. Then \(f\) is convex if \(\forall x, y \in U\) and \(\forall t \in [0,1]\),
\[ f(tx + (1-t)y) \le tf(x) + (1-t)f(y) \]
Theorem:
If \(f\) is differentiable, then it is convex iff \(\forall x,y\),
\[ f(y) - f(x) \ge \nabla f(x) \cdot (y-x) \]
Theorem:
If \(f\) is twice continuously differentiable, then it is convex iff \( \nabla^2 f(x) \textrm{ is positive semi-definite } \forall x\).
For differentiable and convex functions, critical points are global minima of the function.
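Under the third theorem, convexity of a twice continuously differentiable function can be probed by sampling points and checking that the Hessian is positive semi-definite at each. A sketch, with an illustrative function that is known to be convex:
```python
import numpy as np

def hess(x):
    # Hessian of f(x) = x0^2 + x1^2 + exp(x0 + x1) (illustrative, convex).
    e = np.exp(x[0] + x[1])
    return np.array([[2 + e, e],
                     [e, 2 + e]])

rng = np.random.default_rng(0)
# A positive semi-definite Hessian at every sampled point is (numerical)
# evidence of convexity; a single negative eigenvalue would disprove it.
ok = all(np.linalg.eigvalsh(hess(x)).min() >= 0
         for x in rng.normal(size=(100, 2)))
print(ok)  # True
```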
Difficulties
Functions can be difficult to optimize due to, for example, non-convexity, ruggedness (non-smoothness), high dimensionality, or ill-conditioning; the last of these is illustrated below.
Gradient direction vs Newton direction
The gradient direction \(\nabla f(x)\) defines the direction of maximum ascent. The negative of the gradient then gives us the steepest descent direction, and it points directly towards the optimum iff the Hessian \(\nabla^2 f(x) = \mathbb{H}\) is the identity \(I\), i.e. its condition number is equal to 1. In cases with poor conditioning, gradient descent becomes slow to converge.
For convex quadratic functions, the Newton direction points towards the optimum independently of the condition number of the Hessian matrix. For other functions, the Newton direction will not always point towards the optimum, but it is still a good direction to follow.
Recall the second order approximation,
\[f(x+h) \approx f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 \]
We seek a step \(h\) at which this quadratic model is stationary, i.e. its derivative with respect to \(h\) vanishes. This implies that,
\[f'(x) + f''(x)h = 0 \]
\[\implies h = - [f''(x)]^{-1} f'(x) \]
gives the Newton direction.
In some settings we can compute the Newton direction analytically, in which case we should. Otherwise we need to approximate \(\nabla^2 f(x)\) numerically and invert it, which can be very expensive.
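A minimal sketch of Newton's method in one dimension, assuming both derivatives are available in closed form (the test function is an illustrative choice):
```python
def newton(df, d2f, x0, tol=1e-10, max_iter=50):
    # Newton's method: repeatedly step by h = -f'(x) / f''(x).
    x = x0
    for _ in range(max_iter):
        h = -df(x) / d2f(x)
        x += h
        if abs(h) < tol:
            break
    return x

# Minimize f(x) = x^4 - 3x (illustrative): f'(x) = 4x^3 - 3, f''(x) = 12x^2.
x_star = newton(lambda x: 4 * x**3 - 3, lambda x: 12 * x**2, x0=1.0)
print(x_star)  # ~ (3/4)**(1/3) ~ 0.9086
```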
Quasi-Newton methods
In these methods, we try to get the best of Newton's method while avoiding its expensive computation: a matrix \(\mathbb{H}_t\) approximating the inverse Hessian \([\nabla^2 f(x_t)]^{-1}\) is updated iteratively from the gradients \(\nabla f(x_t)\), giving the update
\[ x_{t+1} = x_t - \sigma_t \mathbb{H}_t \nabla f(x_t) \]
Examples include BFGS, L-BFGS.
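As a usage sketch, both are available through `scipy.optimize.minimize`; the Rosenbrock test function and its gradient ship with SciPy:
```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])

# BFGS maintains a dense approximation of the inverse Hessian.
res = minimize(rosen, x0, jac=rosen_der, method='BFGS')

# L-BFGS-B stores only a few recent gradient pairs -- cheaper in high dimension.
res_l = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B')

print(res.x, res_l.x)  # both ~ [1. 1.], the Rosenbrock minimum
```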
Stochastic gradient descent
With gradient descent,
\[ \nabla Q(w) = \frac{1}{N} \sum_{i=1}^N \nabla Q_i (w) \] and
\[ w_{t+1} = w_t - \sigma_t \nabla Q(w_t) \]
Now typically \(N\) can be very large, and computing all \(N\) gradients \(\nabla Q_i(w)\) at every step would be expensive. So instead, we can use a cheap stochastic approximation of \(\nabla Q(w)\),
\[ \nabla Q(w) \approx \nabla Q_i(w) \textrm{, the gradient of a single, randomly chosen example,} \]
or average over a random mini-batch \(\mathcal{B}\) of size \(n_{\textrm{batch}}\),
\[ \nabla Q(w) \approx \frac{1}{n_{\textrm{batch}}} \sum_{i \in \mathcal{B}} \nabla Q_i (w) \]
This is the idea of stochastic gradient descent.
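A minimal sketch of mini-batch SGD on least-squares regression, where \(Q_i(w) = \frac{1}{2}(x_i^T w - y_i)^2\) (data, batch size, and step size are illustrative choices):
```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(d)
sigma, n_batch = 0.05, 32
for t in range(2000):
    # Sample a random mini-batch instead of touching all N examples.
    idx = rng.integers(0, N, size=n_batch)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / n_batch   # mini-batch gradient of Q
    w -= sigma * grad

print(np.linalg.norm(w - w_true))  # small -> w has converged near w_true
```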
Constrained optimization
Given a constrained optimization problem,
\[\textrm{min } f(x), \quad f:\mathbb{R}^n \to \mathbb{R}\]
subject to,
\[g_i(x) \le 0\]
then the unconstrained minimum lies either inside the feasible region or outside it. In the former case, the constraint \(g_i(x)\) plays no role and is referred to as an inactive constraint. In the latter case, the best we can do is a minimum on the boundary of the constraint, which is then referred to as an active constraint. In that case, \(\nabla f(x^*)\) and \(\nabla g(x^*)\) are collinear, i.e.,
\[ \exists \lambda \in \mathbb{R} \textrm{ s.t. }\]
\[ \nabla f(x^*) = - \lambda \nabla g(x^*) \]
\[ \implies \nabla f(x^*) + \lambda \nabla g(x^*) = 0 \]
where \(\lambda\) is known as the Lagrange multiplier, and
if the constraint is inactive, \(\lambda = 0\)
if the constraint is active, \(\lambda > 0\)
In general, the following conditions (the KKT conditions, together with the stationarity condition above) must hold,
\[\lambda \ge 0\]
\[g(x) \le 0\]
\[\lambda g(x) = 0\]
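As a usage sketch, SciPy's SLSQP solver handles such problems; note its sign convention expects inequality constraints as \(c(x) \ge 0\), so we pass \(-g\). The objective and constraint below are illustrative:
```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Objective: squared distance to (2, 2) (illustrative).
    return (x[0] - 2)**2 + (x[1] - 2)**2

def g(x):
    # Constraint g(x) = x0 + x1 - 1 <= 0; the unconstrained
    # minimum (2, 2) violates it, so the constraint will be active.
    return x[0] + x[1] - 1

cons = [{'type': 'ineq', 'fun': lambda x: -g(x)}]  # SciPy wants c(x) >= 0
res = minimize(f, x0=np.zeros(2), method='SLSQP', constraints=cons)

print(res.x)  # ~ [0.5, 0.5], on the boundary x0 + x1 = 1
# At the solution, grad f = (-3, -3) and grad g = (1, 1) are collinear,
# with lambda = 3 > 0, as the KKT conditions predict.
```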