**What is it?**

Given a search space \(\Omega\) (the set of all candidate solutions) and an objective function (the quality criterion), find the best solution according to that criterion. Formally,
\[\textrm{min/max } f:\Omega \mapsto \mathbb{R}\]
\[x \mapsto f(x)\]
subject to,
\[g_i(x) \le 0\]
\[h_i(x) = 0\]
where,

\( \Omega: \textrm{search space} \)

\(x: \textrm{decision variable, a candidate solution in } \Omega\)

\(g_i(x): \textrm{inequality constraint}\)

\(h_i(x): \textrm{equality constraint}\)
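As a minimal concrete instance (an example chosen here for illustration), take \(\Omega = \mathbb{R}^2\) and
\[\textrm{min } f(x) = x_1^2 + x_2^2\]
subject to
\[g_1(x) = 1 - x_1 \le 0\]
The constraint forces \(x_1 \ge 1\), so the best solution is \(x^* = (1, 0)\) with \(f(x^*) = 1\).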

There are many kinds of optimization problems with different properties. In practice, most can be categorized into classes for which well-suited algorithms exist:

- Discrete vs continuous
  - Discrete: integer programming, combinatorial problems
  - Continuous: linear, quadratic, smooth/non-smooth, black-box
  - Both discrete and continuous variables: mixed-integer problems
  - Categorical variables (no order)

- Unconstrained vs constrained
- Deterministic vs stochastic objective function
- Single vs multiple objective functions

**1st order derivability or differentiability**

In one dimension (\(n=1\)), let \(f: \mathbb{R} \mapsto \mathbb{R}\).

We say that \(f\) is differentiable at \(x\) if
\[ \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \textrm{ exists.} \]
This limit is denoted \(f'(x)\) and is called the derivative of \(f\) at \(x\).
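The limit above can be checked numerically with a finite-difference quotient; a small sketch (the function and evaluation point are chosen here for illustration):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = sin(x), whose exact derivative is cos(x).
x = 1.0
approx = numerical_derivative(math.sin, x)
exact = math.cos(x)
print(approx, exact)  # the two values agree to high precision
```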

**Taylor's 1st order approximation**

If \(f\) is differentiable at \(x\), then the first order Taylor expansion is given by
\[f(x+h)=f(x)+f'(x)h+o(|h|)\]
For \(h\) small enough, \(h \mapsto f(x+h)\) is approximated by
\[h \mapsto f(x) + f'(x)h \]
giving the first order approximation of \(f\).
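The \(o\)-term can be observed numerically: shrinking \(h\) by a factor of 10 should shrink the approximation error by much more than 10. A sketch with \(f = \exp\) (a function chosen here for illustration):

```python
import math

def taylor1_error(x, h):
    # |f(x+h) - (f(x) + f'(x) h)| for f = exp, whose derivative is exp.
    return abs(math.exp(x + h) - (math.exp(x) + math.exp(x) * h))

x = 0.5
for h in [1e-1, 1e-2, 1e-3]:
    print(h, taylor1_error(x, h))  # error shrinks like h^2, faster than h
```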

**Gradient**

We can generalize the notion of derivative to higher dimensions. Given \(f: \mathbb{R}^n \mapsto \mathbb{R}\),
\[\nabla f(x) = \left(\begin{array}{c} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{array}\right) \]
The gradient of a differentiable function is orthogonal to the level sets.
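The orthogonality claim can be spot-checked for \(f(x, y) = x^2 + y^2\), whose level sets are circles (an example chosen here): at any point of the unit circle, the gradient is perpendicular to the circle's tangent direction.

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + y^2.
    return (2 * x, 2 * y)

# A point on the level set f = 1 (the unit circle).
theta = 0.7
x, y = math.cos(theta), math.sin(theta)

# Tangent direction to the circle at (x, y).
tangent = (-math.sin(theta), math.cos(theta))

g = grad_f(x, y)
dot = g[0] * tangent[0] + g[1] * tangent[1]
print(dot)  # ~0: the gradient is orthogonal to the level set
```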

**2nd order derivability or differentiability**

Let \(f: \mathbb{R} \mapsto \mathbb{R}\) be differentiable and let
\[f':x \mapsto f'(x) \]
be its derivative function.
If \(f'\) is itself differentiable, we denote by \(f''(x)\) the second order derivative of \(f\).

**Taylor's 2nd order approximation**

If the second order derivative of \(f\) exists, then the second order Taylor expansion is given by
\[f(x+h)=f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 + o(|h|^2)\]
For \(h\) small enough, we get the quadratic approximation of \(f\),
\[h \mapsto f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 \]

**Hessian**

Again, we can generalize the second order derivative to functions \(f: \mathbb{R}^n \mapsto \mathbb{R}\) with the notion of Hessian matrix,
\[ \mathbb{H}(x) = \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}\]
If \(f(x) = \frac{1}{2}x^TAx\) with \(A\) symmetric, then \(\nabla f(x) = Ax\) and
\[ \mathbb{H}(x) = \nabla^2 f(x) = A \]
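This identity can be verified with finite differences for a small symmetric \(A\); a sketch with a \(2 \times 2\) matrix whose entries are chosen here for illustration:

```python
def quad(x, A):
    # f(x) = 0.5 * x^T A x for a 2x2 symmetric A, with x a pair.
    return 0.5 * (A[0][0]*x[0]*x[0] + 2*A[0][1]*x[0]*x[1] + A[1][1]*x[1]*x[1])

def hessian_fd(f, x, h=1e-4):
    # Finite-difference Hessian of a function of two variables.
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp = list(x); xpp[i] += h; xpp[j] += h
            xpm = list(x); xpm[i] += h; xpm[j] -= h
            xmp = list(x); xmp[i] -= h; xmp[j] += h
            xmm = list(x); xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return H

A = [[2.0, 1.0], [1.0, 3.0]]  # symmetric
H = hessian_fd(lambda x: quad(x, A), [0.3, -0.5])
print(H)  # ~A, independent of the evaluation point, since f is quadratic
```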

**Local minima**

\(x^*\) is a local minimum if there exists a neighborhood \(V\) of \(x^*\) such that
\[\forall x \in V: f(x) \ge f(x^*)\]
If \(f: \mathbb{R} \rightarrow \mathbb{R}\) is differentiable, then \(f'(x^*) = 0\) at any local minimum \(x^*\).

**Global minima**

\(x^*\) is a global minimum if
\[\forall x \in \Omega: f(x) \ge f(x^*)\]
If \(f: \mathbb{R} \rightarrow \mathbb{R}\) is differentiable, then \(f'(x^*) = 0\) holds at the global minimum as well, since a global minimum is in particular a local one.

**Optimality conditions**

Assume that \(f\) is twice continuously differentiable.
*Necessary conditions:*

If \(x^*\) is a local minimum, then \(\nabla f(x^*) = 0 \textrm{ and } \nabla^2 f(x^*) \) is positive semi-definite.
*Sufficient conditions:*

If \(x^*\) satisfies \(\nabla f(x^*) = 0 \textrm{ and } \nabla^2 f(x^*) \) is positive-definite, then \(x^*\) is a strict local minimum.
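As a worked instance (the function is chosen here for illustration), take \(f(x, y) = x^2 + 3y^2\): the gradient \((2x, 6y)\) vanishes only at the origin, and the Hessian is the constant matrix \(\mathrm{diag}(2, 6)\). Positive definiteness of a symmetric \(2 \times 2\) matrix can be checked from its eigenvalues:

```python
import math

def eig2x2_sym(a, b, d):
    # Eigenvalues of the symmetric matrix [[a, b], [b, d]],
    # computed from the trace and determinant.
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr / 4 - det)
    return (tr / 2 - disc, tr / 2 + disc)

# Hessian of f(x, y) = x^2 + 3y^2 is [[2, 0], [0, 6]] everywhere.
lam_min, lam_max = eig2x2_sym(2.0, 0.0, 6.0)
print(lam_min, lam_max)  # 2.0 6.0 -> both positive, so positive definite
```

Both eigenvalues are positive, so by the sufficient condition the origin is a strict local minimum.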

**Convex functions**
*Definition:*

Let \(f: U \subset \mathbb{R}^n \mapsto \mathbb{R} \) with \(U\) convex. Then \(f\) is convex if \(\forall x, y \in U\) and \(\forall t \in [0,1] \),
\[ f(tx + (1-t)y) \le tf(x) + (1-t)f(y) \]
*Theorem:*

If \(f\) is differentiable, then it is convex iff \(\forall x,y\),
\[ f(y) - f(x) \ge \nabla f(x) \cdot (y-x) \]
*Theorem:*

If \(f\) is twice continuously differentiable, then it is convex iff \( \nabla^2 f(x) \textrm{ is positive semi-definite } \forall x\).

For differentiable and convex functions, critical points are global minima of the function.
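The gradient inequality above can be spot-checked for a convex function such as \(f(x) = x^2\), with \(\nabla f(x) = 2x\) (the sample points are chosen here for illustration):

```python
def f(x):
    return x * x

def grad(x):
    # Derivative of f(x) = x^2.
    return 2 * x

# f(y) - f(x) >= grad(x) * (y - x) must hold for all x, y when f is convex.
pairs = [(-2.0, 3.0), (0.5, -1.5), (4.0, 4.1)]
for x, y in pairs:
    print(f(y) - f(x) >= grad(x) * (y - x))  # True for every pair
```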

**Difficulties**

Functions can be difficult to optimize for several reasons, among them poor conditioning of the Hessian, the cost of computing and inverting second-order information, and objective functions that are sums over very large datasets. The following sections address each of these in turn.

**Gradient direction vs Newton direction**

The gradient \(\nabla f(x)\) defines the direction of maximum ascent, so its negative gives the direction of maximum descent. This direction points directly towards the optimum iff \(\nabla^2 f(x) = \mathbb{H} = I\), i.e. the condition number of the Hessian equals 1. With poor conditioning, gradient descent becomes slow to converge.

For convex quadratic functions, the Newton direction points towards the optimum independent of the condition number of the Hessian matrix. For other functions, Newton direction will not always point towards the optimum but is still a good direction to follow.
Recall the second order approximation,
\[f(x+h) = f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 \]
We seek a step \(h\) at which the derivative of this approximation vanishes. This implies that,
\[f'(x) + f''(x)h = 0 \]
\[\implies h = - [f''(x)]^{-1} f'(x) \]
gives the Newton direction.

In some settings we can compute the Newton direction analytically, in which case we should. Otherwise we need to approximate \(\nabla^2 f(x)\) numerically and invert it, which can be very expensive.
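The contrast can be seen on the ill-conditioned quadratic \(f(x, y) = \frac{1}{2}(x^2 + 100y^2)\), whose Hessian \(\mathrm{diag}(1, 100)\) has condition number 100 (an example chosen here): gradient descent with a fixed step crawls along the shallow direction, while the Newton step \(h = -[\nabla^2 f(x)]^{-1}\nabla f(x)\) reaches the optimum in one iteration because \(f\) is quadratic.

```python
def grad(p):
    # Gradient of f(x, y) = 0.5 * (x^2 + 100 y^2).
    x, y = p
    return (x, 100.0 * y)

p = (10.0, 1.0)

# Gradient descent: the step must stay below 2/100 for stability,
# which makes progress along the x-axis very slow.
gd = p
for _ in range(100):
    g = grad(gd)
    gd = (gd[0] - 0.01 * g[0], gd[1] - 0.01 * g[1])

# Newton: the Hessian is diag(1, 100), so the Newton step divides each
# gradient component by the matching curvature.
g = grad(p)
newton = (p[0] - g[0] / 1.0, p[1] - g[1] / 100.0)

print(gd)      # after 100 steps, still far from (0, 0) along x
print(newton)  # exactly (0.0, 0.0) in one step
```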

**Quasi-Newton methods**

These methods try to get the benefits of Newton's method while avoiding its expensive computations: a matrix \(\mathbb{H}_t\) approximating the inverse Hessian \([\nabla^2 f(x_t)]^{-1}\) is updated iteratively using the gradients \(\nabla f(x_t)\), giving the iteration
\[ x_{t+1} = x_t - \sigma_t \mathbb{H}_t \nabla f(x_t) \]
Examples include BFGS, L-BFGS.
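In one dimension the idea reduces to the secant method: the curvature \(f''\) is approximated from two successive gradient evaluations, \((f'(x_t) - f'(x_{t-1}))/(x_t - x_{t-1})\), so no second derivative is ever computed. A sketch for \(f(x) = x^2 + e^x\) (a smooth strictly convex function chosen here for illustration):

```python
import math

def fprime(x):
    # Derivative of f(x) = x^2 + exp(x); f is smooth and strictly convex.
    return 2 * x + math.exp(x)

# Secant (1-D quasi-Newton) iteration: approximate the curvature from
# two successive gradients instead of computing f'' directly.
x_prev, x = 0.0, 1.0
for _ in range(100):
    fx = fprime(x)
    if abs(fx) < 1e-12 or x == x_prev:
        break  # converged
    curvature = (fx - fprime(x_prev)) / (x - x_prev)
    x_prev, x = x, x - fx / curvature
print(x)  # ~ -0.3517, where f'(x) = 0
```

BFGS and L-BFGS apply the same principle in \(\mathbb{R}^n\), with low-rank updates of \(\mathbb{H}_t\) in place of the scalar quotient.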

**Stochastic gradient descent**

With gradient descent,
\[ \nabla Q(w) = \frac{1}{N} \sum_{i=1}^N \nabla Q_i (w) \] and
\[ w_{t+1} = w_t - \sigma_t \nabla Q(w_t) \]
Typically \(N\) can be very large, and computing all the \(\nabla Q_i(w)\) at every step would be expensive. Instead, we can use an approximation of \(\nabla Q(w)\),
\[ \nabla Q(w) \approx \nabla Q_i(w) \textrm{, gradient of a single example} \]
or average over a random mini-batch \(B\) of size \(n_{batch}\),
\[ \nabla Q(w) \approx \frac{1}{n_{batch}} \sum_{i \in B} \nabla Q_i (w) \]
This is the idea of stochastic gradient descent.
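A minimal sketch of mini-batch SGD on a least-squares problem \(Q(w) = \frac{1}{N}\sum_i (w x_i - y_i)^2\) with synthetic 1-D data (all data and parameter values here are illustrative choices):

```python
import random

random.seed(0)

# Synthetic data: y = 3x plus noise, so the optimal w is near 3.
N = 1000
xs = [random.uniform(-1, 1) for _ in range(N)]
ys = [3.0 * x + random.gauss(0, 0.1) for x in xs]

def grad_qi(w, i):
    # Gradient of the single-example loss Q_i(w) = (w*x_i - y_i)^2.
    return 2 * (w * xs[i] - ys[i]) * xs[i]

w, step, n_batch = 0.0, 0.1, 32
for t in range(500):
    # Each step uses only a small random mini-batch, not all N examples.
    batch = random.sample(range(N), n_batch)
    g = sum(grad_qi(w, i) for i in batch) / n_batch
    w -= step * g
print(w)  # close to 3
```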

**Constrained optimization**

Given a constrained optimization problem,
\[\textrm{min } f(x),f:\mathbb{R}^n \mapsto \mathbb{R}\]
subject to,
\[g_i(x) \le 0\]
then the unconstrained minimum can lie either in the feasible region or in the infeasible region. In the former case, the constraint \(g_i(x)\) plays no role and is referred to as an inactive constraint. In the latter case, the best minimum we can reach lies on the boundary of the constraint, which is then referred to as an active constraint. In that case \(\nabla f(x^*)\) and \(\nabla g(x^*)\) are co-linear, i.e.,
\[ \exists \lambda \in \mathbb{R} \textrm{ s.t. }\]
\[ \nabla f(x^*) = - \lambda \nabla g(x^*) \]
\[ \implies \nabla f(x^*) + \lambda \nabla g(x^*) = 0 \]
where \(\lambda\) is known as the Lagrange multiplier, and

if the constraint is inactive, \(\lambda = 0\)

if the constraint is active, \(\lambda > 0\)

In general, the following (KKT) conditions must hold at the optimum,
\[\nabla f(x) + \lambda \nabla g(x) = 0\]
\[\lambda \ge 0\]
\[g(x) \le 0\]
\[\lambda g(x) = 0\]
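As a worked check (an example chosen here), consider \(\textrm{min } x^2\) subject to \(g(x) = 1 - x \le 0\). The unconstrained minimizer \(x = 0\) is infeasible, so the constraint is active: \(x^* = 1\), and the stationarity condition \(2x^* + \lambda \cdot (-1) = 0\) gives \(\lambda = 2 > 0\). The conditions can be verified directly:

```python
# Problem: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Candidate from the analysis above: x* = 1 with multiplier lam = 2.
x_star, lam = 1.0, 2.0

df = 2 * x_star   # f'(x*)
dg = -1.0         # g'(x*)
g = 1 - x_star    # g(x*), zero since the constraint is active

stationarity = df + lam * dg  # f'(x*) + lam * g'(x*), should be 0
print(stationarity, lam >= 0, g <= 0, lam * g)  # 0.0 True True 0.0
```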