Optimization

What is it?
Given a search space (the set of all candidate solutions) and an objective function (a quality criterion), find the best solution with respect to that criterion. Formally, \[\textrm{min/max } f:\Omega \mapsto \mathbb{R}\] \[x \mapsto f(x)\] subject to, \[g_i(x) \le 0\] \[h_j(x) = 0\] where,
\( \Omega: \textrm{search space} \)
\(x: \textrm{candidate solution}\)
\(g_i(x): \textrm{inequality constraints}\)
\(h_j(x): \textrm{equality constraints}\)
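
To make the pieces of this formulation concrete, here is a minimal sketch in Python (the objective and both constraints below are invented purely for illustration):

    import numpy as np

    # Toy instance of: min f(x) subject to g(x) <= 0, h(x) = 0
    def f(x):                        # objective: sum of squares
        return np.sum(x**2)

    def g(x):                        # inequality constraint g(x) <= 0
        return 1.0 - np.sum(x**2)    # satisfied iff ||x||^2 >= 1

    def h(x):                        # equality constraint h(x) = 0
        return np.sum(x) - 1.0       # satisfied iff coordinates sum to 1

    x = np.array([1.0, 0.0])
    print(f(x), g(x) <= 0, np.isclose(h(x), 0.0))   # 1.0 True True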

Examples:

  • Combinatorial optimization (knapsack problem, traveling salesman problem)
  • Constrained continuous optimization (optimizing a 2-phase nozzle, design of a launcher)
  • Stochastic optimization (covariance matrix adaptation evolution strategy, CMA-ES)
  • Supervised learning (linear regression, classification with SVM)
  • Hyperparameter tuning
  • Interactive optimization (coffee tasting problem)

    We observe that there are many kinds of problems with very different properties. In practice, most can be placed into categories for which well-suited algorithms exist. Here we discuss notions of continuous optimization.

    1st order derivability or differentiability
    Assuming \(n=1\), let \(f: \mathbb{R} \mapsto \mathbb{R}\).
    We say that \(f\) is differentiable at \(x\) if \[ \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \textrm{ exists.} \] This limit is denoted by \(f'(x)\) and is called the derivative of \(f\) at \(x\).
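
We can check this limit numerically with difference quotients. A small sketch (the function \(\sin\) and the point are arbitrary choices for illustration):

    import numpy as np

    f = np.sin                # example function, with f'(x) = cos(x)
    x = 1.0
    for h in [1e-1, 1e-3, 1e-5]:
        quotient = (f(x + h) - f(x)) / h
        # the difference quotient approaches cos(1) as h -> 0
        print(h, quotient, abs(quotient - np.cos(x)))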

    Taylor's 1st order approximation
    If \(f\) is differentiable at \(x\), then the first order Taylor expansion is given by \[f(x+h)=f(x)+f'(x)h+o(||h||)\] For \(h\) small enough, \(h \mapsto f(x+h)\) is approximated by, \[h \mapsto f(x) + f'(x)h \] giving the first order approximation of \(f\).
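
Numerically, the error of this linear model shrinks faster than \(h\) (for smooth \(f\) it is in fact \(O(h^2)\)). A quick sketch, again with \(f = \sin\):

    import numpy as np

    x = 1.0
    for h in [1e-1, 1e-2, 1e-3]:
        linear = np.sin(x) + np.cos(x) * h      # first order model
        error = abs(np.sin(x + h) - linear)
        print(h, error)                         # error ~ h^2, hence o(h)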

    Gradient
    We can generalize the notion of derivative to higher dimensions. Given \(f: \mathbb{R}^n \mapsto \mathbb{R}\), \[\nabla f(x) = \left(\begin{array}{c} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{array}\right) \] The gradient of a differentiable function is orthogonal to its level sets.
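
A gradient can be approximated with central finite differences, and the orthogonality to level sets can be checked on an example (the quadratic below and the point on its level set are chosen for illustration):

    import numpy as np

    def f(x):                         # example: f(x) = x1^2 + 4*x2^2
        return x[0]**2 + 4 * x[1]**2

    def grad(f, x, eps=1e-6):         # central finite differences
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    x = np.array([1.0, 1.0])          # lies on the level set f(x) = 5
    g = grad(f, x)                    # analytic gradient is (2, 8)
    tangent = np.array([-2.0, 0.5])   # tangent to the ellipse f(x) = 5 at x
    print(g, np.dot(g, tangent))      # dot product ~ 0: gradient _|_ level set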

    2nd order derivability or differentiability
    Let \(f: \mathbb{R} \mapsto \mathbb{R}\) be differentiable on \(\mathbb{R}\) and let \[f': x \mapsto f'(x) \] be its derivative function. If \(f'\) is itself differentiable, then we denote by \(f''(x)\) the second order derivative of \(f \).

    Taylor's 2nd order approximation
    If the second order derivative of \(f\) exists, then the second order Taylor expansion is given by, \[f(x+h)=f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 + o(||h||^2)\] For \(h\) small enough, we get the quadratic approximation of \(f\), \[h \mapsto f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 \]
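
The quadratic model is one order more accurate than the linear one, as a small numerical check shows (again with \(f = \sin\), so \(f''(x) = -\sin(x)\)):

    import numpy as np

    x = 1.0
    for h in [1e-1, 1e-2, 1e-3]:
        quadratic = np.sin(x) + np.cos(x) * h - 0.5 * np.sin(x) * h**2
        print(h, abs(np.sin(x + h) - quadratic))   # error ~ h^3, hence o(h^2)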

    Hessian
    Again, we can generalize the second order derivative to functions \(f: \mathbb{R}^n \mapsto \mathbb{R}\) with the notion of the Hessian matrix, \[ \mathbb{H}(x) = \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}\] If \(f(x) = \frac{1}{2}x^TAx\) with \(A\) symmetric, then \[ \mathbb{H}(x) = \nabla^2 f(x) = A \]
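
The identity \(\nabla^2 f(x) = A\) is easy to confirm numerically. A sketch with finite differences (the symmetric matrix \(A\) is arbitrary):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])                 # symmetric by construction

    def f(x):
        return 0.5 * x @ A @ x

    def hessian(f, x, eps=1e-4):               # finite-difference Hessian
        n = len(x)
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei = np.zeros(n); ei[i] = eps
                ej = np.zeros(n); ej[j] = eps
                H[i, j] = (f(x + ei + ej) - f(x + ei)
                           - f(x + ej) + f(x)) / eps**2
        return H

    print(hessian(f, np.array([0.3, -0.7])))   # ~ A, independent of x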

    Local minima
    A point \(x^*\) is a local minimum if there exists a neighborhood \(V\) of \(x^*\) such that \[\forall x \in V: f(x) \ge f(x^*)\]
    If \(f: \mathbb{R} \rightarrow \mathbb{R}\) is differentiable, then \(f'(x^*) = 0\) at any local optimum.

    Global minima
    A point \(x^*\) is a global minimum if \[\forall x \in \Omega: f(x) \ge f(x^*)\]
    If \(f: \mathbb{R} \rightarrow \mathbb{R}\) is differentiable, then \(f'(x^*) = 0\) here as well, since a global minimum is in particular a local one.

    Optimality conditions
    Assume that \(f\) is twice continuously differentiable.

    Necessary conditions:
    If \(x^*\) is a local minimum, then \(\nabla f(x^*) = 0 \textrm{ and } \nabla^2 f(x^*) \) is positive semi-definite.

    Sufficient conditions:
    If \(x^*\) satisfies \(\nabla f(x^*) = 0 \textrm{ and } \nabla^2 f(x^*) \) is positive definite, then \(x^*\) is a strict local minimum.
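
Both conditions are straightforward to verify programmatically: check that the gradient vanishes and inspect the eigenvalues of the Hessian. A sketch for an example quadratic with known minimum at the origin:

    import numpy as np

    # f(x) = x1^2 + 4*x2^2, gradient (2*x1, 8*x2), constant Hessian
    x_star = np.array([0.0, 0.0])
    grad = np.array([2 * x_star[0], 8 * x_star[1]])
    H = np.array([[2.0, 0.0],
                  [0.0, 8.0]])

    eigvals = np.linalg.eigvalsh(H)          # eigenvalues of the Hessian
    print(np.allclose(grad, 0), eigvals)     # True, [2. 8.]
    print(np.all(eigvals > 0))               # positive definite: strict local minimum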

    Convex functions
    Theorem:
    Let \(f: U \subset \mathbb{R}^n \mapsto \mathbb{R} \) with \(U\) convex. Then \(f\) is convex if \(\forall x, y \in U\) and \(\forall t \in [0,1] \), \[ f(tx + (1-t)y) \le tf(x) + (1-t)f(y) \] Theorem:
    If \(f\) is differentiable, then it is convex iff \(\forall x,y\), \[ f(y) - f(x) \ge \nabla f(x) \cdot (y-x) \] Theorem:
    If \(f\) is twice continuously differentiable, then it is convex iff \( \nabla^2 f(x) \textrm{ is positive semi-definite } \forall x\).

    For differentiable and convex functions, critical points are global minima of the function.
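
The defining inequality can also be spot-checked numerically for a candidate function; a failed check disproves convexity, while passing checks merely support it. A sketch with a convex example:

    import numpy as np

    def f(x):                      # example: sum of squares, convex
        return np.sum(x**2)

    rng = np.random.default_rng(0)
    ok = True
    for _ in range(1000):
        x, y = rng.normal(size=3), rng.normal(size=3)
        t = rng.uniform()
        ok &= f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) + 1e-12
    print(ok)                      # True: no counterexample found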

    Difficulties
    Functions can be difficult to optimize due to:

  • non-linearity, non-quadraticity, non-convexity
  • ruggedness
  • dimensionality
  • non-separability
  • ill-conditioning

    Gradient direction vs Newton direction
    The gradient direction \(\nabla f(x)\) is the direction of steepest ascent. The negative of the gradient then gives us the steepest descent direction, and it points directly towards the optimum iff \(\nabla^2 f(x) = \mathbb{H} = I\), i.e., the condition number of the Hessian equals 1. In cases with poor conditioning, gradient descent becomes slow to converge.

    For convex quadratic functions, the Newton direction points towards the optimum independent of the condition number of the Hessian matrix. For other functions, the Newton direction will not always point towards the optimum but is still a good direction to follow. Recall the second order approximation, \[f(x+h) = f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 \] We seek a step \(h\) at which the derivative of this quadratic model vanishes, so that \(\nabla f(x+h) \approx 0\). This gives \[f'(x) + f''(x)h = 0 \] \[\implies h = - [f''(x)]^{-1} f'(x) \] which is the Newton direction.
    In some settings, we can compute the Newton direction analytically, in which case we should. Otherwise, we must approximate \(\nabla^2 f(x)\) numerically and invert it, which can be very expensive.
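
A minimal sketch of the resulting Newton iteration in one dimension (the function is an arbitrary smooth convex example; practical implementations add safeguards such as line search):

    import numpy as np

    # minimize f(x) = x^2 + exp(x); f'(x) = 2x + e^x, f''(x) = 2 + e^x
    x = 0.0
    for _ in range(6):
        fp = 2 * x + np.exp(x)     # first derivative
        fpp = 2 + np.exp(x)        # second derivative
        x -= fp / fpp              # Newton step h = -f'(x) / f''(x)
        print(x, abs(fp))          # converges quadratically to ~ -0.3517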

    Quasi-Newton methods
    In these methods, we try to retain the benefits of Newton's method while avoiding its expensive computations: a matrix \(\mathbb{H}_t\) that approximates the inverse of \(\nabla^2 f(x_t)\) is updated iteratively using \(\nabla f(x_t)\), \[ x_{t+1} = x_t - \sigma_t \mathbb{H}_t \nabla f(x_t) \] Examples include BFGS and L-BFGS.
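
SciPy ships a ready implementation; a minimal usage sketch on the Rosenbrock function (which SciPy provides together with its gradient):

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.array([-1.2, 1.0])                   # classic starting point
    res = minimize(rosen, x0, method='BFGS', jac=rosen_der)
    print(res.x, res.nit)                        # converges to (1, 1)

For large \(n\), method='L-BFGS-B' stores only a few recent gradient pairs instead of a full \(n \times n\) matrix.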

    Stochastic gradient descent
    With gradient descent, \[ \nabla Q(w) = \frac{1}{N} \sum_{i=1}^N \nabla Q_i (w) \] and \[ w_{t+1} = w_t - \sigma_t \nabla Q(w_t) \] Typically \(N\) can be very large, so computing all \(\nabla Q_i(w)\) would be expensive. Instead, we can use an approximation of \(\nabla Q(w)\), \[ \nabla Q(w) \approx \nabla Q_i(w) \textrm{, the gradient of a single example} \] or a mini-batch, \[ \nabla Q(w) \approx \frac{1}{n_{batch}} \sum_{i=1}^{n_{batch}} \nabla Q_i (w) \] This is the idea of stochastic gradient descent.
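
A minimal mini-batch SGD sketch for linear least squares (the data, step size, and batch size are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 10000, 5
    X = rng.normal(size=(N, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.01 * rng.normal(size=N)    # synthetic regression data

    w = np.zeros(d)
    sigma, n_batch = 0.1, 32                      # step size and batch size
    for t in range(2000):
        idx = rng.integers(0, N, size=n_batch)    # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        g = 2.0 / n_batch * Xb.T @ (Xb @ w - yb)  # mini-batch gradient of MSE
        w -= sigma * g
    print(np.linalg.norm(w - w_true))             # small: w has converged near w_true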

    Constrained optimization
    Given a constrained optimization problem, \[\textrm{min } f(x),\ f:\mathbb{R}^n \mapsto \mathbb{R}\] subject to, \[g_i(x) \le 0\] the unconstrained minimum can lie either inside the feasible region or outside it. In the former case, the constraint \(g_i(x)\) plays no role and is referred to as an inactive constraint. In the latter case, the best minimum we can get lies on the boundary of the feasible region, and the constraint is then referred to as an active constraint. In this case, \(\nabla f(x^*)\) and \(\nabla g(x^*)\) are collinear, i.e., \[ \exists \lambda \in \mathbb{R} \textrm{ s.t. }\] \[ \nabla f(x^*) = - \lambda \nabla g(x^*) \] \[ \implies \nabla f(x^*) + \lambda \nabla g(x^*) = 0 \] where \(\lambda\) is known as the Lagrange multiplier, and
    if the constraint is inactive, \(\lambda = 0\)
    if the constraint is active, \(\lambda > 0\)
    In general, the following (KKT conditions) must hold, \[\lambda \ge 0\] \[g(x) \le 0\] \[\lambda g(x) = 0\]
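
In practice one rarely solves the KKT system by hand; solvers such as SciPy's SLSQP handle inequality constraints directly. A sketch on a toy problem whose unconstrained minimum is infeasible, so the constraint is active at the solution:

    import numpy as np
    from scipy.optimize import minimize

    def f(x):                          # unconstrained minimum at (2, 2)
        return (x[0] - 2)**2 + (x[1] - 2)**2

    # SLSQP expects inequalities as fun(x) >= 0, i.e. -g(x) >= 0
    cons = {'type': 'ineq', 'fun': lambda x: 1 - x[0]**2 - x[1]**2}

    res = minimize(f, x0=np.array([0.0, 0.0]), method='SLSQP',
                   constraints=cons)
    print(res.x)                       # ~ (0.707, 0.707), on the boundary
    print(1 - res.x @ res.x)           # ~ 0: the constraint is active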