Linear Algebra


Vectors
First and foremost, disregard the notions of data points, co-ordinates, arrays, direction, magnitude, etc. associated with the term vector and consider the notion of vector spaces. A vector space \((V, +, .)\) is a set endowed with two operations: \[ + \left\{ \begin{array}{lr} (V, V) \rightarrow V \\ (x, y) \rightarrow x+y \end{array} \right. \] and \[ . \left\{ \begin{array}{lr} (\mathbb{R}, V) \rightarrow V \\ (\alpha, x) \rightarrow \alpha x \end{array} \right. \] where \( \alpha \in \mathbb{R} \) is a scalar and the elements \( x,y \in V \) are then called vectors. Formally, a number of axioms need to hold, such as:

  • \( (V,+) \) is an additive group
  • \( \alpha(x+y) = \alpha x + \alpha y \)
  • \( \alpha_1 (\alpha_2 x) = ( \alpha_1 \alpha_2 ) x \)
  • \( (\alpha_1 + \alpha_2) x = \alpha_1 x + \alpha_2 x \)

    Just keep the following in the back of your head: the fundamental vector space operations are addition and scaling. Some examples of vector spaces are:

  • \( (\mathbb{R}, +, .) \)
  • \( (\mathbb{R}^2, +, .) \)
  • \( (\mathbb{R}^3, +, .) \)
  • \( (\mathbb{R}^n, +, .) \)
    although vector spaces are not limited to the real numbers. To remember: anything in which the rules of vector addition and scaling hold is a vector space, and its elements are called vectors.

    Concretely, \( x \in \mathbb{R}^n \) would be a vector such that \[ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \]
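
    For example, a minimal NumPy sketch of these two operations (the particular vectors here are arbitrary):

```python
import numpy as np

# Two vectors in R^3 (arbitrary example values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Vector addition: (V, V) -> V
print(x + y)        # [5. 7. 9.]

# Scalar multiplication: (R, V) -> V
alpha = 2.5
print(alpha * x)    # [2.5 5.  7.5]
```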

    Linear combination
    Now that we have a vector space and can define elements of that vector space, even better, we can linearly combine them to form new vectors belonging to the same vector space. For \( x,y \in V \), we have \( \alpha_1 x + \alpha_2 y \in V \) for any scalars \( \alpha_1, \alpha_2 \in \mathbb{R} \).

    Span
    To sum up what we know so far: given a vector space (with the fundamental operations of addition and scaling), we can take a number of elements of that vector space, a list of vectors so to speak, and linearly combine them to form new vectors.

    The span of a list of vectors is the set of all vectors that can be written as a linear combination of the vectors in that list denoted as \( span(v_1, v_2,...,v_n) \).
    For example,
    span of \( x = \begin{bmatrix} 2 \\ 0 \end{bmatrix} \) and \( y = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \) is \( \mathbb{R}^2 \)

    span of \( x = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \) and \( y = \begin{bmatrix} 3 \\ 0 \end{bmatrix} \) is the x-axis \( \Bigg\{ \begin{bmatrix} a \\ 0 \end{bmatrix}, a \in \mathbb{R} \Bigg\} \), a one-dimensional subspace of \( \mathbb{R}^2 \)
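
    As a numerical sketch (reusing the example vectors above; the target vectors are arbitrary), membership in a span can be checked by solving a least-squares problem and testing whether the result reproduces the target exactly:

```python
import numpy as np

# Columns are the vectors whose span we consider
A = np.column_stack([[2.0, 0.0], [0.0, 1.0]])   # these columns span R^2
b = np.array([3.0, -7.0])

# Solve the least-squares problem A c = b; if A c reproduces b, then b is in the span
c, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(A @ c, b))   # True: b is a linear combination of the columns

# [0, 1] is NOT in the span of [1, 0] and [3, 0]
B = np.column_stack([[1.0, 0.0], [3.0, 0.0]])
d = np.array([0.0, 1.0])
c2, *_ = np.linalg.lstsq(B, d, rcond=None)
print(np.allclose(B @ c2, d))  # False
```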

    Linear independence
    A list of vectors is linearly independent if none of the vectors is in the span of the other vectors (that is to say, none of the vectors in the list can be written as a linear combination of the other vectors in that list). For example,

    \( x = \begin{bmatrix} 2 \\ 0 \\ 0 \end{bmatrix} \), \( y = \begin{bmatrix} 4 \\ 0 \\ 0 \end{bmatrix} \), \( z = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \) are not linearly independent whereas,

    \( x = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \) and \( y = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \) are linearly independent.
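
    A quick numerical check (using the example vectors above) is to stack the vectors as columns of a matrix and compare its rank with the number of columns:

```python
import numpy as np

# x, y, z from the first example: y = 2x, so the list is NOT linearly independent
M1 = np.column_stack([[2.0, 0, 0], [4.0, 0, 0], [0.0, 0, 1]])
print(np.linalg.matrix_rank(M1) == M1.shape[1])   # False

# x, y from the second example: linearly independent
M2 = np.column_stack([[1.0, 0], [0.0, 1]])
print(np.linalg.matrix_rank(M2) == M2.shape[1])   # True
```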


    Spanning list
    A spanning list of a vector space \(V\) is a list of vectors whose span is equal to \(V\). For example,

    \( \Bigg\{ \begin{bmatrix} 2 \\ 0 \end{bmatrix} \), \( \begin{bmatrix} 0 \\ 1 \end{bmatrix} \Bigg\} \) is a spanning list of \( \mathbb{R}^2 \)

    Basis
    A linearly independent list that spans a vector space \(V\) is called a basis of \(V\).
    Any linearly independent list can be extended to form a basis, and any spanning list can be trimmed down to form a basis.

    Dimension
    A vector space can have multiple bases, and all of them necessarily have the same length. That length is called the dimension of the vector space. For example,

    consider \( x = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \), \( y = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \), \( z = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \)

    then \( \{x,y,z\} \) is a basis of \(\mathbb{R}^3\) and its dimension is \(dim(\mathbb{R}^3) = 3\).

    Linear transformation
    Now we start thinking about mapping from one vector space to another via linear functions as transformations. Consider the vector spaces \(V, W\). Then \( L:V \rightarrow W\) is a linear transformation if \(\forall v_1, v_2 \in V\), \( \alpha_1, \alpha_2 \in \mathbb{R} \), it respects the vector space operations: \[ L(\alpha_1 v_1 + \alpha_2 v_2) = \alpha_1 L(v_1) + \alpha_2 L(v_2) \] For example,

    \( L = \left\{ \begin{array}{lr} \mathbb{R}^2 \rightarrow \mathbb{R}^2 \\ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} x_1 \\ 3 x_2 \end{bmatrix} \end{array} \right. \)
    is a linear transformation whereas,

    \( L = \left\{ \begin{array}{lr} \mathbb{R}^2 \rightarrow \mathbb{R}^2 \\ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} x_1 \\ 5 \end{bmatrix} \end{array} \right. \)
    is not a linear transformation.

    A linear transformation is fully specified by its action on a basis of the domain, i.e. by the images of the basis vectors.
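
    A minimal sketch of this idea: record the images of the standard basis vectors as the columns of a matrix, and that matrix reproduces the map everywhere (using the \(x_2 \rightarrow 3 x_2\) example above):

```python
import numpy as np

# The linear map (x1, x2) -> (x1, 3*x2) from the example above
L = lambda v: np.array([v[0], 3.0 * v[1]])

# Images of the standard basis vectors become the columns of the matrix
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
A = np.column_stack([L(e1), L(e2)])

v = np.array([2.0, -5.0])
print(np.allclose(A @ v, L(v)))   # True: the matrix reproduces the map
```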

    Rank
    The rank of a transformation \(L\) is the dimension of its range/image, i.e. the length of a basis of the image. \[ rank(L) = dim(L(V)) \]

    Nullspace/Kernel
    The nullspace or kernel of a linear transformation is the set of vectors mapped to the zero vector. Consider the linear transformation \( L:V \rightarrow W\); then: \[ ker(L) = \{v \in V : L(v) = 0\} \] For example,

    \( L = \left\{ \begin{array}{lr} \mathbb{R}^2 \rightarrow \mathbb{R}^2 \\ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} x_1 \\ 0 \end{bmatrix} \end{array} \right. \)
    \( ker(L) = \Bigg\{ \begin{bmatrix} 0 \\ x_2 \end{bmatrix}, x_2 \in \mathbb{R} \Bigg\} \)

    The nullity of a linear transformation is the dimension of the space of vectors in the domain that map to the zero vector. \[ nullity(L) = dim(ker(L)) \]

    Rank-Nullity theorem
    The rank-nullity theorem says that the rank (the dimension of the image) plus the nullity (the dimension of the kernel) equals the dimension of the domain. Thus, for any linear transformation \( L:V \rightarrow W\): \[ rank(L) + nullity(L) = dim(V) \]
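
    A small numerical check of the theorem, using an arbitrary rank-deficient matrix and estimating the nullity from the number of (numerically) zero singular values:

```python
import numpy as np

# A rank-deficient 3x4 matrix (third column = first + second, fourth column = 0)
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [2.0, 3.0, 5.0, 0.0]])

rank = np.linalg.matrix_rank(A)

# Nullity = number of singular values that are numerically zero
s = np.linalg.svd(A, compute_uv=False)
nullity = A.shape[1] - np.sum(s > 1e-10)

print(rank + nullity == A.shape[1])   # True: rank + nullity = dim of the domain (R^4)
```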

    Matrices as linear transformations
    Linear transformations can be represented as matrices, which store the coefficients of the formula for the entries of \(L\). The matrix-vector product is then defined such that multiplying a vector by a matrix is the same as applying the corresponding linear transformation to the vector. Equivalently, the product is a linear combination of the columns of the matrix with the entries of the vector as weights.
    Consider a matrix \(A = (A_1, A_2,...,A_n) \) such that \(A_i \in \mathbb{R}^m\). Then,

    \( L = \left\{ \begin{array}{lr} \mathbb{R}^n \rightarrow \mathbb{R}^m \\ x \rightarrow Ax \end{array} \right. \)
    is a linear transformation, where \(Ax = x_1 A_1 + x_2 A_2 + ... + x_n A_n\).
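
    A tiny sketch of this column view of the matrix-vector product (arbitrary example values):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # columns A_1, A_2 in R^3
x = np.array([10.0, -1.0])

# Ax as a weighted sum of the columns, with the entries of x as weights
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
print(np.allclose(A @ x, combo))    # True
```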

    Inverse
    Now given a linear transformation, we want to know if it has an inverse. Consider a function \(f: V \rightarrow W \), where \(V\) is the domain of \(f\) and \(W\) its co-domain.
    Then the range of \(f\) is: \[ ran(f) = \{f(v) \mid v \in V\} \subseteq W \]
    Simply put, the range of a function is the set of values that actually get mapped to in the output. For any given matrix, its range is the span of its columns.

    \(f\) is injective if different elements in \(V\) always map to different elements in \(W\), i.e. \( \forall x,y \in V,\ f(x) = f(y) \implies x = y \). For a linear transformation \(L\), injectivity is equivalent to \(ker(L) = \{0\} \).

    \(f\) is surjective if for every element \(y\) in the co-domain \(W\) of \(f\) there is at least one element \(x\) in the domain \(V\) such that \( f(x) = y \). For a linear transformation \(L\), surjectivity is equivalent to \( dim(ran(L)) = rank(L) = dim(W) \).

    Any function which is both injective and surjective is called bijective, which means that \( \forall y \in W, \exists!\, x \in V : y=f(x) \).

    If \(f\) is bijective, then there exists an inverse mapping \( f^{-1} \) such that \( f^{-1}(f(x))=x \). Thus, given a matrix \(A\) representing a linear transformation \(L\), its inverse exists iff \(L\) is bijective. For a linear transformation between vector spaces of the same finite dimension (e.g. a square matrix), the following are equivalent:

  • \(L\) is invertible
  • \(L\) is injective
  • \(L\) is surjective

    If any one of these statements is true, the others hold true as well.

    Matrix inverse
    For any given matrix \(A\), its inverse exists iff \(x \rightarrow Ax \) is bijective. In this case, \(A\) satisfies \( A A^{-1} = A^{-1} A = I\).
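
    For example, a NumPy sketch with an arbitrary invertible matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # invertible: x -> Ax is bijective

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True
print(np.allclose(A_inv @ A, np.eye(2)))   # True

# A singular matrix (linearly dependent columns) has no inverse;
# np.linalg.inv would raise LinAlgError for it.
```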

    Diagonal matrix
    A diagonal matrix \(D\) is a square matrix whose off-diagonal entries are all zero. \(D\) is invertible iff all of its diagonal entries \(d_i \ne 0\).

    Symmetric matrix
    A given matrix A is symmetric if it satisfies \(A = A'\).

    Orthogonality
    A list of vectors is called orthogonal if the vectors are pairwise orthogonal. A list of vectors is called orthonormal if the vectors are pairwise orthogonal and of unit length.

    A transformation \(x \rightarrow Ux \) from \( \mathbb{R}^n \rightarrow \mathbb{R}^n \) is distance-preserving if the norm of \(x\) is the same as the norm of \( Ux\) \(\forall\) \(x\). Using dot products, we can write this distance-preserving condition as \( x \cdot x = (Ux) \cdot (Ux) \implies x'x = x'U'Ux \implies U'U = I \).

    If the transformation in addition preserves angles, then \(x \cdot y\) must also be equal to \((Ux) \cdot (Uy)\) \(\forall\) \(x,y\). Rewriting this equation using transposes, we see that \( x'y = (Ux)'(Uy) = x'U'Uy\), and this holds if \(U'U = I\).

    A matrix with orthonormal columns is called an orthogonal matrix and satisfies \(U'U = I\). Orthogonal matrices preserve length: \( ||Ux||_2^2 = (Ux)'(Ux) = x'U'Ux = x'Ix = x'x = ||x||_2^2 \). Note that a rotation matrix (by an angle other than \(0\) or \(\pi\)) has no real eigenvectors. For orthogonal matrices, \(U' = U^{-1}\).

  • For a symmetric matrix \(A = A'\), eigenvectors corresponding to different eigenvalues must be orthogonal. Because: if \(Av = \lambda v\) and \(Aw = \mu w\) with \(\lambda \ne \mu\), then \(\lambda (v'w) = (Av)'w = v'A'w = v'Aw = \mu (v'w)\), which forces \(v'w = 0\).
  • Orthogonal matrices need not be symmetric (rotations are not), and symmetric matrices need not be orthogonal, but every symmetric matrix is orthogonally diagonalizable.
  • A positive definite matrix A is a symmetric matrix whose eigenvalues are all positive.
  • A positive semi-definite matrix A is a symmetric matrix whose eigenvalues are all nonnegative.
  • Equivalently, a symmetric matrix \(A\) is positive semi-definite if \(x'Ax \ge 0\) \(\forall\) \(x\).
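
    A numerical sketch of these properties: the block below checks \(U'U = I\) and length preservation for a rotation matrix, and positive semi-definiteness of a symmetric matrix via its eigenvalues (all matrices are arbitrary examples):

```python
import numpy as np

# A 2x2 rotation matrix: orthonormal columns, so U'U = I and lengths are preserved
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(U.T @ U, np.eye(2)))                       # True
x = np.array([3.0, -4.0])
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))  # True

# A symmetric positive semi-definite matrix: all eigenvalues are >= 0
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(np.all(np.linalg.eigvalsh(A) >= 0))                    # True
```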

    Matrix decomposition
    The action of a matrix like \( \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} \) is easy to understand because we can think of it as scaling along the x-axis by 3 and along the y-axis by 2. We can think of its inverse as scaling along the x-axis by 1/3 and along the y-axis by 1/2. We can also square such (diagonal) matrices easily.

    Those handy off-diagonal zeros arise from the fact that the matrix/transformation preserves the direction of vectors along the x-axis and y-axis. Vectors with this property are called eigenvectors, and if we can find a basis of eigenvectors of a matrix (an eigenbasis, i.e. a basis consisting entirely of eigenvectors), then we can reason about that matrix as effectively as we can reason about a diagonal matrix (the basis vectors just get scaled in the image).

    We can thus better understand a linear transformation by breaking it down into simpler linear transformations. That is why we often try to decompose a matrix into an equivalent product of simpler matrices: distance- and angle-preserving matrices, and matrices that just scale along a basis of eigenvectors (diagonalization, or eigenvalue decomposition).

    Diagonalizing a matrix \(A\) (finding its eigenvalue decomposition) means writing it in the form \(A = VDV^{-1} \longleftrightarrow AV = VD \), where the columns of \(V\) are linearly independent eigenvectors of \(A\) and \(D\) is diagonal with the corresponding eigenvalues. Recall that \(v\) is an eigenvector of \(A\) if \(Av = \lambda v\).
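
    A short NumPy sketch of diagonalization for an arbitrary diagonalizable matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])           # arbitrary diagonalizable example

eigvals, V = np.linalg.eig(A)        # columns of V are eigenvectors of A
D = np.diag(eigvals)

print(np.allclose(A @ V, V @ D))                # A V = V D
print(np.allclose(A, V @ D @ np.linalg.inv(V))) # A = V D V^{-1}
```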

    Not every matrix has a basis of eigenvectors (for example a rotation matrix, or any matrix that does not have a full set of linearly independent eigenvectors). A matrix which does have a basis of eigenvectors is called diagonalizable. Hence rotation matrices are not diagonalizable over \(\mathbb{R}\) but can be diagonalized over \(\mathbb{C}\).

    Spectral theorem
    The eigenspace decomposition of a diagonalizable matrix is even easier to understand if the eigenvectors happen to be orthogonal. It turns out that this happens exactly when the matrix is symmetric.

    The spectral theorem tells us that if a matrix \(A\) is symmetric then \(A\) is orthogonally diagonalizable (\(A\) has an orthonormal basis of eigenvectors). Conversely, every orthogonally diagonalizable matrix is symmetric. In this case, \(A = VDV' = VDV^{-1}\) since \(V\) is orthogonal.
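
    For a symmetric matrix, NumPy's eigh returns an orthonormal eigenbasis directly, so a quick sketch can verify \(A = VDV'\) with \(V'V = I\) (the matrix is an arbitrary symmetric example):

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])      # symmetric example

eigvals, V = np.linalg.eigh(A)       # V has orthonormal columns (eigenvectors)
D = np.diag(eigvals)

print(np.allclose(V.T @ V, np.eye(3)))   # eigenvectors are orthonormal
print(np.allclose(A, V @ D @ V.T))       # A = V D V'
```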

    Gram matrix
    Although it seems that the spectral theorem may be of limited use since so many matrices are not symmetric, we can associate any rectangular matrix with a symmetric square matrix, apply the spectral theorem to it, and use it to extract insight about the original matrix. For any matrix \(A\), \(A'A\) is always symmetric, which can be proved as: \[(A'A)' = A'(A')' = A'A\] So we can diagonalize \(A'A = VDV^{-1} = VDV'\).

    \(A'A\) is called the Gram matrix of \(A\) and is always positive semi-definite:
    \( v'(A'A)v = (Av)'(Av) = (Av) \cdot (Av) = ||Av||^2 \ge 0 \)

    Note that the rank of \(A\) is equal to the rank of \(A'A\): if \(Ax = 0\), then \(A'Ax = 0\) (multiplying both sides by \(A'\)). Hence the null space of \(A\) is a subset of the null space of \(A'A\). Conversely,
    \(A'Ax = 0 \rightarrow x'A'Ax = 0 \rightarrow ||Ax||^2 = 0 \rightarrow Ax = 0\).
    Since \(A\) and \(A'A\) have null spaces of the same dimension and the same domain \(\mathbb{R}^n\), they also have the same rank by the rank-nullity theorem.
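
    A numerical sketch of these Gram-matrix facts for an arbitrary rectangular matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])           # arbitrary 3x2 example

G = A.T @ A                          # Gram matrix

print(np.allclose(G, G.T))                                     # symmetric
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))                 # positive semi-definite
print(np.linalg.matrix_rank(A) == np.linalg.matrix_rank(G))    # ranks agree
```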

    Polar decomposition
    Define \( \sqrt{A'A} = V\sqrt{D}V' \). Then \(Av\) has the same length as \( \sqrt{A'A}\,v\) \(\forall\) \(v\), i.e. \(||Av||^2 = ||\sqrt{A'A}\,v||^2\), so the map sending \(\sqrt{A'A}\,v\) to \(Av\) preserves lengths. Hence \(A = R\sqrt{A'A}\) for some orthogonal (rotation-like) matrix \(R\). This is called the polar decomposition.
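
    A rough sketch of this construction for an arbitrary invertible matrix: build \( \sqrt{A'A} \) from the eigendecomposition of \(A'A\), then recover the orthogonal factor \(R\):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])           # arbitrary invertible example

# sqrt(A'A) = V sqrt(D) V'
eigvals, V = np.linalg.eigh(A.T @ A)
S = V @ np.diag(np.sqrt(eigvals)) @ V.T

# Orthogonal factor R such that A = R S (works here because S is invertible)
R = A @ np.linalg.inv(S)

print(np.allclose(R.T @ R, np.eye(2)))   # R is orthogonal
print(np.allclose(A, R @ S))             # A = R * sqrt(A'A)
```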

    The polar decomposition tells us that any square matrix is almost the same as some symmetric matrix, and the spectral theorem tells us that a symmetric matrix is almost the same as a simple scaling along the coordinate axes. (In both cases, the phrase "almost the same" disguises a composition with an orthogonal transformation). We should be able to combine these ideas to conclude that any square matrix is basically the same as a simple scaling along the coordinate axes!

    SVD
    \(A = R\sqrt{A'A} = RV\sqrt{D}V' = (RV)\sqrt{D}V' = UEV' \)
    Hence, we do not need special matrices to map one set of orthogonal gridlines to another: every matrix does this. Writing any matrix as an orthogonal matrix, times a diagonal scaling, times another orthogonal matrix in this way is called the singular value decomposition (SVD).

    The diagonal entries of \(E\), which are the square roots of the eigenvalues of \(A'A\), are called the singular values of \(A\). The columns of \(U\) are called left singular vectors, and the columns of \(V\) are called right singular vectors.
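
    A brief NumPy sketch tying this together on an arbitrary matrix: reconstruct \(A\) from its SVD and check that the squared singular values are the eigenvalues of \(A'A\):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])           # arbitrary 3x2 example

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(s) @ Vt))   # A = U E V'

# Squared singular values = eigenvalues of A'A (sorted to compare)
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eigvals))            # True
```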

    Linear dependence is numerically fragile: if the columns of a matrix (with more rows than columns) are linearly dependent, then perturbing the entries slightly by adding tiny independent random numbers is almost certain to result in a matrix with linearly independent columns. However, intuition suggests that subverting the principles of linear algebra in this way is not going to solve any real-world problems that emerge from linear dependence relationships among columns of a matrix.

    This intuition is accurate, and it highlights the utility of having a generalization of the idea of linear independence which can quantify how close a list of vectors is to having linear dependence relationships, rather than remaining within the confines of the binary labels "linearly dependent" or "linearly independent". The singular value decomposition provides such a tool.

    We can conjecture that very small singular values indicate that some columns would need to be removed to obtain a matrix which does not have approximate linear dependence relationships among its columns.
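
    A minimal sketch of this use of singular values: a tiny random perturbation makes exactly dependent columns "independent" and bumps the rank, but the smallest singular value stays tiny and still exposes the near-dependence (the perturbation scale is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Third column is exactly the sum of the first two: rank 2
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0],
              [1.0, 1.0, 2.0]])

# A tiny random perturbation almost surely makes the columns independent...
B = A + 1e-10 * rng.standard_normal(A.shape)
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B))   # 2 3

# ...but the smallest singular value of B stays tiny, exposing the
# approximate linear dependence among its columns.
print(np.linalg.svd(B, compute_uv=False))
```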