Applied Statistics


What and why?
Statistics is the science of changing your mind. Making decisions based on facts (parameters) is hard enough as it is, but sometimes we don’t even have the facts we need. And yet decisions must be made. What we do know (the sample) is different from what we wish we knew (the population). That’s what it means to have uncertainty. Given the outcomes, what are the characteristics of the underlying system, i.e. the data-generating process? We want to quantify this uncertainty so that we can make decisions soundly.

A simple, or rather playful, example: imagine a patch of land with trees, where you are to decide which trees to cut down, say those under the average height of the population of all trees on Earth. The parameter, average tree height, forms the decision criterion. But you do not know its actual value (the average over the population, i.e. all the trees in the world); you only have an estimate of this true parameter. The decision whether to chop down a particular tree depends on this estimate, so we want the estimate to be as close as possible to the true value of the parameter.

Taming uncertainty
In the real world there are various sources of uncertainty. Rolling a die cannot be perfectly modeled due to hundreds of different factors (angle of the hand, force, angle of the die, material of the table, air friction, and so on). Indeed, it is characteristic of the complexity of the real world that the system under study always carries uncertainty, so we are better off having tools to study it. Given a system (a data-generating process), what are the properties of the outcomes?

We define a sample space \(\Omega\) as the set of all possible outcomes of a data-generating process. Strictly speaking, we pair \(\Omega\) with a collection of its subsets satisfying specific closure properties (a \(\sigma\)-algebra), and the pair is then called a measurable space. Points \(\omega\) in the sample space are called sample outcomes.

Events are subsets of outcomes of interest in \(\Omega\). As mentioned, the data-generating process exhibits randomness, and we still want to make sound decisions. Events \(E\) thus become our primary objects of study for working with the data-generating process/sample space.

We study probability by assigning a real number \(P(A)\) to each event \(A\) in \(\Omega\), satisfying:

  • \(P(A) \ge 0\)
  • \(P(\Omega) = 1\)
  • \(P(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i)\) if \(A_1, A_2, \ldots\) are disjoint

We first think about what it is that we’re trying to model or make predictions about, and we describe the model or the events in our sample space. To make decisions, we try to figure out the probability of occurrence of these events, \(P(E)\), a real number in \([0, 1]\), signifying:
  • the relative frequency of occurrence of the event in a long sequence of trials \(\rightarrow\) frequentist interpretation (see the simulation sketch below)
  • the degree of belief that the event occurs \(\rightarrow\) Bayesian interpretation
These are two different schools of thought about the same thing.
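To make the frequentist interpretation concrete, here is a minimal simulation sketch in Python (numpy assumed; the event and the sample sizes are arbitrary illustrative choices). The relative frequency of the event "die shows at most 4" settles near its probability \(4/6\) as the number of rolls grows.

```python
import numpy as np

# Track the relative frequency of the event E = {die shows <= 4},
# whose probability is 4/6, across growing numbers of rolls.
rng = np.random.default_rng(seed=0)
for n_rolls in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n_rolls)  # outcomes in {1, ..., 6}
    rel_freq = np.mean(rolls <= 4)
    print(f"n = {n_rolls:>9}: relative frequency = {rel_freq:.4f} (P(E) = {4/6:.4f})")
```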

Now the system we model, and thus our decision, may depend on not just one but multiple events belonging to a particular sample space. We can thus ask ourselves what different kinds of events exist and in what ways we can combine them.

    Random variables
Random variables are functions that map outcomes of a random process to numbers. We can choose any such mapping. This enables us to do algebra with probabilities and to represent the probabilities of all events in \(\Omega\) graphically in terms of probability distributions. For example (see the code sketch after this list),

  • Consider a coin flip, \(\Omega = \{H, T\}\). Then \(X\) can be the outcome of the flip, say \(X(H) = 1\) and \(X(T) = 0\).
  • Consider two consecutive coin flips, \(\Omega = \{(H,H), (H,T), (T,H), (T,T)\}\). Then \(X\) can be the number of heads in the outcome.
  • Consider a roll of a die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\). Then \(X\) can be the number rolled, and \(\{X \le 4\}\) is an event.
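Here is a small sketch (standard library only) of a random variable as a plain function from sample outcomes to numbers, using the two-flip example from the list above:

```python
import itertools

# Sample space for two coin flips, and a random variable X that maps
# each outcome to the number of heads it contains.
omega = list(itertools.product("HT", repeat=2))    # [('H','H'), ('H','T'), ...]
X = {outcome: outcome.count("H") for outcome in omega}

for outcome, value in X.items():
    print(outcome, "->", value)

# The event {X = 1} is just the set of outcomes mapped to 1:
print([o for o, v in X.items() if v == 1])         # [('H','T'), ('T','H')]
```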

    Probability distribution
The probability distribution of a random variable \(X\) summarizes the probabilities of the outcomes of interest over the whole sample space. A nice way to think about distributions is that they represent the shape of the underlying data-generating process, scaled so that the total probability is 1.

    Discrete probability distribution
    \( pmf(x_i) = P(X = x_i) \)
\( \sum_{i} pmf(x_i) = 1, \quad i = 1, 2, 3, \ldots \)
    For example: Bernoulli, Poisson, Uniform
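As a quick sketch of a pmf (scipy assumed; the Poisson rate \(\mu = 3\) is an arbitrary illustrative value): each support point gets a probability, and the probabilities sum to 1.

```python
from scipy.stats import poisson

mu = 3  # illustrative rate parameter
for k in range(6):
    print(f"P(X = {k}) = {poisson.pmf(k, mu):.4f}")

# The pmf sums to 1 over the countable support; a long prefix of the
# support already captures essentially all of the mass.
print(sum(poisson.pmf(k, mu) for k in range(100)))  # ~1.0
```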

    Continuous probability distribution
\( \int_{a}^b pdf(x) \, dx = P(a \lt X \lt b) \)
\( \int_{-\infty}^\infty pdf(x) \, dx = 1 \)
\( P(X = x) = 0 \) for any single value \(x\)
    For example: Normal, Exponential, Gamma, Uniform
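A matching sketch for the continuous case (scipy assumed; the standard normal and the interval \([-1, 1]\) are illustrative choices): probabilities are areas under the pdf, the total area is 1, and single points carry no mass.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# P(a < X < b) is the area under the pdf between a and b, for X ~ N(0, 1).
a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)
print(f"P({a} < X < {b}) = {area:.4f}")   # ~0.6827

# The pdf integrates to 1 over the whole real line.
total, _ = quad(norm.pdf, -np.inf, np.inf)
print(f"total area = {total:.4f}")        # 1.0
```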

    Cumulative distribution function (CDF)
Used to evaluate accumulated probability, it is defined as the area under the probability distribution to the left of a point of interest.
    \( cdf(x) = P(X \le x) \)
    \( 0 \le cdf(x) \le 1 \)

In the case of a discrete domain, \(cdf(x_{i+1}) = cdf(x_i) + pmf(x_{i+1})\), a non-decreasing step function.

In the case of a continuous domain, \(cdf(x) = \int_{-\infty}^x pdf(t) \, dt\), a non-decreasing continuous function. In this case, \(P(a \lt X \lt b) = cdf(b) - cdf(a)\).
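A short sketch of both cases (numpy/scipy assumed; the fair die and \(N(0,1)\) are illustrative): the continuous cdf difference reproduces an interval probability, and the discrete cdf is a running sum of the pmf.

```python
import numpy as np
from scipy.stats import norm

# Continuous case: P(a < X < b) = cdf(b) - cdf(a), for X ~ N(0, 1).
a, b = -1.0, 1.0
print(norm.cdf(b) - norm.cdf(a))   # ~0.6827, matches the pdf integral

# Discrete case: the cdf is the cumulative sum of the pmf,
# a non-decreasing step function. Illustrated with a fair die.
pmf = np.full(6, 1 / 6)
cdf = np.cumsum(pmf)
print(cdf)                         # [0.167 0.333 0.5 0.667 0.833 1.0]
```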

    Quantiles
Consider any distribution of \(X\), say Normal. We know that \( \int_{-\infty}^\infty pdf(x) \, dx = 1 \). Say you want to divide this area under the distribution into 4 equal parts. Each part would be such that the probability of \(X\) lying in any of the 4 parts is equal, i.e. each region holds 25% of the total probability.

Let us denote the values \(X = x_i\) that make the split happen as:
\(x_1 \rightarrow\) the 0.25 quantile
\(x_2 \rightarrow\) the 0.5 quantile
\(x_3 \rightarrow\) the 0.75 quantile
Hence, \(cdf(x_1) = 0.25 \implies x_1 = cdf^{-1}(0.25) \)
Similarly, \(cdf(x_2) = 0.5 \implies x_2 = cdf^{-1}(0.5) \)

In general, the \(i/n\) quantile is \[ x_{i/n} = cdf^{-1}(i/n) \] Note that the 0.5 quantile is just the median of the distribution.
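In scipy the inverse cdf is exposed as `ppf` (the "percent point function"); a minimal sketch computing the quartiles of a standard normal (an illustrative choice):

```python
from scipy.stats import norm

# The quartiles split the total probability into four regions of 25% each.
for p in [0.25, 0.5, 0.75]:
    x = norm.ppf(p)                # x = cdf^{-1}(p)
    print(f"{p} quantile: x = {x:.4f}, check cdf(x) = {norm.cdf(x):.2f}")

# The 0.5 quantile is the median: 0 for the symmetric N(0, 1).
```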

    Expectation
Ask yourself: where in the range of \(pdf(x)\) is the average value expected to be? The expectation or expected value \(E(X)\) is a measure of location, or central tendency.

Let \(X\) be a continuous random variable with range \([a,b]\). Then the expected value of \(X\) is defined as \[E(X) = \int_{a}^b x \cdot pdf(x) \, dx\] For a discrete random variable, \[E(X) = \sum_{i=1}^n x_i \cdot pmf(x_i) \] For example, let \(X\) have range \([0,2]\) and density \(\frac{3}{8} x^2 \).
\(E(X) = \int_{0}^2 x \cdot pdf(x) \, dx = \int_{0}^2 \frac{3}{8} x^3 \, dx = \frac{3}{8} \Bigg[\frac{x^4}{4} \Bigg]_{0}^2 = \frac{3}{2} \)
Note that the mean lies in the right half of the range. Since the probability density increases quadratically as \(x\) increases over the range, the average value of \(X\) is expected to lie toward the right of the range.
As another example, consider \(X \sim N(0,1)\). By symmetry, we can infer that the expected value must be zero!
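A numeric check of the worked example above (a sketch using scipy's `quad` on the same density):

```python
from scipy.integrate import quad

# Density from the example: pdf(x) = (3/8) x^2 on [0, 2].
pdf = lambda x: 3 / 8 * x**2

mass, _ = quad(pdf, 0, 2)                  # total probability, sanity check
ex, _ = quad(lambda x: x * pdf(x), 0, 2)   # E(X) = integral of x * pdf(x)
print(mass)  # 1.0
print(ex)    # 1.5
```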

    Variance
Let \(X\) be a continuous random variable with expectation \(\mu = E(X)\). The variance of \(X\) is defined as: \[ Var(X) = E((X - \mu)^2) = E((X - E(X))^2) \]
Expanding the square and using linearity of expectation also gives \(Var(X) = E(X^2) - E(X)^2 \).
    Simply put, variance is a measure of spread of the data.
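A sketch checking that the two variance formulas agree, reusing the density \(\frac{3}{8} x^2\) on \([0,2]\) from the expectation example:

```python
from scipy.integrate import quad

pdf = lambda x: 3 / 8 * x**2  # density from the expectation example

ex, _ = quad(lambda x: x * pdf(x), 0, 2)              # E(X)   = 1.5
ex2, _ = quad(lambda x: x**2 * pdf(x), 0, 2)          # E(X^2) = 2.4
var_direct, _ = quad(lambda x: (x - ex)**2 * pdf(x), 0, 2)

print(var_direct)   # E((X - E(X))^2) = 0.15
print(ex2 - ex**2)  # E(X^2) - E(X)^2 = 0.15
```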

    Statistical inference
Now, given the outcomes (sample data), what can we say about the process that generated the data? We use the data to infer, or rather approximate, the underlying distribution that generated the data via:

  • parameter estimation: what parameter values possibly shape the underlying distribution?
  • hypothesis testing: is the assumed law for the underlying distribution correct or should we change it?

    The only information we have is the sample! From the sample, statistical procedures can infer the likely properties of the population.

    Parameter estimation
Given an underlying unknown distribution, we want to know its nature, i.e. the law, and the parameters that shape the distribution assuming a particular law. To find estimates of the parameters (since that’s the best we can do!), we define estimators.
An estimator \(\widehat{\theta}\) of an underlying true parameter \(\theta\) is a function of the data. Note that \(\theta\) is a fixed quantity whereas \(\widehat{\theta}\) is a random quantity with the notion of an expected value, and its distribution is called the sampling distribution. We can define any estimator but, of course, some are more useful than others. We usually care about estimator functions that map the sample to quantities like the mean, variance, and standard deviation of the data, because these are usually enough to completely define a distribution.

    Next we want to know how to define these estimator functions \(\widehat{\theta}\) and once we have them, how can we assess their quality?
Q. How do we define an estimator function \(\widehat{\theta}\)?
Ans. We have two methods:
    1. Method of moments
    2. Method of maximum likelihood

Q. How do we assess the quality of a particular estimator function \(\widehat{\theta}\)?
Ans. We commonly use:
    1. Bias
    2. Mean-squared error

    Method of moments
So, we are trying to arrive at a formal procedure to compute the model parameters from the observed data. Method of moments is one such procedure. In general, it can generate estimator functions for any parameter of any distribution. It is simple to compute in some cases but becomes tedious in others. Moreover, it is non-optimal and is usually only used to get starting values for other methods. Given a pdf \(f(x)\), we define the \(i^{th}\) theoretical moment as: \[ \mu_i = \int x^i f(x) \, dx \] Note that the first theoretical moment is just the mean. In other words, the expected value has a physical interpretation as the center of mass (the first moment in statics).
Now assume we observe an i.i.d. sample \(X_1, X_2, \ldots, X_n\) from an unknown distribution \(f(x|\theta)\). The method of moments then equates the \(i^{th}\) theoretical moment to the \(i^{th}\) sample moment \(\widehat{\mu_i} = \frac{1}{n}\sum_{j=1}^n X_j^i\), i.e. \[\mu_i(\theta) = \widehat{\mu_i}\] and solves for \(\theta\). That’s all there is. Again, simple to use in some cases, useless in others. Note that with \(d\) parameters, we would solve \(d\) such equations.
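A minimal method-of-moments sketch (numpy assumed; the normal model, true parameter values, and sample size are illustrative). For \(N(\mu, \sigma^2)\), the first two theoretical moments are \(\mu_1 = \mu\) and \(\mu_2 = \sigma^2 + \mu^2\), so matching them to the sample moments gives \(\widehat{\mu} = \widehat{\mu_1}\) and \(\widehat{\sigma}^2 = \widehat{\mu_2} - \widehat{\mu_1}^2\).

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # true mu = 5, sigma = 2

# Sample moments: the i-th sample moment is the mean of X^i.
m1 = np.mean(x)
m2 = np.mean(x**2)

# Solve the two moment equations for the parameters.
mu_hat = m1
sigma2_hat = m2 - m1**2

print(mu_hat, np.sqrt(sigma2_hat))  # close to 5 and 2
```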

    Method of maximum likelihood
The maximum likelihood estimates of the parameters \(\theta_1, \ldots, \theta_n\) are the values \(\widehat{\theta_1}, \ldots, \widehat{\theta_n}\) that maximize the likelihood function.

So there is an underlying true distribution \(X_1, X_2, \ldots, X_n \sim P(X|\theta)\). We might know, or can assume, the nature of this underlying distribution in terms of some standard distribution we can deal with. We ask ourselves: given some value of \(\widehat{\theta}\), what is the probability of seeing the data \(X\)? In other words, we want to maximize \(P(X|\widehat{\theta})\). The \(\widehat{\theta}\) that does so is the maximum likelihood estimate. \[ P(X_1|\widehat{\theta})P(X_2|\widehat{\theta})P(X_3|\widehat{\theta})\cdots P(X_n|\widehat{\theta}) = \prod_{i=1}^n P(X_i|\widehat{\theta}) = L(\widehat{\theta})\] Essentially, the likelihood is high if more and more of the \(X_i\) lie where the distribution under the considered \(\widehat{\theta}\) places high probability. Naturally, the product \(\prod_{i=1}^n P(X_i|\widehat{\theta})\) would typically be highest for values of \(\widehat{\theta}\) close to the underlying true value \(\theta\). Hence the idea of maximizing the likelihood.

Now, when maximizing the likelihood function, it is often easier to maximize the log of the likelihood function. Taking the log converts the product into a sum. Since the log is a monotonic function, its maximum must occur at the same value of \(\widehat{\theta}\) as that of the likelihood function. \[ \max_{\widehat{\theta}} \log(L(\widehat{\theta}))\] \[=\min_{\widehat{\theta}} \Bigg[-\log(L(\widehat{\theta})) \Bigg] \] \[=\min_{\widehat{\theta}} \Bigg[-\sum_{i=1}^n \log P(X_i|\widehat{\theta})\Bigg]\]
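A minimal numeric MLE sketch (scipy assumed; the exponential model, true rate, and sample size are illustrative choices): minimize the negative log-likelihood and compare with the known closed-form MLE \(1/\bar{X}\) for the exponential rate.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import expon

rng = np.random.default_rng(seed=0)
x = rng.exponential(scale=1 / 2.5, size=5_000)  # true rate = 2.5

def neg_log_lik(rate):
    # -log L(rate) = -sum_i log P(X_i | rate)
    return -np.sum(expon.logpdf(x, scale=1 / rate))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100), method="bounded")
print(res.x, 1 / np.mean(x))  # numeric MLE agrees with the closed form
```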

    In general, one can show that maximum likelihood estimators:

  • have bias that goes to zero for large sample sizes
  • have minimum variance for large sample sizes
  • often have approximately normal distributions for large sample sizes

    Confidence intervals
Confidence intervals provide an alternative to reporting only a point estimate \(\widehat{\theta}\) when we wish to estimate an unknown parameter \(\theta\).

What does “95% confidence” indicate? Imagine that we collect 100 random samples from a population. We’d end up calculating 100 different confidence intervals for the mean. If we set the confidence level at 95%, we’d expect about 95 out of the 100 confidence intervals to contain the actual population parameter.
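This statement can be checked by simulation; here is a sketch (numpy/scipy assumed; it uses the t-based interval constructed in the pivot discussion below, and all values are illustrative):

```python
import numpy as np
from scipy.stats import t

# Draw many samples from the same population, build a t-interval from
# each, and count how often the interval covers the true mean.
rng = np.random.default_rng(seed=0)
true_mu, sigma, n, alpha = 10.0, 3.0, 30, 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)

n_reps = 1_000
covered = 0
for _ in range(n_reps):
    x = rng.normal(true_mu, sigma, size=n)
    half_width = t_crit * np.std(x, ddof=1) / np.sqrt(n)
    if abs(np.mean(x) - true_mu) <= half_width:
        covered += 1

print(f"coverage = {covered / n_reps:.1%}")  # close to 95%
```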

    We can find an interval [A,B] that we think has high probability of containing \(\theta\). The length of such an interval gives us an idea of how closely we can estimate \(\theta\).

    We can find the mathematical formula for the sampling distribution of the estimator. We can then make use of this sampling distribution to get an exact confidence interval for the parameter.

Now, a pivot is a function of the data and the parameter whose distribution does not depend on the parameter. It is not a statistic, since it contains the unknown parameter (recall that a statistic is a function of the sample data alone and cannot contain the unknown parameter). In constructing a CI, we manipulate the pivot to get an interval around the unknown parameter.

    Finally, to construct an exact CI:
1. Find a pivot \(h(X_1, \ldots, X_n, \theta)\)
2. Using the distribution of \(h(X_1, \ldots, X_n, \theta)\), let \(a, b\) be values such that \(h\) lies between \(a\) and \(b\) with probability \(1-\alpha\), i.e. \( P(a \le h \le b) = 1-\alpha \). Then, \[ P(a \le h(X_1, \ldots, X_n, \theta) \le b) = 1-\alpha \] and rearranging the inequality for \(\theta\) gives \[ P(LB(X_1, \ldots, X_n) \le \theta \le UB(X_1, \ldots, X_n)) = 1-\alpha \]
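As a concrete sketch (numpy/scipy assumed; the simulated data are illustrative), the classic pivot for a normal mean is \((\bar{X} - \mu) / (S / \sqrt{n}) \sim t_{n-1}\), whose distribution is free of \(\mu\); rearranging it gives the usual exact CI:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=10.0, scale=3.0, size=30)  # stand-in for observed data
n, alpha = len(x), 0.05

xbar = np.mean(x)
s = np.std(x, ddof=1)                    # sample standard deviation
t_crit = t.ppf(1 - alpha / 2, df=n - 1)  # quantile of the pivot's distribution
half_width = t_crit * s / np.sqrt(n)

print(f"{1 - alpha:.0%} CI for mu: [{xbar - half_width:.2f}, {xbar + half_width:.2f}]")
```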

    Bootstrap
For simple cases, we can obtain the sampling distribution, and from it the confidence intervals for the unknown parameters. In practice, only standard distributions allow us to perform these tasks with the methods introduced above. Usually, the underlying distribution model is very complicated or even unknown, and this makes it impossible to obtain the sampling distribution analytically.

Sampling distributions and confidence intervals are the properties of an estimator we are after. In these situations, we can use computer simulations to approximate the sampling distribution and the CI. This is referred to as bootstrapping.

The basic idea of the bootstrap is that the sample we have is the best guess we have as to the shape of the population from which the sample was taken. Therefore, instead of assuming a typical mathematical shape for the population, we instead use the shape of the sample.

From a single sample, only one value of a statistic like the mean can be obtained. In order to reason about the underlying population, we need some sense of the variability of the mean we have computed.

Using thousands of resamples (drawn with replacement from the original sample), we compute the statistic for each resample and get a histogram of the bootstrap statistic. This provides an estimate of the shape of the distribution of the statistic/estimator, from which we can answer questions about how much it varies via a bootstrap-based CI!
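A minimal percentile-bootstrap sketch (numpy only; the data, the statistic, and the number of resamples are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.exponential(scale=5.0, size=200)  # stand-in for an observed sample

# Resample the data with replacement many times and compute the
# statistic (here, the mean) on each resample.
n_boot = 10_000
boot_means = np.array([
    np.mean(rng.choice(data, size=len(data), replace=True))
    for _ in range(n_boot)
])

# Quantiles of the bootstrap statistic give a percentile CI.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {np.mean(data):.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```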

    Hypothesis testing
The statistical tools we have worked towards so far address inference problems regarding a parameter, which can be estimated from the sample data either by a single number (a point estimate) or by an entire interval of plausible values (a confidence interval). But often, in applications, the objective is not to estimate a parameter but to decide which of two contradictory claims about the unknown parameter is correct. This is the realm of hypothesis testing.

We are interested in the parameter \(\theta\) of the underlying probability distribution, which is unknown. We only know that it belongs to a certain parameter space \(\Theta\), which can be partitioned into two disjoint sets \(\Theta_0\) and \(\Theta_1\). We must decide whether \(\theta\) lies in \(\Theta_0\) or in \(\Theta_1\), i.e. \[ H_0: \theta \in \Theta_0 \] or \[ H_1: \theta \in \Theta_1 \] Exactly one hypothesis is true.
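As a concrete sketch (not from the original text; scipy assumed and all values illustrative), deciding between \(H_0: \mu = 10\) and \(H_1: \mu \ne 10\) for a normal mean with a one-sample t-test:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=10.8, scale=2.0, size=40)  # simulated data

# H0: mu = 10  vs.  H1: mu != 10.
result = ttest_1samp(x, popmean=10.0)
print(f"t = {result.statistic:.2f}, p-value = {result.pvalue:.4f}")
# A small p-value is evidence against H0, i.e. against theta in Theta_0.
```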