Knowledge Dump

Notes (Statistics)

Here, a collection of basic theorems and definitions used in statistics is listed, without going too deeply into any specific subtopic. From time to time, more will be added here.


Definitions and Properties

Stochastic convergence
Let $\{X_n\}_{n\in\mathbb N}$ be a sequence of real-valued random variables with cumulative distribution functions $F_n$, and let $X$ be a real-valued random variable with distribution function $F$. Several forms of convergence of $X_n$ to $X$ can be distinguished:
  • The sequence is said to converge to $X$ in distribution, if $\underset{n\to\infty}{\lim}F_n(x)=F(x)$ holds at every continuity point $x$ of $F$.
    Notation: $X_n\overset{d}{\longrightarrow}X$.
  • If we have $\underset{n\to\infty}{\lim}\mathbb P(|X_n-X|\gt\epsilon)=0$ for all $\epsilon>0$, the sequence converges in probability to $X$.
    Notation: $X_n\overset{\mathbb P}{\longrightarrow}X$.
  • If $\mathbb P\left(\underset{n\to\infty}{\lim}X_n=X\right)=\mathbb P\left(\left\{\omega\in\Omega:\underset{n\to\infty}{\lim}X_n(\omega)=X(\omega)\right\}\right)=1$, where $\Omega$ denotes the sample space, the sequence converges almost surely to $X$.
    Notation: $X_n\overset{a.s.}{\longrightarrow}X$.
  • If $\underset{n\to\infty}{\lim}\mathbb E\left[|X_n-X|^r\right]=0$ holds, $X_n$ is said to converge in the $r$-th mean. Notation: $X_n\overset{L^r}{\longrightarrow}X$.
Using this notation, we summarize some of the most important properties of the different forms of convergence:
  • $X_n\overset{a.s.}{\longrightarrow}X~~\Rightarrow~~X_n\overset{\mathbb P}{\longrightarrow}X~~\Rightarrow~~X_n\overset{d}{\longrightarrow}X $.
  • For $1\le r \lt s$: $X_n\overset{L^s}{\longrightarrow}X~~\Rightarrow~~X_n\overset{L^r}{\longrightarrow}X~~\Rightarrow~~X_n\overset{\mathbb P}{\longrightarrow}X$.
  • If $X_n\overset{a.s.}{\longrightarrow}X$ and $|X_n|\lt Y$ hold for a random variable $Y$ with finite mean, we have $X_n\overset{L^1}{\longrightarrow}X$.
  • (Slutsky's theorem): If $X_n\overset{d}{\longrightarrow}X$ and $Y_n\overset{\mathbb P}{\longrightarrow}c$ hold for a constant $c$, then: \begin{align}X_n+Y_n&\overset{d}{\longrightarrow}X+c,&&\\ X_nY_n&\overset{d}{\longrightarrow}cX&&\text{and}\\ X_n/Y_n&\overset{d}{\longrightarrow}X/c,&&\text{for }c\ne0.\end{align}
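A typical application of Slutsky's theorem combines it with the central limit theorem (a short sketch, assuming an $iid$ sample with mean $\mu$ and finite, nonzero variance $\sigma^2$, and writing $S_n$ for the sample standard deviation, which converges in probability to $\sigma$): $$\sqrt n\,\frac{\overline X_n-\mu}{S_n}=\underbrace{\sqrt n\,\frac{\overline X_n-\mu}{\sigma}}_{\overset{d}{\longrightarrow}\mathcal N(0,1)}\cdot\underbrace{\frac{\sigma}{S_n}}_{\overset{\mathbb P}{\longrightarrow}1}\overset{d}{\longrightarrow}\mathcal N(0,1),$$ which justifies the usual normal approximation for the studentized mean.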
Landau (big-O) notation
Let $\{X_n\}$ be a sequence of random variables and $\{a_n\}$ a sequence of positive constants. Then, the notation is:
  • Small O:
    $X_n=o_{\mathbb P}(a_n)~~\Leftrightarrow~~\frac{X_n}{a_n}=o_{\mathbb P}(1)~~\Leftrightarrow~~\frac{X_n}{a_n}\overset{\mathbb P}{\longrightarrow}0~~\Leftrightarrow~~\forall\epsilon,\delta\gt0,~\exists N_{\epsilon,\delta}\gt0, \text{ s.t. } \mathbb P\left(\left|\frac{X_n}{a_n}\right|\ge\delta\right)\le\epsilon,~\forall n\gt N_{\epsilon,\delta}.$
  • Big O:
    $X_n=O_{\mathbb P}(a_n)~~\Leftrightarrow~~ \forall\epsilon\gt0,~\exists N_\epsilon,\delta_\epsilon\gt0, \text{ s.t. } \mathbb P\left(\left|\frac{X_n}{a_n}\right|\ge\delta_\epsilon\right)\le\epsilon,~\forall n\gt N_\epsilon.$
The small O notation describes the stronger property: $X_n=o_{\mathbb P}(a_n)$ implies $X_n=O_{\mathbb P}(a_n)$.
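As a short worked example (assuming an $iid$ sample with mean $\mu$ and finite variance $\sigma^2$), the sample mean satisfies $\overline X_n-\mu=O_{\mathbb P}(n^{-1/2})$: by Chebyshev's inequality, $$\mathbb P\left(\left|\frac{\overline X_n-\mu}{n^{-1/2}}\right|\ge\delta\right)=\mathbb P\left(|\overline X_n-\mu|\ge\frac\delta{\sqrt n}\right)\le\frac{\sigma^2/n}{\delta^2/n}=\frac{\sigma^2}{\delta^2},$$ which is at most $\epsilon$ for the choice $\delta_\epsilon=\sigma/\sqrt\epsilon$.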
Order Statistics
Let $X_1,\dots,X_n$ be an $iid$ sample of random variables with cumulative distribution function $F$. Arranging them in increasing order, we get the order statistics $$X_{1:n}\le\dots\le X_{n:n}.$$ The indices in this notation purposely include the size of the sample $n$, as it makes the following definitions of asymptotic order statistics clearer.
Since the sample is independent, we can directly deduce the distribution functions of the smallest and largest sample values, $X_{1:n}$ and $X_{n:n}$: $$\mathbb P(X_{1:n}\le x)=1-\mathbb P(X_{1:n}\gt x)=1-\mathbb P(X_1\gt x,\dots,X_n\gt x)=1-(1-F(x))^n$$ and similarly $$\mathbb P(X_{n:n}\le x)=\mathbb P(X_1\le x,\dots,X_n\le x)=F(x)^n.$$ More generally, the distribution function and density function $f_{X_{k:n}}$ (if it exists) of the $k$-th order statistic are given by $$\mathbb P(X_{k:n}\le x)=\sum_{m=k}^n\binom{n}{m}F(x)^m(1-F(x))^{n-m}$$ and $$f_{X_{k:n}}(x)=\binom{n}{k}kf(x)F(x)^{k-1}(1-F(x))^{n-k}.$$ Furthermore, if $F$ is absolutely continuous with density function $f$, the order statistics' joint density function is $$f_{X_{1:n},\dots,X_{n:n}}(t_1,\dots,t_n)=\begin{cases}n!\cdot f(t_1)\cdots f(t_n),&\text{ for $t_1\le~\dots~\le t_n,$}\\0,&\text{ else.}\end{cases}$$ A small simulation check of the distribution function formula is sketched after the following list. Besides the order statistics for a fixed sample size $n$, one can look into the asymptotic case, distinguishing three types of asymptotic order statistics:
  • $X_{n-k:n}$ for a fixed $k\in\mathbb N_0$ and $n\rightarrow\infty$ are called extreme order statistics.
  • $X_{n-k:n}$ with $k=k(n)\rightarrow\infty,n\rightarrow\infty$ and $\frac kn \rightarrow 0$ are called intermediate order statistics.
  • $X_{n-k:n}$ with $k(n),n$ as above but $\frac kn \rightarrow c\in(0,1)$ are called central order statistics.
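The distribution function formula for $X_{k:n}$ can be checked by a small Monte Carlo simulation. The following is a minimal sketch, assuming a Uniform$(0,1)$ sample (so $F(x)=x$) and the availability of numpy and scipy; the parameter choices are illustrative.
```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, k, reps = 10, 3, 100_000            # sample size, order index, Monte Carlo repetitions

# Simulate the k-th order statistic of n iid Uniform(0,1) variables.
samples = rng.uniform(size=(reps, n))
kth = np.sort(samples, axis=1)[:, k - 1]              # X_{k:n} for each repetition

# P(X_{k:n} <= x) = sum_{m=k}^n C(n,m) F(x)^m (1-F(x))^{n-m} = P(Bin(n, F(x)) >= k);
# for Uniform(0,1) we have F(x) = x.
for x in (0.1, 0.3, 0.5):
    empirical = np.mean(kth <= x)
    formula = binom.sf(k - 1, n, x)                    # P(Bin(n, x) >= k)
    print(f"x={x:.1f}  empirical={empirical:.4f}  formula={formula:.4f}")
```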
Extreme value distributions
Let $X_1,X_2,\dots$ be an $iid$ sample with non-degenerate distribution function $F$. The cumulative distribution function $G$ is called an extreme value distribution, if there are real sequences $a_n\gt0,b_n$, s.t. we have $$\frac{X_{n:n}-b_n}{a_n}\underset{n\to\infty}{\overset{d}{\longrightarrow}} G,$$ i.e. if the (rescaled) sample maximum converges in distribution to it. The set of extreme value distributions is classified in the Fisher-Tippett-Gnedenko theorem.
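As a short worked example, take a standard exponential sample, i.e. $F(x)=1-e^{-x}$ for $x\ge0$. With $a_n=1$ and $b_n=\log n$ we get, for every fixed $x$, $$\mathbb P\left(X_{n:n}-\log n\le x\right)=F(x+\log n)^n=\left(1-\frac{e^{-x}}n\right)^n\underset{n\to\infty}{\longrightarrow}\exp\left(-e^{-x}\right),$$ so the limit is the Gumbel distribution appearing in the Fisher-Tippett-Gnedenko theorem below.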
Kurtosis
The kurtosis of a probability distribution can be used to describe its shape, particularly the shape of the tails. The standard definition goes back to Pearson and takes the form $$\kappa (X) = \frac{\mathbb E\left[(X-\mu)^4\right]}{\left(\mathbb E\left[(X-\mu)^2\right]\right)^2}.$$ If the distribution is unknown, it can be estimated by $$\hat\kappa(X)=\frac{\frac 1n\sum_{i=1}^n\left(X_i-\bar X\right)^4}{\left(\frac 1n\sum_{i=1}^n(X_i-\bar X)^2\right)^2},$$ where $\overline X$ denotes the sample mean.
The benchmark value for kurtosis is given by the normal distribution, which has a kurtosis of $3$. If the kurtosis of a distribution is smaller than $3$, it is said to be platykurtic, meaning that its tails are lighter than those of the normal distribution. For values larger than $3$, a distribution is called leptokurtic, implying that it has heavier tails than the normal distribution.
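The estimator $\hat\kappa$ is easy to compute directly. The following minimal sketch (assuming numpy and scipy; the Laplace and uniform samples are illustrative choices with known kurtosis $6$ and $1.8$, respectively) compares it with scipy's implementation.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def kurtosis_hat(x):
    """Moment estimator from above: fourth central sample moment over squared second."""
    xbar = x.mean()
    m2 = np.mean((x - xbar) ** 2)
    return np.mean((x - xbar) ** 4) / m2 ** 2

lap = rng.laplace(size=100_000)     # Laplace: kurtosis 6 (leptokurtic)
uni = rng.uniform(size=100_000)     # Uniform: kurtosis 1.8 (platykurtic)

print(kurtosis_hat(lap), stats.kurtosis(lap, fisher=False))   # both roughly 6
print(kurtosis_hat(uni), stats.kurtosis(uni, fisher=False))   # both roughly 1.8
```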
Skewness
The skewness is used as a measure of asymmetry of a distribution. Usually, Pearson's specification is used, which is given by $$\gamma=\frac{\mathbb E\left[(X-\mu)^3\right]}{\left(\mathbb E\left[(X-\mu)^2\right]\right)^{3/2}}$$ and can be estimated by $$\hat\gamma(X)=\frac{\frac 1n\sum_{i=1}^n\left(X_i-\bar X\right)^3}{\left(\frac 1n\sum_{i=1}^n(X_i-\bar X)^2\right)^{3/2}},$$ with $\overline X$ denoting the sample mean. A negatively/left-skewed distribution has its curve leaning to the right, forming a longer left tail. For a positively/right-skewed distribution, the curve leans to the left and has a longer right tail.
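Analogously, $\hat\gamma$ can be computed and compared with scipy (a minimal sketch; the Exp(1) sample is an illustrative choice with known skewness $2$):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def skewness_hat(x):
    """Moment estimator from above: third central sample moment over m2^(3/2)."""
    xbar = x.mean()
    m2 = np.mean((x - xbar) ** 2)
    return np.mean((x - xbar) ** 3) / m2 ** 1.5

expo = rng.exponential(size=100_000)     # Exp(1): skewness 2 (right-skewed, long right tail)

print(skewness_hat(expo), stats.skew(expo))   # both roughly 2
print(skewness_hat(-expo))                    # roughly -2: the mirrored sample is left-skewed
```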
Monotone random variable transformations
Let $X$ be a random variable with density function $f_X$ and $g$ a strictly monotone, continuously differentiable function. Then, the density function $f_{g(X)}$ of $g(X)$ is given by $$f_{g(X)}(x)=\left|\frac d{dx}(g^{-1}(x))\right|\cdot f_X(g^{-1}(x)).$$
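For example, for $X\sim\mathcal N(0,1)$ and $g(x)=e^x$, so that $g^{-1}(x)=\log x$ for $x\gt0$, the formula yields the density of the log-normal distribution: $$f_{e^X}(x)=\left|\frac d{dx}\log x\right|\cdot f_X(\log x)=\frac 1{x\sqrt{2\pi}}\exp\left(-\frac{(\log x)^2}2\right),\qquad x\gt0.$$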
Censoring
In statistics, a data sample is censored if some observations are only partially specified, due to a lack of information that may, e.g., stem from a limited measurement range. In particular, censoring is not equivalent to truncation of data samples – a procedure that restricts the sample to a certain interval and fully omits the observations outside that interval, without recording whether the omitted values were above or below the thresholds.
The distinction becomes clearer with an example (a small code sketch at the end of this subsection mirrors it): consider an experiment in which a machine automatically measures the throwing distance of a ball. The data would be:
  • Complete, if every ball is detected with its exact throwing distance known.
  • Censored, if the machine can only detect the exact throwing distance within a range of, e.g., $10-50$ meters. However, it can still determine whether a ball was thrown less than $10$ meters or further than $50$ meters.
  • Truncated, if the machine could only detect throwing distances of, e.g., $10-50$ meters. Any ball thrown outside this range is fully omitted and doesn't appear in the data sample.
There are several types of censoring that describe the lack of information:
  • Left censoring happens when an observation is below a certain threshold, but its exact value is unknown.
    Example (see above): Machine that returns $"\lt 10"$ as the data value when a throw is less than $10$ meters, but states the exact distance for all others.
  • Right censoring is analogous to left censoring, just in the opposite direction.
    Example (see above): Machine that detects the distance for throws within a range of $0-50$ meters and returns $"\gt 50"$ for anything further than that.
  • Interval censoring is a combination of left and right censoring.
    Example (see above): Machine that returns the distance for the range of $10-50$ meters, but returns $"\lt 10"$ or $"\gt 50"$ for any throw outside this range.
  • Type I censoring describes the situation in which an experiment on subjects runs for a fixed amount of time or number of attempts and the subjects are tested for a certain event. If this event does not happen during the experiment's time/attempts, the results for these subjects are right censored.
    Example: Clinical trial that tests mortality rate over a span of $10$ years.
  • Type II censoring is similar to type I in that the experiment is stopped before the event has happened for all subjects. However, the experiment is not stopped after a predetermined time, but once the event count over all subjects passes a certain threshold. The results of the remaining subjects are then right censored.
    Example: Clinical trial that tests lethality of a drug and is stopped after $10\%$ of the participants have died.
  • Type III / random censoring defines the case in which the experiment may be stopped independently for each subject, before the event of interest has happened. The stopping time is neither predetermined, nor statistically dependent on the event count. If the experiment is stopped for a subject before the event happened, its result is right censored.
    Example: Clinical trial that tests life expectancy after some illness, where the participants might drop out at any time, making the exact time of death unknown.
Censored data may be handled by the Tobit model.
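The sketch below mirrors the throwing-distance example: it turns a hypothetical sample of exact distances into an interval-censored and a truncated version (a minimal numpy sketch; the thresholds $10$ and $50$ and the simulated distances are illustrative assumptions).
```python
import numpy as np

rng = np.random.default_rng(0)
throws = rng.uniform(0, 70, size=10)     # hypothetical "true" throwing distances in meters
lo, hi = 10.0, 50.0                      # detection range of the machine from the example

# Interval censoring: every throw stays in the sample, but outside the detection range
# only "<10" / ">50" is recorded instead of the exact distance.
censored = [f"{d:.1f}" if lo <= d <= hi else ("<10" if d < lo else ">50") for d in throws]

# Truncation: throws outside the range are dropped from the sample entirely.
truncated = throws[(throws >= lo) & (throws <= hi)]

print(censored)      # same length as throws, but only partially specified
print(truncated)     # shorter array, out-of-range throws are omitted
```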

Theorems

Fisher-Tippett-Gnedenko Theorem
The set of extreme value distributions is given by $$\{G_\gamma(ax+b)|a>0,~b\in\mathbb{R}\},$$ where $$G_\gamma(x)=\begin{cases}\exp\left(-(1+\gamma x)^{-\frac 1\gamma}\right),&\text{ for }1+\gamma x>0,~\gamma\in\mathbb{R}\backslash \{0\}\\ \exp\left(-e^{-x}\right),&\text{ for }\gamma=0.\end{cases}$$ The coefficient $\gamma$ is also called the extreme value index. Three different types of distributions can be distinguished for different values of $\gamma$:
  • In the case of $\gamma\lt0$, we acquire reverse Weibull distributions for $\alpha=-\frac 1{\gamma}$: $$\Psi_\alpha(x):=G_{\gamma}\left(\frac{1+x}{-\gamma}\right)= \begin{cases} \exp(-(-x)^\alpha),& \text{for }x\lt0,\\ 1,&\text{for }x\ge0.\end{cases}$$
  • The distribution $G_0(x)=\exp\left(-e^{-x}\right)$ with $x\in\mathbb{R}$, which we get for $\gamma=0$, is labeled Gumbel distribution.
  • If $\gamma\gt0$, we get Fréchet distributions by setting $\alpha=\frac 1\gamma$: $$\Phi_\alpha(x):=G_\gamma\left(\frac {x-1}\gamma\right)=\begin{cases}0,&\text{for }x\le0,\\ \exp\left(-x^{-\alpha}\right),&\text{for }x\gt0.\end{cases}$$
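As a short worked example for the Fréchet case, take a standard Pareto sample with $F(x)=1-x^{-\alpha}$ for $x\ge1$ and some $\alpha\gt0$. With $a_n=n^{1/\alpha}$ and $b_n=0$ we get, for every fixed $x\gt0$, $$\mathbb P\left(\frac{X_{n:n}}{n^{1/\alpha}}\le x\right)=F\left(n^{1/\alpha}x\right)^n=\left(1-\frac{x^{-\alpha}}n\right)^n\underset{n\to\infty}{\longrightarrow}\exp\left(-x^{-\alpha}\right)=\Phi_\alpha(x).$$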
Delta method
Let $(X_n)_{n\in\mathbb N}$ be a sequence of random variables, s.t. $$\sqrt n\big(X_n-\theta\big)\overset{d}{\longrightarrow}\mathcal N(0,\sigma^2)$$ holds for finite constants $\theta$ and $\sigma^2$. Then, for any function $g$ that is differentiable at $\theta$ with $g'(\theta)\ne0$, we have $$\sqrt n\big(g(X_n)-g(\theta)\big)\overset{d}{\longrightarrow}\mathcal N(0,\sigma^2(g'(\theta))^2).$$
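For example, let $X_n=\overline X_n$ be the sample mean of an $iid$ sample with mean $\mu\ne0$ and finite variance $\sigma^2$, so that the central limit theorem gives $\sqrt n(\overline X_n-\mu)\overset{d}{\longrightarrow}\mathcal N(0,\sigma^2)$. Applying the delta method with $g(x)=x^2$ and $g'(\mu)=2\mu$ yields $$\sqrt n\left(\overline X_n^2-\mu^2\right)\overset{d}{\longrightarrow}\mathcal N\left(0,4\mu^2\sigma^2\right).$$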
Continuous mapping theorem
(Special case) Let $\{X_n\}, X$ be random variables on $\mathbb R$ and $g$ any real function, whose discontinuity points form a null set w.r.t. the probability measure of $X$. Then, the following implications hold: \begin{align} X_n\overset{d}{\longrightarrow}X&~~\Rightarrow~~ g(X_n)\overset{d}{\longrightarrow}g(X),\\ X_n\overset{\mathbb P}{\longrightarrow}X&~~\Rightarrow~~ g(X_n)\overset{\mathbb P}{\longrightarrow}g(X) \text{ and}\\ X_n\overset{a.s.}{\longrightarrow}X&~~\Rightarrow~~ g(X_n)\overset{a.s.}{\longrightarrow}g(X).\\ \end{align}
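For instance, since $g(x)=x^2$ is continuous everywhere, $X_n\overset{d}{\longrightarrow}\mathcal N(0,1)$ implies $X_n^2\overset{d}{\longrightarrow}\chi^2_1$.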
Chebyshev's inequality
Let $X$ be a random variable with $\mathbb E[X]=\mu\lt\infty$ and finite, nonzero variance $Var(X)=\sigma^2$. Then, for any real $k\gt0$, $$\mathbb P(|X-\mu|\ge k\sigma)\le\frac1{k^2}.$$
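For example, $k=2$ gives $\mathbb P(|X-\mu|\ge2\sigma)\le\frac14$, i.e. at least $75\%$ of the probability mass lies within two standard deviations of the mean, regardless of the distribution.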
Central limit theorem
(Lindeberg-Lévy) Let $\{X_n\}_{n\in\mathbb N}$ be a sequence of $iid$ random variables with $\mathbb E[X_1]=\mu$ and $Var(X_1)=\sigma^2\lt\infty$. Then, $$\sqrt n\left(\left(\frac 1n\sum_{i=1}^n X_i\right)-\mu\right)\overset{d}{\longrightarrow} \mathcal N(0,\sigma^2)$$ holds.
(Multidimensional) There is also a central limit theorem for $iid$ random vectors $\{X_n\}_{n\in\mathbb N}\subseteq \mathbb R^k$ with $\mathbb E[X_1]=\mu\in\mathbb R^k$ and finite covariance matrix $\Sigma\in\mathbb R^{k\times k}$. It follows that $$\sqrt n(\overline X_n-\mu)\overset{d}{\longrightarrow}\mathcal N_k(0,\Sigma)$$ holds, where $\overline X_n$ denotes the sample mean vector and $\mathcal N_k$ the multivariate normal distribution.
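The one-dimensional statement can be checked quickly by simulation. The sketch below (assuming numpy and scipy; the Exp(1) sample with $\mu=\sigma^2=1$ and the parameter choices are illustrative) standardizes sample means and compares them with $\mathcal N(0,1)$.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 200, 50_000                  # sample size and number of simulated means

# Exp(1) has mu = 1 and sigma^2 = 1, so the CLT predicts
# sqrt(n) * (sample mean - 1) to be approximately N(0, 1) for large n.
means = rng.exponential(size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (means - 1.0)

print(z.mean(), z.var())                        # close to 0 and 1
print(stats.kstest(z, "norm").statistic)        # small Kolmogorov-Smirnov distance to N(0,1)
```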
Law of large numbers
There are two forms of laws of large numbers.
(weak LLN) Let $\{X_n\}_{n\in\mathbb N}$ be a sequence of $iid$, Lebesgue integrable random variables with $\mathbb E[X_1]=\mu$ and possibly infinite variance. Then, $\overline X_n\overset{\mathbb P}\longrightarrow\mu$ holds for the sample mean $\overline X_n=\frac 1n\sum_{i=1}^nX_i$.
(strong LLN) Under the same assumptions, the stronger statement $\overline X_n\overset{a.s.}\longrightarrow\mu$ holds. Unlike the weak version, it requires all $X_n$ to be defined on a common probability space, so there are settings in which only the weak law applies.
Law of the iterated logarithm
(Asymptotic behavior of sums of random variables) Let $X_1,X_2,\dots$ be a sequence of $iid$ random variables with zero mean and unit variance.
Then, the following limits hold: $$\underset{n\to\infty}\limsup \frac{\sum_{i=1}^nX_i}{\sqrt{2n\log(\log(n))}}\overset{a.s.}=1$$ and $$\underset{n\to\infty}\liminf \frac{\sum_{i=1}^nX_i}{\sqrt{2n\log(\log(n))}}\overset{a.s.}=-1.$$