Distributions

Raoul Grouls

Distributions

A probability distribution is:

a mathematical description

of a random phenomenon

in terms of all its possible outcomes

and their associated probabilities

Distributions

In this image, what do you expect to be:

  • all the possible outcomes?
  • the associated probabilities?

Distributions: types

The main types of distributions are:

  • Discrete: when an outcome can only take discrete values (e.g. number of birds)
  • Continuous: when outcomes take continuous values (e.g. blood pressure)

Distributions: basic visualisation

Every horizontal line you draw can be interpreted as a continuous distribution. Every barplot as a discrete distribution.

All the distributions we are going to discuss are variations of these two basic types!

For parametric distributions, we have a formula that describes the line / bars. You just put in the parameters, and the output is the line / bars.
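As a small sketch in Python (the binomial and normal parameters below are arbitrary, chosen only for illustration): plug parameters into the formula and out come the bars of a pmf or the line of a pdf.

```python
import numpy as np
from scipy import stats

# Discrete: the binomial pmf with parameters n=10, p=0.3 gives the bars
x_discrete = np.arange(0, 11)
bars = stats.binom.pmf(x_discrete, n=10, p=0.3)

# Continuous: the normal pdf with parameters mu=0, sigma=1 gives the line
x_continuous = np.linspace(-4, 4, 101)
line = stats.norm.pdf(x_continuous, loc=0, scale=1)

print(bars.round(3))       # heights of the bars
print(line[:5].round(3))   # first few points of the line
```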

Discrete distributions: PMF

A probability mass function (pmf) describes the probability distribution of discrete variables.

Consider a coin toss:

\[ f(x) = \begin{cases} 0.5 & x \text{ is heads} \\ 0.5 & x \text{ is tails} \end{cases} \]

This is the pmf of the Bernoulli distribution with \(p = 0.5\).
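A minimal sketch of this pmf in Python, encoding heads as 1 and tails as 0 (the encoding is an assumption of the example, not part of the definition):

```python
from scipy import stats

# A fair coin: Bernoulli with p = 0.5, heads encoded as 1 and tails as 0
coin = stats.bernoulli(0.5)

print(coin.pmf(1))  # P(X = heads) = 0.5
print(coin.pmf(0))  # P(X = tails) = 0.5
```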

Conditions for a PMF

  1. An event cannot have a negative probability
  2. The sum of the probabilities of all events must be 1
  3. The probability of a subset \(\mathscr{A}\) of the sample space \(\mathscr{S}\) is the same as the sum of the probabilities of its individual elements

Conditions for a PMF

Mathematical description

The pmf is a function \(f\) over the sample space \(\mathscr{S}\) of a discrete random variable \(X\), giving the probability that \(X\) is equal to a certain value: \[f(x) = P(X = x)\]

Each pmf satisfies these conditions: \[ \begin{aligned} &f(x) \geq 0 \quad \forall x \in \mathscr{S}\\ &\sum_{x \in \mathscr{S}} f(x) = 1 \end{aligned} \]

For a subset \(\mathscr{A} \subseteq \mathscr{S}\): \[P(\mathscr{A}) = \sum_{x_i \in \mathscr{A}} f(x_i)\]
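These conditions can be checked numerically; here is a sketch in Python for a binomial pmf (the parameters n=10, p=0.3 are arbitrary, chosen only for illustration):

```python
import numpy as np
from scipy import stats

# Sample space of a Binomial(n=10, p=0.3) random variable
sample_space = np.arange(0, 11)
f = stats.binom.pmf(sample_space, n=10, p=0.3)

# 1. no event has a negative probability
assert np.all(f >= 0)

# 2. the probabilities sum to 1
assert np.isclose(f.sum(), 1.0)

# 3. the probability of a subset equals the sum of its elements' probabilities
subset = [0, 1, 2]
assert np.isclose(stats.binom.cdf(2, n=10, p=0.3), f[subset].sum())
```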

Continuous distributions: PDF

For continuous distributions, we use a probability density function (pdf).

  1. \(f(x) \geq 0\) for all \(x\)
  2. The integral of the density over all possible values must be 1 (the area under the curve)
  3. The probability that \(X\) falls in the interval \([a,b]\) is the integral of \(f\) from \(a\) to \(b\): \(P(a \leq X \leq b) = \int_a^b f(x)\,dx\)
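The same ideas can be checked numerically for a pdf; this sketch uses the standard normal density and scipy's numerical integration:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

pdf = stats.norm(loc=0, scale=1).pdf

# 2. the area under the whole curve is 1
total_area, _ = quad(pdf, -np.inf, np.inf)
print(round(total_area, 6))  # ~1.0

# 3. P(a <= X <= b) is the integral of the pdf from a to b
a, b = -1, 1
prob, _ = quad(pdf, a, b)
print(round(prob, 4))  # ~0.6827
```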

Continuous distributions: Conditions for a PDF

This might look like unnecessary mathematical detail, but it is important to understand the difference between a pmf and a pdf.

Example: can you answer the question “What is the probability that your body temperature is exactly 37.0 °C?”

The answer might be unexpected: 0!

Let’s say your answer is 25%. But what if your temperature is 37.1 °C? Does that count? Or 37.01 °C?

Because the distribution is continuous, you can only say something about a range: “What is the probability that your temperature is between 36.5 and 37.2 °C?”
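A sketch of this in Python. The normal distribution with mean 36.8 °C and standard deviation 0.4 °C is a made-up assumption for this illustration, not a measured fact:

```python
from scipy import stats

# Hypothetical body-temperature distribution (parameters are made up)
temperature = stats.norm(loc=36.8, scale=0.4)

# A single point has no area under the curve, so its probability is 0
print(temperature.cdf(37.0) - temperature.cdf(37.0))  # 0.0

# A range does have an area under the curve
print(temperature.cdf(37.2) - temperature.cdf(36.5))  # ~0.61
```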

Quiz time

quiz

Normal Distribution: Central Limit Theorem

This is one of the distributions that is used most often.

A major reason for this is the central limit theorem: if you keep sampling values from a population and adding them up, the distribution of the sum tends towards a normal distribution.

Normal Distribution: Central Limit Theorem

Take a person’s height.

  • Height is determined by a combination of 180 genes.
  • One gene contributes to a longer neck, another to longer legs.
  • If we assume the genes contribute independently, height is the sum of 180 independent contributions.

Thus, height will be approximately normally distributed. So will the weight of wolves or the length of a penguin’s wing.
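A minimal simulation of this idea (a sketch: the gene contributions are modelled as independent uniform random numbers, which is an assumption made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 "people", each the sum of 180 independent gene contributions
contributions = rng.uniform(low=0.0, high=1.0, size=(10_000, 180))
heights = contributions.sum(axis=1)

# The sum of many independent contributions looks normal:
print(heights.mean().round(2))      # ~90, i.e. 180 * 0.5
print(heights.std().round(2))       # ~3.87, i.e. sqrt(180 / 12)
print(np.median(heights).round(2))  # close to the mean: a symmetric, bell-shaped result
```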

Log-Normal Distribution

However, multiplying values will give you a long tail!

This is the case when variables interact in some way, and are not independent.

\[4 + 4 + 4 + 4 = 16\]

but

\[4 * 4 * 4 * 4 = 256\]

Examples are stock prices, machine failures, ping times on a network, and income distributions.

Log-Normal Distribution

Multiplying values will give you a fat-tailed distribution! This will typically be a log-normal distribution:

if \(X\) is log-normally distributed, then \(Y = \log(X)\) will be normally distributed.
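A sketch contrasting the two cases with arbitrary positive factors (the choice of 50 uniform factors is only for illustration): the sum is symmetric, the product has a long right tail, and the log of the product is roughly normal again.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 samples, each built from 50 independent positive factors
factors = rng.uniform(low=0.5, high=1.5, size=(10_000, 50))

sums = factors.sum(axis=1)       # additive: roughly normal
products = factors.prod(axis=1)  # multiplicative: long right tail
logs = np.log(products)          # log of the product: roughly normal again

def skew(x):
    """Simple sample skewness: ~0 for a symmetric distribution."""
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

print(round(skew(sums), 2))      # close to 0
print(round(skew(products), 2))  # large and positive: the long tail
print(round(skew(logs), 2))      # close to 0 again
```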

Caveats: Simpson’s paradox

The shaded area is the 99% confidence interval.

Raise your hand if you think it is reasonable to say that increasing variable \(x\) will lower variable \(y\)

Caveats: Simpson’s paradox

Now I tell you that the \(x\) axis is the number of hours invested in studying, and the \(y\) axis is the average grade of a student.

Raise your hand if you think it is reasonable to say that increasing variable \(x\) will lower variable \(y\)
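A sketch of how such a pattern can arise, using made-up data (the three student groups, slopes and noise levels are all hypothetical, invented purely to illustrate the paradox): within each group more study hours mean higher grades, but pooling the groups flips the sign of the trend.

```python
import numpy as np

rng = np.random.default_rng(1)

hours, grades = [], []
# Three hypothetical groups: students with a higher baseline grade study fewer hours
for baseline, mean_hours in [(8.0, 5), (7.0, 15), (6.0, 25)]:
    h = rng.normal(loc=mean_hours, scale=2, size=100)
    g = baseline + 0.05 * (h - mean_hours) + rng.normal(scale=0.2, size=100)
    hours.append(h)
    grades.append(g)
    # Within each group, the slope of grade vs. hours is positive
    print(np.polyfit(h, g, deg=1)[0].round(3))

# Pooled over all groups, the slope is negative: Simpson's paradox
all_hours = np.concatenate(hours)
all_grades = np.concatenate(grades)
print(np.polyfit(all_hours, all_grades, deg=1)[0].round(3))
```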
