Probability – Probably a good Idea

The concept of probability is ubiquitous in our modern world, but how does it work, and how do we construct these tools from the ground up?

The Probability Space

The underlying tool we will use for probability is the probability space, which consists of 3 main items:

  • \Omega, the sample space, which is the collection of outcomes.
  • \mathcal{F} \subseteq \mathcal{P}( \Omega), the event space, which is the collection of possible events, which are collections of elements of the sample space.
  • \mathbb{P}: \mathcal{F} \to [0,1], the probability map, which gives the probability of a given event.

Example

Let’s take the example of a six-sided die roll…

  • \Omega would be the possible outcomes, so \Omega =  \{1,2,3,4,5,6\}
  • \mathcal{F} would be all possible subsets of \Omega, i.e. \mathcal{F} = \mathcal{P}(\Omega). This is because not only are there events such as the die landing on 4 (which would be \{4\}), but there are also events such as the die landing on an even number (which would be \{2,4,6\}).
  • Provided the die is fair, \mathbb{P} would simply be \mathbb{P}(A) = \frac{1}{6} \text{num}(A), where \text{num}(A) is the number of elements in A.
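This example is small enough to compute directly. Below is a quick sketch in Python; the name `prob` is our own choice, and exact fractions are used so the arithmetic matches the formulas above.

```python
from fractions import Fraction

# Sample space for a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(A) = num(A) / 6, assuming the die is fair and event is a subset of omega."""
    return Fraction(len(event), 6)

p_four = prob({4})        # the die lands on 4
p_even = prob({2, 4, 6})  # the die lands on an even number
```

Here `p_four` works out to 1/6 and `p_even` to 1/2, exactly as the formula \mathbb{P}(A) = \frac{1}{6}\text{num}(A) predicts.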

Axioms of Probability

These then obey certain axioms of probability. First, \mathcal{F} must be a \sigma-algebra on \Omega. This is a type of structure that satisfies these 3 main properties:

  • \emptyset \in \mathcal{F}
  • A \in \mathcal{F} \implies \Omega \setminus A \in \mathcal{F}
  • A_1, A_2, A_3, \ldots \in \mathcal{F} \implies A_1 \cup A_2 \cup A_3 \cup \ldots \in \mathcal{F}

Or in other words

  • The event of nothing is an event
  • If some event can happen, then it not happening is also an event that can happen.
  • If A_i are a list of events that can happen, then the event that any of A_i occur is an event that can happen.

And \mathbb{P} must satisfy the following axioms:

  • \mathbb{P}(\Omega)=1
  • \mathbb{P}(A) \ge 0 for all A \in \mathcal{F}
  • Given A_i \in \mathcal{F} pairwise disjoint, then \mathbb{P}(\bigcup_i A_i) = \sum_i \mathbb{P}(A_i)

Which means that,

  • The probability that any outcome will happen is 1, or in other words, at least one outcome must happen.
  • The probability of any event must be \ge 0.
  • Given any events A_i that are pairwise disjoint, meaning A_i \cap A_j = \emptyset whenever i \ne j (which is stronger than, but can loosely be thought of as, \mathbb{P}(A_i \cap A_j) = 0), the probability that any of the events happens is the sum of the probabilities of each event.
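These axioms can be checked by hand for the fair-die example above. A minimal sketch, reusing the die's probability map (the names `prob`, `a`, `b` are illustrative):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Fair-die probability map from the earlier example.
    return Fraction(len(event), 6)

# Axiom 1: some outcome must happen, so P(omega) = 1.
total = prob(omega)

# Axiom 3: additivity over disjoint events.
a, b = {1, 2}, {5, 6}  # disjoint: a & b is empty
additive = prob(a | b) == prob(a) + prob(b)
```

For these disjoint sets, \mathbb{P}(A \cup B) = 4/6 on both sides of the comparison, as the third axiom requires.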

Bayes Theorem

Now that we have defined our probability function, we can use it to discover some facts about the probability of events. Note that in all the following discussions we assume, unless stated otherwise, that all probabilities are >0; this simplifies some of the upcoming work.

One important tool for this is going to be the symbol |, which in words means “given”. Thus \mathbb{P}(A|B) means the probability that the event A occurs, given that the event B occurred. We also need to define something called a partition: this is a collection of events B_i for i \in I such that all the B_i are disjoint and together \bigcup B_i = \Omega. In words, this is a collection of events such that exactly one of them will occur. For example, if we are rolling a die, a partition could be B_1 being “the number is odd” and B_2 being “the number is even”.

The first fact that will be of use for us is that

\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}

This actually does not need a proof, as it is the definition of \mathbb{P}(A|B) (in a fully rigorous treatment the actual definition is far more complex and beyond the scope of this post).
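The definition is easy to compute for the die. A small sketch, with a helper `cond_prob` (our own name) built directly from the formula:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Fair-die probability map from the earlier example.
    return Fraction(len(event), 6)

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B), assuming P(B) > 0."""
    return prob(a & b) / prob(b)

# Probability the die shows a 4, given that it shows an even number.
p = cond_prob({4}, {2, 4, 6})
```

Knowing the die is even leaves three equally likely outcomes, so the answer is 1/3 rather than the unconditional 1/6.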

Law of total probability

The first consequence of our definitions above is the law of total probability (LTP), which states that given a partition B_i, then

\mathbb{P}(A) = \sum_{i\in I} \mathbb{P}(A|B_i)\mathbb{P}(B_i)

The proof of this is fairly simple, but will be omitted here (see if you can do it yourself!).

What does this tell us? Essentially, it tells us that the probability that an event occurs can be divided up across the members of a partition, weighted by the chance that our event occurs given that each member occurs.
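We can verify the LTP numerically for the die with the odd/even partition mentioned earlier. This sketch reuses the fair-die setup (names are ours, not standard):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), 6)

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B)."""
    return prob(a & b) / prob(b)

a = {1, 2, 3, 4}                    # the die shows at most 4
partition = [{1, 3, 5}, {2, 4, 6}]  # B_1 = odd, B_2 = even
lhs = prob(a)                       # P(A) directly
rhs = sum(cond_prob(a, b) * prob(b) for b in partition)  # LTP sum
```

Both sides come out to 2/3, as the law predicts.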

Bayes Theorem

You may have noticed that we can rearrange the formula for \mathbb{P}(A|B) to get a formula for \mathbb{P}(A \cap B), which is symmetric in A and B, thus

\mathbb{P}(A|B)\mathbb{P}(B) = \mathbb{P}(B|A)\mathbb{P}(A)

Hence,

\mathbb{P}(B|A) = \frac{\mathbb{P}(A|B)\mathbb{P}(B)}{\mathbb{P}(A)}

This is Bayes theorem (some people apply the law of total probability to \mathbb{P}(A) in their statement). Essentially it tells us that we can reverse conditional probabilities.
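The reversal can be checked directly on the die. A short sketch (event choices and names are illustrative), again reusing the fair-die setup from the example above:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event), 6)

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B)."""
    return prob(a & b) / prob(b)

a = {2, 4, 6}  # the die shows an even number
b = {4, 5, 6}  # the die shows more than 3
# Bayes theorem: recover P(B|A) from P(A|B).
bayes_rhs = cond_prob(a, b) * prob(b) / prob(a)
```

The right-hand side of Bayes theorem agrees with computing \mathbb{P}(B|A) directly; both equal 2/3 here.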

Random Variables

The next key tool in the study of probability is the random variable. We can think of these as objects with a random value, such as the value on a die, the number of people in a room, or the height of a person.

There are 2 main types of such variables: continuous and discrete. Technically a random variable can be neither, but this only occurs in specific circumstances which won’t be discussed here.

The basic construction

A random variable is a map from \Omega to some output space E. Very commonly E will be \mathbb{R}, and although this does not need to be the case, we shall assume E=\mathbb{R} from now on. We also require that a random variable satisfies that the set \{\omega : X(\omega) \le x\} is an event for every x. This then lets us give a value to each possible outcome: for example, if X was the year a person was born, and \omega was the outcome that we chose Kurt Gödel, then X(\omega) = 1906. We could then ask what the probability is that the birth year of a person is 1906, or more formally, what the value of \mathbb{P}(\{\omega: X(\omega) = 1906\}) is; this is often shortened to \mathbb{P}(X = 1906). In words, this is the probability that the sampled outcome is one that gives the year 1906.

If X has finitely or countably many outputs, then it is discrete, and hence by the axioms above \{\omega : X(\omega) = x\} is an event. We can then define p_X(x) = \mathbb{P}(X = x); we call p_X the probability mass function.

The definition of a continuous random variable is more complex, as \mathbb{P}(X = x) will be 0 for every x, so a mass function tells us nothing. Instead we say X is continuous if there exists a function f_X: \mathbb{R} \to \mathbb{R} such that f_X(x) \ge 0 for all x, \int_{-\infty}^{\infty} f_X(u) du = 1, and

F_X(x) = \mathbb{P}(X \le x) = \int_{-\infty}^{x} f_X(u) du

Here we call F_X the cumulative distribution function, and f_X the probability density function.
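As an illustration, we can approximate this integral numerically for a simple density. The sketch below uses the uniform distribution on [0, 1] (an illustrative choice, not something fixed by the text) and a midpoint Riemann sum:

```python
def f(x):
    """Probability density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def F(x, steps=10_000):
    """F_X(x) = P(X <= x), approximated by a midpoint Riemann sum of f.
    The density is zero below 0, so we only integrate from 0 to x."""
    if x <= 0.0:
        return 0.0
    width = x / steps
    return sum(f((i + 0.5) * width) for i in range(steps)) * width
```

For this density F_X(x) = x on [0, 1], and the sum reproduces that (e.g. F(0.25) ≈ 0.25), with F_X levelling off at 1 past the right endpoint.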

Why?

With this, we can define some useful properties. First, we might want to consider the mean (or average) of a random variable. To do this we will define the expectation; this is the value a random variable is expected to give us, and is denoted \mathbb{E}\left[X\right]. For discrete random variables, this is calculated by

\mathbb{E}\left[X\right] = \sum_{\text{possible outcomes } x} x p_X(x)

Or in words, the expectation is the weighted average of all the outcomes, weighted by the probability of getting each outcome. Moving to the continuous case, we replace sums with integrals to get

\mathbb{E}\left[X\right] = \int_{-\infty}^{\infty} xf_X(x) dx

Which in meaning is very similar to above, just using integrals.
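For the fair die this weighted average is easy to compute by hand, and a few lines of Python confirm it (the dictionary `p_X` mirrors the probability mass function):

```python
from fractions import Fraction

# pmf of a fair six-sided die: p_X(x) = 1/6 for x in {1, ..., 6}.
p_X = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over possible outcomes x of x * p_X(x)
mean = sum(x * p for x, p in p_X.items())
```

The mean is 7/2 = 3.5, which notably is not itself a possible outcome; the expectation is an average, not a prediction of a single roll.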

Another property we can define is the variance. Here we will use the expectation to say that the variance is the expected squared distance from the mean (squared so as to make it non-negative), and hence it is calculated as

Var(X) = \mathbb{E}\left[\left(X -\mathbb{E}\left[X\right]\right)^2\right]

Or the much easier to compute version

Var(X) = \mathbb{E}\left[X^2\right] - \mathbb{E}\left[X\right]^2

One could then take the square root of this to get the standard deviation, denoted \sigma_X.
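Both formulas for the variance can be checked against each other on the fair die. A small sketch, with a generic `expectation` helper (our own construction) that computes \mathbb{E}[g(X)] for any function g:

```python
from fractions import Fraction

# pmf of a fair six-sided die.
p_X = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g):
    """E[g(X)] for a discrete random variable with pmf p_X."""
    return sum(g(x) * p for x, p in p_X.items())

mean = expectation(lambda x: x)
var_def = expectation(lambda x: (x - mean) ** 2)     # E[(X - E[X])^2]
var_alt = expectation(lambda x: x ** 2) - mean ** 2  # E[X^2] - E[X]^2
```

Both expressions give 35/12, confirming that the two formulas agree, as expanding the square shows they must.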
