Appendix B — Essential Probability for Data Science
In data science, we often model data as if it were generated by a particular random process. We then compute probabilities under that model and compare the likelihoods of different observations, allowing us to make decisions.
Random processes can be finite, taking only a few possible values (flipping a coin has only two outcomes); discrete, taking "countably" many values (such as the number of emails received in an hour, where we do not rule out any count as "too high"); or continuous, taking any value within a range (such as randomly selecting a person and measuring their height).
We will focus on finite probability in this class. Occasionally, I’ll share how the results differ when you have to work with infinitely many possible outcomes, e.g. if you have a continuous-valued random process.
B.1 Probability terms and symbols
Sample space \(\Omega\): the set of all possible outcomes of a random process
Outcome \(x\): An individual possible observation from a random process (an element of the set \(\Omega\) )
\(P(x)\): The probability of observation/outcome \(x\)
Event \(A\): A combination of outcomes (a subset of \(\Omega\))
\(P(x \in A)\): The probability that an outcome belongs to the event \(A\), also written \(P(A)\)
If \(\Omega\) is a finite set (that is, we’re not working with calculus yet), then in order to call \(P\) a probability rule for \(\Omega\), we need:
\(0 \leq P(x) \leq 1\) for every outcome \(x\) in \(\Omega\) (We never have probabilities below 0 or over 1)
The probability of any event \(E\) is equal to the sum of the probabilities of its outcomes: \(P(x \in E) = \sum_{x\in E} P(x)\)
The total probability is 1: \(P(x \in \Omega) = 1\). That is, \(\sum_{x \in \Omega} P(x) = 1\)
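As a quick check, here is a small sketch (the sample space and weights below are made up for illustration) verifying that a candidate rule \(P\) satisfies these requirements:

```python
from fractions import Fraction

# A toy probability rule on a small finite sample space (hypothetical weights).
P = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

print(all(0 <= p <= 1 for p in P.values()))  # True: no probability below 0 or above 1
print(sum(P.values()) == 1)                  # True: probabilities sum to 1

# The probability of an event is the sum over its outcomes:
event = {"a", "c"}
print(sum(P[x] for x in event))              # 2/3
```

Using `Fraction` keeps the arithmetic exact, so the sums check out with no floating-point error.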
B.1.1 Example. Flipping a coin.
Suppose we flip a coin 3 times and record the results. All of the below outcomes are equally likely.
\(\Omega = \{ HHH, HHT, HTH, THH, HTT, THT,TTH, TTT \}\)
Any single element of \(\Omega\) is an outcome.
\(P(HHH) = \frac{1}{8}\)
If \(A=\) the event where we obtain exactly 2 Heads, then \(P(A)=\frac{3}{8}\)
If \(B =\) the event where we obtain a Heads on the last flip then \(P(B)=\frac{4}{8}=0.5\)
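These counts are easy to verify by enumerating the sample space; the sketch below (names like `omega`, `A`, and `B` are our own) lists all \(2^3 = 8\) outcomes and counts the two events above:

```python
from itertools import product
from fractions import Fraction

# All sequences of 3 flips: HHH, HHT, ..., TTT
omega = ["".join(flips) for flips in product("HT", repeat=3)]
p = Fraction(1, len(omega))  # each outcome equally likely: 1/8

A = [x for x in omega if x.count("H") == 2]  # exactly 2 Heads
B = [x for x in omega if x[-1] == "H"]       # Heads on the last flip

print(len(omega))   # 8
print(p * len(A))   # 3/8
print(p * len(B))   # 1/2
```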
B.2 Combining Events
The following notation refers to forms of event combination:
\(P(A \cup B)\), read “probability of \(A\) or \(B\)” or “\(A\) union \(B\),” is the probability that an outcome belongs to event \(A\), event \(B\), or both.
\(P(A \cap B)\), read “probability of \(A\) and \(B\)” or “\(A\) intersect \(B\),” is the probability that an outcome belongs to both event \(A\) and event \(B\).
\(P(A^c)\), read “probability of the complement of \(A\),” is the probability that an outcome does not belong to event \(A\).
We often express these visually in a Venn diagram.
General Addition Rule: \(P(A \cup B) = P(A)+P(B)-P(A \cap B)\)
Complements Rule: \(P(A^c) = 1-P(A)\)
B.2.1 Example: Combining our events from before.
If \(A=\) the event where we obtain exactly 2 Heads, and if \(B =\) the event where we obtain a Heads on the last flip then:
\(P(A \cap B) = P(HTH)+P(THH) = \frac{1}{8}+\frac{1}{8} = \boxed{\frac{1}{4}}\)
To compute \(P(A \cup B)\), we could count up all the outcomes where we obtain exactly 2 Heads, obtain Heads on the last flip, or both; instead, we can use the general addition rule:
\[ P(A \cup B) = P(A)+P(B)-P(A\cap B) = \frac{3}{8} + \frac{4}{8} - \frac{2}{8} = \boxed{\frac{5}{8}} \]
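We can confirm the general addition rule on this example by brute-force enumeration (an illustrative sketch; the set names are our own):

```python
from itertools import product
from fractions import Fraction

omega = {"".join(flips) for flips in product("HT", repeat=3)}
p = Fraction(1, 8)

A = {x for x in omega if x.count("H") == 2}  # exactly 2 Heads
B = {x for x in omega if x.endswith("H")}    # Heads on the last flip

lhs = p * len(A | B)                             # count the union directly
rhs = p * len(A) + p * len(B) - p * len(A & B)   # general addition rule
print(lhs, rhs)  # 5/8 5/8
```

Counting the union directly and applying the rule agree, as they must: subtracting \(P(A \cap B)\) corrects for the outcomes counted twice.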
B.3 Independence and Dependence
In general, knowing whether one event occurs might or might not inform you about whether a second event occurs.
We write \(P(A|B)\) for the probability an event \(A\) occurs, given an event \(B\) is known to have occurred.
The probability that event \(A\) occurs, given (conditioned on) event \(B\), is given by:
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
For our events \(A\) and \(B\) from above,
\[ P(A|B) = \frac{1/4}{1/2} = \frac{1}{2} \]
However, \(P(B|A) = \frac{1/4}{3/8} = \frac{2}{3}\).
If it happens to be the case that \(P(A)=P(A|B)\) for two events \(A\) and \(B\), then we call them independent. Equivalently, we could check whether \(P(A \cap B)=P(A)\cdot P(B)\).
Because \(P(A)=\frac{3}{8}\) while \(P(A|B)=\frac{1}{2}\), it is NOT the case that \(P(A)=P(A|B)\); equivalently, \(P(A \cap B)=\frac{1}{4}\) while \(P(A)P(B)=\frac{3}{16}\), so it is NOT the case that \(P(A \cap B) = P(A)P(B)\). That is, knowing that we obtained Heads on the last flip gives us additional information as to whether we flipped exactly 2 Heads.
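The conditional probabilities and the independence check can be verified directly from the definition \(P(A|B) = P(A\cap B)/P(B)\); the helper `prob` below is our own:

```python
from itertools import product
from fractions import Fraction

omega = {"".join(flips) for flips in product("HT", repeat=3)}
A = {x for x in omega if x.count("H") == 2}  # exactly 2 Heads
B = {x for x in omega if x.endswith("H")}    # Heads on the last flip

def prob(event):
    # Equally likely outcomes: probability = (# outcomes in event) / (# outcomes)
    return Fraction(len(event), len(omega))

p_A_given_B = prob(A & B) / prob(B)  # 1/2
p_B_given_A = prob(A & B) / prob(A)  # 2/3
independent = prob(A & B) == prob(A) * prob(B)
print(p_A_given_B, p_B_given_A, independent)  # 1/2 2/3 False
```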
B.4 Bayes’ Theorem
If \(A\) and \(B\) are any two events, Bayes’ Theorem gives us a way to “reverse the order of conditioning”:
\(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)
Proof
Rearrange the formulas defining \(P(A|B)\) and \(P(B|A)\) to solve for \(P(A\cap B)\).
\(P(A\cap B) = P(B)P(A|B)\) and \(P(A\cap B) = P(A)P(B|A)\)
Setting these equal to each other:
$$ P(A|B)P(B) = P(B|A)P(A)$$
Then divide both sides by \(P(B)\).
Example: A person goes to get tested for COVID. During this particular week, the clinic knows that about 20% of people who get tested are actually positive for COVID. They also know that if a person has COVID, the probability the test shows a true positive is 90%, and if a person does not have COVID, the probability the test shows a true negative is 99%. Given that a person tests negative, what is the probability (likelihood) they have COVID anyway?
Solution
We want \(P(covid|negative)\). We know \(P(negative|covid) = 1 - 0.9 = 0.1\) and \(P(covid)=0.2\). We also need \(P(negative)\), which we can obtain from a tree diagram (or the law of total probability):
\[ P(negative) = P(negative|covid)P(covid) + P(negative|covid^c)P(covid^c) = (0.1)(0.2)+(0.99)(0.8) = 0.812 \]
Applying Bayes’ Theorem:
\[ P(covid|negative) = \frac{P(negative|covid)P(covid)}{P(negative)} = \frac{(0.1)(0.2)}{0.812} \approx 0.025 \]
So even after a negative test, there is still about a 2.5% chance the person has COVID.
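The tree-diagram arithmetic can be sketched in a few lines (variable names are our own):

```python
# Numbers from the example: base rate, true positive rate, true negative rate.
p_covid = 0.2
p_neg_given_covid = 0.10    # false negative rate: 1 - 0.90 true positive rate
p_neg_given_healthy = 0.99  # true negative rate

# Law of total probability: both branches of the tree that end in "negative".
p_neg = p_neg_given_covid * p_covid + p_neg_given_healthy * (1 - p_covid)

# Bayes' theorem: P(covid | negative) = P(negative | covid) P(covid) / P(negative)
p_covid_given_neg = p_neg_given_covid * p_covid / p_neg
print(round(p_neg, 3))              # 0.812
print(round(p_covid_given_neg, 3))  # 0.025
```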
Some people dislike the use of the term “probability” in this context, because it makes it sound like the person could randomly be-or-not-be sick. For these scenarios, we often use the term “likelihood” instead of “probability.” This is primarily to make clear that any individual person is definitively sick or not sick (even if we cannot know for sure); but if we were to pick a random person from all the people who test negative, we expect the probability that person is sick to be given by \(P(sick|negative)\).