More Probability
The axioms of probability say what mathematical rules must be followed
in assigning probabilities to
events.
Let P(A) denote the probability of the event A. The axioms are rules the
function P must follow:
The Axioms of Probability
1. The probability of every event is at least zero.
   (For every event A, P(A) >= 0. There is no such thing as a negative
   probability.)
2. The probability of the entire outcome space is 100%.
   (P(S) = 100%. The chance that something in the outcome space occurs is
   100%, because the outcome space contains every possible outcome.)
3. If two events are disjoint, the probability that either happens is the
   sum of the probabilities that each happens.
   (If AB = {}, P(AUB) = P(A) + P(B).)
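To make the axioms concrete, here is a minimal sketch that checks them by brute force on a small outcome space. The fair six-sided die, and the names `outcome_space`, `prob`, and `all_events`, are assumptions chosen for illustration; they do not come from the text above.

```python
from fractions import Fraction
from itertools import chain, combinations

# Assumed example: one roll of a fair six-sided die.
outcome_space = frozenset(range(1, 7))

def prob(event):
    """P(event) when every outcome in outcome_space is equally likely."""
    return Fraction(len(frozenset(event) & outcome_space), len(outcome_space))

def all_events(space):
    """Every subset of the outcome space, that is, every event."""
    items = list(space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

events = all_events(outcome_space)

# Axiom 1: no event has negative probability.
axiom_1 = all(prob(a) >= 0 for a in events)
# Axiom 2: the outcome space has probability 1 (that is, 100%).
axiom_2 = prob(outcome_space) == 1
# Axiom 3: the probabilities of disjoint events add.
axiom_3 = all(
    prob(a | b) == prob(a) + prob(b)
    for a in events
    for b in events
    if not (a & b)
)

print(axiom_1, axiom_2, axiom_3)  # True True True
```

Exact fractions are used instead of floating-point numbers so that the equalities in axioms 2 and 3 can be tested without rounding error.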
Everything else that is mathematically true of probability is a consequence
of these axioms, and of further definitions.
For example, we have the complement rule:
The Complement Rule
The probability that an event A does not happen is 100% minus the
probability that A happens. Writing Ac for the complement of A (the event
that A does not happen),
P(Ac) = 100% - P(A).
The complement rule can be derived from the axioms: the union of A and
its complement is S (either A happens or it does not, and there
is no other possibility), so P(AUAc)
= P(S) = 100%, by axiom 2. The event A and its complement are disjoint
(if “A does not happen” happens, A does not happen; if A happens, “A does
not happen” does not happen), so P(AUAc)
= P(A) + P(Ac) by axiom 3. Putting these together, we get P(A)
+ P(Ac) = 100%. If we subtract P(A) from both sides of this
equation, we get what we sought: P(Ac) = 100%-P(A).
A special case of the complement
rule is that
P({}) = 0%,
because P(S) = 100%, and Sc = {}.
An event A that has probability one is said to be certain
or sure. S is certain.
The union of two events, AUB, can be broken up into three disjoint sets:
- elements of A that are not in B (ABc)
- elements of B that are not in A (AcB)
- elements of both A and B (AB)
Together, these three sets contain every element of AUB.
Therefore, the chance that either A or B occurs is
P(AUB) = P(ABc U AcB U AB).
The three sets on the right are disjoint, so the third axiom implies
that
P(AUB) = P(ABc) + P(AcB) + P(AB).
On the other hand,
P(A) = P(ABc U AB) = P(ABc) + P(AB),
because ABc and AB are disjoint. Similarly,
P(B) = P(AcB U AB) = P(AcB) + P(AB),
because AcB and AB are disjoint. Adding, we find
P(A) + P(B) = P(ABc) + P(AcB) + 2×P(AB).
This would be P(AUB), but for the
fact that P(AB) is counted twice, not once. It follows that, in general,
P(AUB) = P(A) + P(B) – P(AB).
Again, while this is a true statement, it is not one of the axioms
of probability. In the special case that AB = {}, this reduces to the
third axiom, because, as we saw above, P({}) = 0%.
It follows that
P(AUB) <= P(A) + P(B),
because, by axiom 1, P(AB) >= 0.
Moreover, because taking a union can only include additional outcomes,
P(AUB) >= P(A), and
P(AUB) >= P(B).
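As a quick numerical check of the formula for P(AUB) and the inequalities that follow from it, here is a small sketch. The fair die and the two events A and B are assumed examples, not taken from the text.

```python
from fractions import Fraction

outcome_space = set(range(1, 7))  # one roll of a fair die (assumed example)
A = {2, 4, 6}                     # the roll is even
B = {4, 5, 6}                     # the roll is greater than 3

def prob(event):
    """P(event) for equally likely outcomes."""
    return Fraction(len(event & outcome_space), len(outcome_space))

p_union = prob(A | B)                        # computed directly from the set AUB
p_formula = prob(A) + prob(B) - prob(A & B)  # P(A) + P(B) - P(AB)

print(p_union, p_formula)            # 2/3 2/3
print(p_union <= prob(A) + prob(B))  # True
print(p_union >= prob(A), p_union >= prob(B))  # True True
```

Here AB = {4, 6} is not empty, so simply adding P(A) and P(B) would give 100%, which counts the outcomes 4 and 6 twice.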
Probability is analogous to area or volume or mass. Consider the unit
square, which has length unity on each side. Its total area is 1 (= 100%).
Let’s call the square S, just like outcome space. Now consider regions
inside the square S (subsets of S). The area of any such
region is at least zero, the area of S is 100%, and the area of
the union of two regions is the sum of their areas if they do not overlap
(i.e., if their intersection is empty). These facts are direct analogues of the
axioms of probability, and we shall often use this model to get intuition
about probability.
A further analogy that I find useful is to consider the square S
to be a dartboard. A trial or experiment consists of throwing a dart at
the board once. The event A occurs if the dart sticks in the set A. The
event AB occurs if the dart sticks in both A and B on that one toss.
Clearly, AB cannot occur unless A and B overlap: the dart cannot stick
in two places at once. AUB occurs if
the dart sticks in either A or B (or both) on that one throw. A and B need
not overlap for AUB to occur.
This analogy is also useful for thinking about logical implication.
If A is a subset of B, the occurrence of A implies the occurrence of B;
we shall sometimes say that A implies B. In the dartboard model, the dart
cannot stick in A without sticking in B as well, so if A occurs, B must
occur also. If A implies B, AB=A, so P(AB)=P(A). If AB = {}, A implies
Bc and B implies Ac: if the dart sticks in A it did
not stick in B, and vice versa. If A implies B, then if B does not
occur A cannot occur either: Bc implies Ac,
so Bc is a subset of Ac.
Conditioning means updating
probabilities to incorporate new information. The conditional
probability of A given B is the probability of the event
A, updated on the basis of the knowledge that the event
B occurred. Suppose that AB = {} (A and B are disjoint).
Then if we learn that B occurred, we know A did not occur, so we should
revise the probability of A to be zero (the conditional probability of
A given B is zero). On the other hand, suppose that AB = B (B is a subset
of A, so B implies A). Then if we learn that B occurred, we know A must
have occurred as well, so we should revise the probability of A to be 100%
(the conditional probability of A given B is 100%).
For in-between cases, the conditional probability of A given B is defined
to be
P(A|B) = P(AB)/P(B),
provided P(B) is not zero (division by zero is undefined). “P(A|B)” is
pronounced “the (conditional) probability of A given B.” Why does this
formula make sense? First of all, note that it does give back the intuitive
answers we arrived at above: if AB = {}, then P(AB) = 0, so P(A|B) = 0/P(B)
= 0; and if AB = B, P(A|B) = P(B)/P(B) = 100%. Similarly, if we learned
that S occurred, this is not really new information (by definition,
S always occurs, because it contains all possible outcomes), so
we would like P(A|S) = P(A). This is how it works out: AS
= A, so P(A|S) = P(A)/P(S) = P(A)/100% = P(A).
Now suppose that A and B are not disjoint.
Then if we learn that B occurred, we can restrict attention to just those
outcomes that are in B, and disregard the rest of S, so we have
a new outcome
space that is just B. We need P(B) = 100% to consider B an outcome
space; we can make this happen by dividing all probabilities by P(B).
For A to have occurred in addition to B requires that AB occurred, so the
conditional
probability of A given B is P(AB)/P(B), just as we defined it above.
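The definition can be sketched in a few lines for an outcome space with equally likely outcomes. The die example and the function names `prob` and `cond_prob` are assumptions made for illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # one roll of a fair die (assumed example)

def prob(event):
    """P(event) for equally likely outcomes in S."""
    return Fraction(len(frozenset(event) & S), len(S))

def cond_prob(a, b):
    """P(A|B) = P(AB)/P(B); undefined when P(B) is zero."""
    a, b = frozenset(a), frozenset(b)
    if prob(b) == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
print(cond_prob(even, {4, 5, 6}))    # 2/3: an in-between case
print(cond_prob({1, 2}, {5, 6}))     # 0:   A and B are disjoint
print(cond_prob({4, 5, 6}, {5, 6}))  # 1:   B is a subset of A
print(cond_prob(even, S) == prob(even))  # True: learning S is no news
```

The last three lines reproduce the special cases discussed above: disjoint events, B a subset of A, and conditioning on the whole outcome space.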
Example. We deal two cards from a well-shuffled deck. What is
the conditional probability that the second card is an Ace (event A),
given that the first card is an Ace (event B)? This is P(AB)/P(B) by
definition. The (unconditional) chance that the first card is an Ace is
100%/13 = 7.7%, because there are 13 possible faces for the first card,
and all are equally likely. To find the chance that both cards are Aces,
we count: the two cards must be two of the four Aces (one from each
suit), and there are 4C2 = 6 ways to pick two of four.
The total number of ways of picking two cards from the deck is 52C2
= 52×51/2 = 1326, so the chance that the two cards are both Aces
is (6/1326)×100% = 0.45%, or about 0.5%. The conditional
probability that the second card is an Ace given that the first card
is an Ace is thus (6/1326)/(1/13) = 5.9%. (The division uses the exact
fractions; dividing the rounded percentages 0.5% and 7.7% would give a
slightly different answer.) As we might expect, it is somewhat
lower than the chance that the first card is an Ace, because we know one
of the Aces is gone. We could approach this more intuitively as well: given
that the first card is an Ace, the second card is an Ace too if it is one
of the three remaining Aces among the 51 remaining cards. These possibilities
are equally likely if the deck was shuffled well, so the chance is 3/51
× 100% = 5.9%.
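Both routes to the answer can be checked by listing every ordered deal of two cards. This is a sketch under the assumption of a standard 52-card deck; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Every ordered way to deal two different cards: 52 × 51 = 2652 deals,
# all equally likely if the deck is well shuffled.
deals = list(permutations(deck, 2))

first_is_ace = [d for d in deals if d[0][0] == "A"]         # event B
both_are_aces = [d for d in first_is_ace if d[1][0] == "A"]  # event AB

p_b = Fraction(len(first_is_ace), len(deals))
p_ab = Fraction(len(both_are_aces), len(deals))
p_a_given_b = p_ab / p_b

print(p_b)          # 1/13
print(p_ab)         # 1/221, the same as 6/1326
print(p_a_given_b)  # 1/17, the same as 3/51, about 5.9%
```

Counting ordered deals (2652 of them) instead of unordered pairs (1326) doubles both the numerator and the denominator of P(AB), so the probability is unchanged.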
Two events are independent
if learning that one occurred does not affect the chance that the other
occurred; that is, if P(A|B) = P(A) and P(B|A) = P(B). A slightly
more general way to write this is that A and B are independent if P(AB)
= P(A) × P(B). (This covers the case that either P(A), P(B), or both,
are equal to zero, while the definition in terms of conditional probability
requires the probability in the denominator to be positive.) To reiterate:
two events are independent if and only if the probability that both events
happen simultaneously is the product of their unconditional probabilities.
If two events are not independent, they are dependent.
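The product criterion is easy to test by enumeration. In the sketch below, the two fair dice and the events A, B, and C are assumed examples chosen to show one independent pair and one dependent pair.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (assumed example).
outcomes = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event & outcomes), len(outcomes))

def is_independent(a, b):
    """A and B are independent exactly when P(AB) = P(A) × P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {o for o in outcomes if o[0] % 2 == 0}  # first die shows an even number
B = {o for o in outcomes if sum(o) == 7}    # the two dice sum to 7
C = {o for o in outcomes if sum(o) == 8}    # the two dice sum to 8

print(is_independent(A, B))  # True:  P(AB) = 1/12 = (1/2) × (1/6)
print(is_independent(A, C))  # False: P(AC) = 1/12, but (1/2) × (5/36) = 5/72
```

Whatever the first die shows, exactly one face of the second die brings the sum to 7, so learning A tells us nothing about B. A sum of 8 is impossible when the first die shows 1, so A and C are dependent.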
Independence and Mutual Exclusivity are Different!
In fact, the only way two events can be both mutually exclusive and
independent is if at least one of them has probability zero. If two events
are mutually exclusive, learning that one of them happened tells us that
the other did not happen. This is clearly informative: the conditional
probability of the second event given the first is zero! This changes the
(conditional) probability of the second event unless its (unconditional)
probability was already zero.
Independent events bear a special relationship to each other. Independence
is a very precise point between being disjoint (so that one event implies
that the other did not occur), and one event being a subset of the other
(so that one event implies the other).
Recap:
- If two events are mutually exclusive, they cannot both occur in the same
trial: the probability of their intersection is zero. The probability of
their union is the sum of their probabilities.
- If two events are independent, they can both occur in the same trial
(except possibly if at least one of them has probability zero). The
probability of their intersection is the product of their probabilities.
The probability of their union is less than the sum of their probabilities,
unless at least one of the events has probability zero.
The following figure represents two events, A and B, as subsets of a rectangle.
The probabilities of the events are proportional to their areas. Try dragging
the events in the figure around to make them independent (that is, so that
the area of their intersection is the product of their areas). Notice that
it is not easy to do: to get the probability of the intersection equal
to the product of the probabilities requires just the right amount of overlap.
If A and B are independent, so are
- A and Bc
- Ac and Bc
- Ac and B.
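The first of these claims follows from results derived earlier; the short derivation below is a sketch, written with B^c for the complement the text writes as Bc and with 1 standing for 100%.

```latex
\begin{align*}
P(AB^c) &= P(A) - P(AB)
        && \text{because } P(A) = P(AB^c) + P(AB) \\
        &= P(A) - P(A)\,P(B)
        && \text{because $A$ and $B$ are independent} \\
        &= P(A)\,\bigl(1 - P(B)\bigr) \\
        &= P(A)\,P(B^c)
        && \text{by the complement rule.}
\end{align*}
```

Swapping the roles of A and B gives the same result for Ac and B. Applying the argument once more, this time to the independent pair Bc and A, shows that Bc and Ac are independent as well.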
What kinds of events are independent? The outcomes of successive tosses
of a fair coin, the outcomes of random draws from a box with replacement,
etc. Draws without replacement are dependent, because what
can happen on a given draw depends on what happens on previous draws.
Example: Suppose I have a box with four tickets in it, labeled
1, 2, 3, and 4. I stir the tickets and then pick one, stir them again without
replacing the ticket I got, and pick another. Consider the event A = {I
get the ticket labeled 1 on the first draw} and the event B = {I get the
ticket labeled 2 on the second draw}. Are these events dependent or independent?
Solution: The chance that I get the 1 on the first draw is 25%.
The chance that I get the 2 on the second draw is 25%. The chance that
I get the 2 on the second draw given that I get the 1 on the first draw
is 33%, which is much larger than the unconditional chance that I draw
the 2 the second time. Thus A and B are dependent.
Now suppose that I replace the ticket I got on the first draw and stir
the tickets again before drawing the second time. Then the chance that
I get the 1 on the first draw is 25%, the chance that I get the 2 on the
second draw is 25%, and the conditional chance that I get the 2 on the
second draw given that I drew the 1 the first time is also 25%. A and B
are thus independent if I draw with replacement.
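The two versions of the ticket example can be compared by listing the equally likely pairs of draws. This is a minimal sketch; the helper name `summarize` is an assumption for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

tickets = [1, 2, 3, 4]

def summarize(draws):
    """Return P(A), P(B), P(AB), and P(B|A) for equally likely (first, second) draws."""
    draws = list(draws)
    n = len(draws)
    p_a = Fraction(sum(1 for d in draws if d[0] == 1), n)     # 1 on the first draw
    p_b = Fraction(sum(1 for d in draws if d[1] == 2), n)     # 2 on the second draw
    p_ab = Fraction(sum(1 for d in draws if d == (1, 2)), n)  # both happen
    return p_a, p_b, p_ab, p_ab / p_a

# Without replacement: the second ticket must differ from the first (12 pairs).
without = summarize(permutations(tickets, 2))
# With replacement: any ticket can follow any ticket (16 pairs).
with_repl = summarize(product(tickets, repeat=2))

print(without)    # P(A) = 1/4, P(B) = 1/4, P(AB) = 1/12, P(B|A) = 1/3
print(with_repl)  # P(A) = 1/4, P(B) = 1/4, P(AB) = 1/16, P(B|A) = 1/4
```

Only with replacement does P(AB) equal P(A) × P(B) = 1/16, which matches the conclusion that A and B are independent in that case and dependent otherwise.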
Example: Two fair dice are rolled independently; one is blue,
the other is red. What is the chance that the number of spots that show
on the red die is less than the number of spots that show on the blue die?
Solution: The event that the number of spots that show on the
red die is less than the number that show on the blue die can be broken
up into mutually exclusive events, according to the number of spots that
show on the blue die. The chance that the number of spots that show on
the red die is less than the number that show on the blue die is the sum
of the chances of those simpler events. If only one spot shows on the blue
die, the number that show on the red die cannot be smaller, so the probability
is zero. If two spots show on the blue die, the number that show on the
red die is smaller if the red die shows exactly one spot. Because the number
of spots that show on the blue and red dice are independent, the chance
that the blue die shows two spots and the red die shows one spot is (1/6)(1/6)
= 1/36. If three spots show on the blue die, the number that show on the
red die is smaller if the red die shows one or two spots. The chance that
the blue die shows three spots and the red die shows one or two spots is
(1/6)(2/6) = 2/36. If four spots show on the blue die, the number that
show on the red die is smaller if the red die shows one, two, or three
spots; the chance that the blue die shows four spots and the red die shows
one, two, or three spots is (1/6)(3/6) = 3/36. Proceeding similarly for
the cases that the blue die shows five or six spots gives the ultimate
result:
P(red die shows fewer spots than the blue die) = 1/36 + 2/36 + 3/36
+ 4/36 + 5/36 = 15/36.
Alternatively, one could just count the ways: there are 36 possibilities,
which can be written in a square table:
           Blue Die
Red Die    1,1  1,2  1,3  1,4  1,5  1,6
           2,1  2,2  2,3  2,4  2,5  2,6
           3,1  3,2  3,3  3,4  3,5  3,6
           4,1  4,2  4,3  4,4  4,5  4,6
           5,1  5,2  5,3  5,4  5,5  5,6
           6,1  6,2  6,3  6,4  6,5  6,6
Each entry gives the number of spots on the red die, then the number on
the blue die. The outcomes above the diagonal comprise the event whose
probability we seek. There are 36 outcomes in all, of which 6 are on the
diagonal. Half of the remaining 36-6=30 are above the diagonal; half of 30
is 15. The 36 outcomes are equally likely, so the chance is 15/36. The
outcomes 1,4, 2,4, and 3,4 are one of the mutually exclusive pieces used in
the computation just above: the three ways the red die can show a smaller
number of spots than the blue die, when the blue die shows exactly 4 spots.
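Both ways of getting 15/36 can be confirmed by enumerating the 36 outcomes. This is a short sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (red, blue) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# The event: the red die shows fewer spots than the blue die.
favorable = [(red, blue) for red, blue in outcomes if red < blue]
p = Fraction(len(favorable), len(outcomes))

# The mutually exclusive pieces, grouped by what the blue die shows.
by_blue = {
    blue: sum(1 for _, b in favorable if b == blue) for blue in range(1, 7)
}

print(len(favorable), p)  # 15 5/12
print(by_blue)            # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5}
```

The counts in `by_blue` are the numerators 0, 1, 2, 3, 4, 5 of the terms summed in the first solution, and 5/12 is 15/36 in lowest terms.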
Hint: to solve this problem, you need to evaluate an expression of the
form
1 - (1-x)^n,
where x is nearly zero and n is very large. You can find the
answer approximately using the following result:
(1-x)^n = 1 + n×(-x) + (n×(n-1)/2)×(-x)^2 + . . . + nCk×(-x)^k
+ . . . + (-x)^n.
The function (1-x)^n is called a binomial; the fact
that the coefficient of (-x)^k in the expansion of (1-x)^n
is nCk is the reason that nCk
is sometimes called a binomial coefficient. When x is very small, x^2,
x^3, . . . are much smaller still (and, provided n×x is also small, they
get smaller faster than nCk grows),
so the terms involving higher powers of x than x^1 are effectively
negligible. That is, when x is nearly zero and n×x is small,
(1-x)^n is approximately 1 - n×x, so
1 - (1-x)^n is approximately n×x.
Using that approximation is equivalent to ignoring the possibility that
the sentence is typed more than once. The probability that the sentence
is typed more than once is tiny compared to the chance that the sentence
is typed exactly once, which is already quite small.
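To see how good the approximation is, here is a small sketch that compares the exact value with n×x. The values of x and n below are hypothetical, chosen only so that x is tiny and n×x is still small; they are not the numbers from the problem.

```python
from fractions import Fraction

x = Fraction(1, 10**9)  # hypothetical chance of success on a single try
n = 1000                # hypothetical number of independent tries

exact = 1 - (1 - x) ** n  # chance of at least one success, computed exactly
approx = n * x            # the approximation n×x

relative_error = (approx - exact) / exact

print(float(approx))          # 1e-06
print(approx > exact)         # True: n×x slightly overstates the chance
print(relative_error < Fraction(1, 10**6))  # True
```

The approximation is slightly too large because the largest term it drops, (n×(n-1)/2)×x^2, enters 1 - (1-x)^n with a minus sign.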
We can rearrange the definition of conditional
probability to solve for the probability that both A and B occur (that
AB occurs) in terms of the probability that B occurs and the conditional
probability of A given B:
P(AB) = P(A|B)×P(B).
This is called the multiplication
rule.
Example: A deck of cards is shuffled well, then two cards are
drawn. What is the chance that both cards are aces?
P(card 1 is an Ace and card 2 is an Ace) = P(card 2 is an Ace | card
1 is an Ace)×P(card 1 is an Ace)
= 3/51 × 4/52 = 0.5%.
You can see that the multiplication rule can save you a lot of time!
Example: Suppose there is a 50% chance that you catch the 8:00am
bus. If you catch the bus, you will be on time. If you miss the bus, there
is a 70% chance that you will be late. What is the chance that you will
be late?
Because you are on time whenever you catch the bus, you can be late only
if you miss it, so
P(late) = P(miss the bus and late)
= P(late|miss the bus) × P(miss the bus)
= 0.7 × 0.5 = 35%.
Example: Suppose that 10% of a given population has benign chronic
flatulence. Suppose that there is a standard screening test for benign
chronic flatulence that has a 90% chance of correctly detecting that one
has the disease, and a 10% chance of a “false positive” (erroneously reporting
that one has the disease when one does not). We pick a person at random
from the population (so that everyone has the same chance of being picked)
and test him/her. The test is positive. What is the chance that the person
has the disease?
Solution: We shall combine several things we have learned. Let
D be the event that the person has the disease, and T be the event that
the person tests positive for the disease. The problem statement told us
that:
- P(D) = 10%.
- P(T|D) = 90%.
- P(T|Dc) = 10%.
The problem asks us to find P(D|T) = P(DT)/P(T). We shall find P(T) by
breaking T into two mutually exclusive pieces, DT and DcT, corresponding
to testing positive and having the disease (DT) and testing positive falsely
(DcT). Then P(T) is the sum of P(DT) and P(DcT).
We will find those two probabilities using the multiplication rule. We
need P(DT) for the numerator, and it will be one of the terms in the denominator
as well. The probability of DT is, by the multiplication rule,
P(DT) = P(T|D) × P(D) = 90% × 10% = 9%.
The probability of DcT is, by the multiplication rule and
the complement rule,
P(DcT) = P(T|Dc) × P(Dc) = P(T|Dc)
× (100%- P(D) ) = 10% × 90% = 9%.
By the third axiom,
P(T) = P(DT) + P(DcT) = 9% + 9% = 18%,
because DT and DcT are mutually exclusive. Finally, plugging
in the definition of P(D|T) gives
P(D|T) = P(DT)/P(T) = 9%/18% = 50%.
Because only a small fraction of the population actually have benign
chronic flatulence, the chance that a positive test result for someone
selected at random from the population is a false positive is 50%, even
though the test is 90% accurate.
This problem illustrates Bayes’ Rule:
P(A|B) = P(B|A) × P(A) / ( P(B|A) × P(A) + P(B|Ac) × P(Ac) ).
The numerator on the right is just P(AB), computed using the multiplication
rule. The denominator is just P(B), computed by partitioning B into the
mutually exclusive sets AB and AcB, and finding the probability
of each of those pieces using the multiplication rule.
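Bayes' Rule translates directly into a short function. The sketch below applies it to the screening example; the function name `bayes` and its parameter names are assumptions for illustration, and the second call uses a hypothetical 1% prevalence that does not appear in the text.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_b_given_not_a, p_a):
    """Return P(A|B) from P(B|A), P(B|Ac), and P(A)."""
    p_ab = p_b_given_a * p_a                  # P(AB), by the multiplication rule
    p_not_a_b = p_b_given_not_a * (1 - p_a)   # P(AcB), using the complement rule
    p_b = p_ab + p_not_a_b                    # P(B), since AB and AcB are disjoint
    if p_b == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return p_ab / p_b

# The screening example: P(T|D) = 90%, P(T|Dc) = 10%, P(D) = 10%.
p_disease_given_positive = bayes(Fraction(9, 10), Fraction(1, 10), Fraction(1, 10))
print(p_disease_given_positive)  # 1/2

# Hypothetical variation: the same test when only 1% have the condition.
p_rarer = bayes(Fraction(9, 10), Fraction(1, 10), Fraction(1, 100))
print(p_rarer)  # 1/12
```

The second call suggests why the answer depends so strongly on P(A): with the same test, a rarer condition makes a positive result much less conclusive.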
Bayes’ Rule is useful to find the conditional probability of A given
B in terms of the conditional probability of B given A, which is the more
natural thing to measure in some problems. For example, in the disease-screening
problem just above, the natural way to calibrate a test is to see how well
it does at detecting a certain thing (e.g., a disease) when the
thing is present, and to see how poorly it does at raising false alarms
when the thing is not really present. These are, respectively, the conditional
probability of detecting the thing given that the thing is present,
and the conditional probability of incorrectly raising an alarm given that
the thing is not present. However, the interesting quantity for an individual
is the conditional chance that he or she has the disease, for example,
given that the test raised an alarm.