14 Probability distributions
Motivation for the “Inference II” part
In the “Data I” part we developed a language, that is, particular kinds of sentences, to approach inferences and probability calculations typical of data-science and engineering problems.
In the present part we focus on probability calculations that often occur with these kinds of sentences and data, and on how to represent such probabilities visually in useful ways.
Always keep in mind that at bottom we’re just using the four fundamental rules of inference over and over again – nothing more than that!
14.1 Distribution of probability among values
When an agent is uncertain about the value of a quantity, its uncertainty is expressed and quantified by assigning a degree of belief, conditional on the agent’s knowledge, to all the possible cases – all the possible values that could be the true one.
For a temperature measurement, for instance, the cases could be “The temperature is measured to have value 271 K”, “The temperature is measured to have value 272 K”, and so on up to 275 K. These cases are expressed by mutually exclusive and exhaustive sentences. Denoting the temperature with
The agent’s belief about the quantity is then expressed by the probabilities about these five sentences, conditional on the agent’s state of knowledge, which we may denote by the letter
Note that they sum up to one: 0.04 + 0.10 + 0.18 + 0.28 + 0.40 = 1.
This collection of probabilities is called a probability distribution, because we are distributing the probability among the possible alternatives.
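For instance, such a collection of probabilities can be stored in R as a named vector (the numbers used here are the ones appearing in the table of the next section), and we can check that it sums to one:

#### R code
## the agent's degrees of belief over the five temperature values
probs <- c(`271 K` = 0.04, `272 K` = 0.10, `273 K` = 0.18,
           `274 K` = 0.28, `275 K` = 0.40)
sum(probs)    # equals 1 (up to floating-point rounding)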
Let’s see how probability distributions can be represented and visualized for the basic types of quantities discussed in § 12.
We start with probability distributions over discrete domains.
14.2 Discrete probability distributions
Tables and functions
A probability distribution over a discrete domain can obviously be displayed as a table of values and their probabilities. For instance:
value | 271 K | 272 K | 273 K | 274 K | 275 K |
---|---|---|---|---|---|
probability | 0.04 | 0.10 | 0.18 | 0.28 | 0.40 |
In the case of ordinal or interval quantities it is sometimes possible to express the probability as a function of the value. For instance, the probability distribution above could be summarized by this function of the value
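As a sketch, here is one R function that reproduces the probabilities in the table above (the specific formula below is only an illustration, not necessarily the one intended):

#### R code
## one possible function giving the probability of each temperature value
## (x is the temperature in kelvin; only the values 271, ..., 275 are meant)
prob_temp <- function(x) {
    k <- x - 270
    (k^2 + 3 * k) / 100
}
prob_temp(271:275)
## [1] 0.04 0.10 0.18 0.28 0.40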
A graphical representation is often helpful to detect features, peculiarities, and even inconsistencies in one or more probability distributions.
Histograms and area-based representations
A probability distribution for a nominal, ordinal, or discrete-interval quantity can be neatly represented by a histogram.
The possible values are placed on a line. For an ordinal or interval quantity, the sequence of values on the line should correspond to their natural order. For a nominal quantity the order is irrelevant.
A rectangle is then drawn above each value. The rectangles might be contiguous or not. The bases of the rectangles are all equal, and the areas of the rectangles are proportional to the probabilities. Since the bases are equal, this implies that the heights of the rectangles are also proportional to the probabilities.
Such a drawing can of course be oriented horizontally, vertically, upside-down, and so on, depending on convenience.
Since the probabilities must sum to one, the total area of the rectangles represents a probability equal to 1. So in principle there is no need to write probability values on a vertical axis, grid, or similar visual device, because the probability of a value can be visually read as the ratio of its rectangle’s area to the total area. An axis or grid can nevertheless be helpful. Alternatively, the probabilities can be reported above or below each rectangle.
Nominal quantities do not have any specific order, so their values do not need to be ordered on a line. Other area-based representations, such as pie charts, can also be used for these quantities.
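As a small sketch in base R (using the temperature example from above), such a histogram-like representation can be drawn with barplot:

#### R code
## equal bases, heights (and hence areas) proportional to the probabilities
probs <- c(`271 K` = 0.04, `272 K` = 0.10, `273 K` = 0.18,
           `274 K` = 0.28, `275 K` = 0.40)
barplot(probs, space = 0, col = "lightblue",
        xlab = "temperature", ylab = "probability")
## for a nominal quantity, an area-based alternative is a pie chart:
## pie(probs)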
Line-based representations
Histograms give faithful representations of discrete probability distributions. Their graphical bulkiness, however, can be a disadvantage in some situations, for instance when we want to have a clearer idea of how the probability varies across values for ordinal or interval quantities; or when we want to compare several different probability distributions over the same values.
In these cases we can use standard line plots, or variations thereof. Consider the following example.
A technician wonders which component of a laptop failed first (only one can fail at a time), with seven possible alternatives:
Before examining the laptop, the technician’s belief about which component failed first is distributed among the seven alternatives as shown by the blue histogram with solid borders. After a first inspection of the laptop, the technician’s belief has a new distribution, shown by the red histogram with dashed borders:
It requires some concentration to tell the two probability distributions apart, for example to understand where their peaks are. Let us represent them by two line plots instead: solid blue with circles for the pre-inspection belief distribution, and dashed red with squares for the post-inspection one:
This line plot displays the differences between the two distributions more cleanly. We see that at first the technician most strongly believed the
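A comparison of this kind can be sketched in base R as follows (the component names and the two sets of probabilities below are invented purely for illustration):

#### R code
## two belief distributions over seven components (values are invented)
components <- c("screen", "keyboard", "battery", "RAM",
                "disk", "CPU", "motherboard")
pre  <- c(0.10, 0.05, 0.30, 0.10, 0.25, 0.05, 0.15)   # before inspection
post <- c(0.05, 0.05, 0.10, 0.05, 0.55, 0.05, 0.15)   # after inspection
plot(pre, type = "b", pch = 16, col = "blue", lty = "solid",
     ylim = c(0, 0.6), xaxt = "n",
     xlab = "component", ylab = "probability")
lines(post, type = "b", pch = 15, col = "red", lty = "dashed")
axis(1, at = seq_along(components), labels = components)
legend("topleft", legend = c("pre-inspection", "post-inspection"),
       col = c("blue", "red"), pch = c(16, 15), lty = c("solid", "dashed"))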
14.3 Probability densities
Distributions of probability over continuous domains present several counter-intuitive aspects, which essentially arise because we are dealing with uncountable infinities, while often using linguistic expressions that make sense only for countable ones. Here we follow a practical and realistic approach for working with such distributions.
Consider a quantity
1 more precisely the relative width
#### R code
## difference in 15th decimal digit
> 1.234567890123456 == 1.234567890123455
[1] FALSE
## difference in 16th decimal digit
> 1.2345678901234567 == 1.2345678901234566
[1] TRUE
The value 1.3 really represents a range between 1.29999999999999982236431605997495353221893310546875 and 1.300000000000000266453525910037569701671600341796875, this range coming from the internal binary representation of 1.3. Often the width
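Those two long decimal numbers appear to be the representable values adjacent to the stored 1.3; a quick way to check this (a small sketch, not part of the main text) uses .Machine$double.eps, the spacing between adjacent representable numbers in the interval [1, 2):

#### R code
## the two representable numbers adjacent to the stored value of 1.3;
## .Machine$double.eps (= 2^-52) is the spacing of doubles in [1, 2)
sprintf("%.51f", 1.3 - .Machine$double.eps)
sprintf("%.51f", 1.3 + .Machine$double.eps)
## the printed digits reproduce the two bounds quoted above
## (the last printed digits may depend on the platform's printf)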
When we consider a distribution of probability for a continuous quantity, the probabilities are therefore distributed among such small ranges, not among single values.
Since these ranges are very small, they are also very numerous. But the total probability assigned to all of them must still sum up to 1, so each individual range typically receives an extremely small probability.
It would be impractical to work with such small probabilities. We use probability densities instead. As implied by the term “density”, a probability density is the amount of probability in a small range divided by the width of that range,
which is a more convenient number to work with.
Probability densities are convenient because they usually do not depend on the range width: even if we consider a range twice as wide as before, the probability density is still the same (the probability in the range roughly doubles, but so does the width by which we divide it).
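A quick numerical sketch of this (using the standard normal distribution as a stand-in, not the quantity discussed in the text): dividing the probability in a small range by the range’s width gives nearly the same number for different small widths, and that number is the density:

#### R code
## probability in a small range around x = 1, divided by the range width,
## for two different widths (standard normal used only as an illustration)
x <- 1
(pnorm(x + 1e-6) - pnorm(x)) / 1e-6
(pnorm(x + 2e-6) - pnorm(x)) / 2e-6
dnorm(x)   # the density at x: (nearly) the same number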
In these notes we’ll denote probability densities with a lowercase
This definition works even if we don’t specify the exact value of the range width.
A helpful practice (though followed by few texts) is to always write a probability density as
where “
Physical dimensions and units
In the International System of Units (SI), “degree of belief” is considered to be a dimensionless quantity, or more precisely a quantity of dimension “1”. This is why we don’t write units such as “metres” (m) next to a probability value.
2 See also the material at the International Bureau of Weights and Measures (BIPM)
A probability density, however, is defined as the ratio of a probability amount and an interval of values of a quantity, and therefore it does have physical dimensions: the inverse of the dimensions of that quantity.
As another example, suppose that an agent with background knowledge
It is an error to report probability densities without their correct physical units. In fact, keeping track of these units is often useful for consistency checks and for finding mistakes, just as in other engineering or physics calculations.
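As an illustration with invented numbers: if the probability that a length lies in a particular interval of width 0.1 cm is 0.004, the corresponding density is

$$\frac{0.004}{0.1\ \text{cm}} = 0.04\ \text{cm}^{-1} = 4\ \text{m}^{-1},$$

so its numerical value changes with the unit of measurement (0.04 per centimetre versus 4 per metre), even though the probability 0.004 itself is a pure number.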
On the other hand, if we write probability densities as previously suggested, in this case as “
14.4 Representation of probability densities
Line-based representations
The histogram and the line representations become indistinguishable for a probability density.
If we represent the probability
The rectangles, however, are so thin (usually thinner than a pixel on a screen) that they appear as mere vertical lines, and together they look like a curve delimiting a coloured area. If we don’t colour the area underneath the curve, then we have a line-based, or rather curve-based, representation of the probability density.
As the width
Keep in mind that the curve representing the probability density is not quite a function; for this reason it’s best to call it a “density” rather than a “density function”. There are important reasons for keeping this distinction, which also have consequences for probability calculations, but we shall not delve into them for the moment.
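A sketch of this limiting picture in R (a standard normal density is used purely as an illustration): already with a moderately small width, the probability-divided-by-width rectangles are visually indistinguishable from the density curve:

#### R code
## thin rectangles of base delta and height (probability in range)/delta,
## drawn together with the density curve (standard normal as illustration)
delta <- 0.05
left <- seq(-4, 4 - delta, by = delta)
heights <- (pnorm(left + delta) - pnorm(left)) / delta
plot(left + delta/2, heights, type = "h", col = "grey",
     xlab = "value", ylab = "probability density")
curve(dnorm(x), add = TRUE, col = "blue", lwd = 2)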
Scatter plots
Line plots of a probability density are very informative, but they can also be slightly deceiving. Try the following experiment.
Consider a continuous quantity
We want to represent the amount of probability in any small range, say between
Suppose that we have 50 lines available to distribute this way. Where should we place them?
In a scatter plot, the probability density is (approximately) represented by the density of lines, points, or similar objects, as in the examples above (only one of them, though, correctly matches the density represented by the curve).
As the experiment and exercise above may have demonstrated, line plots sometimes give us slightly misleading ideas of how the probability is distributed across the domain. For example, peaks at some values can make us overestimate the amount of probability around those values. Scatter plots often give a less misleading representation of the probability density.
Scatter plots are also useful for representing probability densities in more than one dimension – sometimes even in infinite dimensions! They can moreover be easier to produce computationally than line plots.
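A scatter-type representation can be produced with a few lines of R (again a standard normal density is used purely as an illustration): the 50 randomly drawn tick marks are more crowded where the density is higher:

#### R code
## 50 tick marks whose crowding (approximately) reflects the density
set.seed(1)                     # only to make the sketch reproducible
samples <- rnorm(50)
plot(samples, rep(0, 50), pch = "|", ylim = c(0, 0.5),
     xlab = "value", ylab = "probability density")
curve(dnorm(x), add = TRUE, col = "blue", lwd = 2)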
@@ TODO Behaviour of representations under transformations of data.
14.5 Combined probabilities
A probability distribution is defined over a set of mutually exclusive and exhaustive sentences. In some inference problems, however, we do not need the probability of those sentences, but of some other sentence that can be obtained from them by an or operation. The probability of this sentence can then be obtained by a sum, according to the or-rule of inference. We can call this a combined probability. Let’s explain this procedure with an example.
Let’s go back to our initial assembly-line scenario from Ch. 1, where the inference problem was to predict whether a specific component would fail within a year or not. Consider the time when the component will fail (if it’s sold), and represent it by the quantity
which we can shorten to
Suppose that the inspection device – our agent – has internally calculated a probability distribution for
Their values are stored in the file failure_probability.csv and plotted in the histogram on the side.
What’s important for the agent’s decision about rejecting or accepting the component is not the exact time when it will fail, but only whether it will fail within the first year or not. That is, the agent needs the probability of the sentence formed by the or of the first 12 sentences expressing the values of
The probability needed by the agent is therefore
which can be calculated using the or-rule, considering that the sentences involved are mutually exclusive:
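This sum can be sketched in R as follows (the column name probability below is an assumption about the layout of failure_probability.csv, and the first 12 rows are taken to correspond to the first 12 months):

#### R code
## read the distribution of failure times and add up the probabilities
## of the first 12 values, i.e. failure within the first year
failure <- read.csv("failure_probability.csv")
sum(failure$probability[1:12])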