20 Populations and variates
\[ \DeclarePairedDelimiters{\set}{\{}{\}} \DeclareMathOperator*{\argmax}{argmax} \]
20.1 Collections of similar quantities: motivation
In the last few chapters we gradually narrowed our focus to a particular kind of inference: inferences that involve collections of similar quantities (each of which can be simple, joint, or complex). “Similar” means that all such quantities have the same domain – for instance, each of them has possible values \(\set{{\small\verb;urgent;}, {\small\verb;non-urgent;}}\); or each of them has possible values between \(0\,\mathrm{\textcelsius}\) and \(100\,\mathrm{\textcelsius}\) – and they have a similar meaning and measurement procedure. They can be considered different “instances” of the same quantity, so to speak. We saw an example in the three-patient hospital scenario of § 17.3, with the three “urgency” quantities \(U_1\), \(U_2\), \(U_3\), corresponding to the urgency of three consecutive patients. Here are other examples:
- Stock exchange: We are interested in the daily change in closing price of a stock, over 1000 days. Each day the change can be positive (or zero), or negative.
- Mars prospecting: A robot examines 1000 similar-sized rocks in a large crater on Mars. Each rock either contains haematite, or it doesn’t.
- Glass forensics: A criminal forensics department has 215 glass fragments collected from many different crime scenes. Each fragment is characterized by a refractive index (between \(1\) and \(\infty\)), a percentage of calcium (between \(0\%\) and \(100\%\)), a percentage of silicon (ditto), and a type of origin (for example “from window of building”, “from window of car”, and similar).
It is easy to think of many other and very diverse examples, with even more complex variates, such as images or words. We shall now try to abstract and generalize this similarity.
20.2 Units, variates, statistical populations
Consider a large collection of entities that are somehow similar to one another. We call these entities units. These units could be, for instance:
- physical objects such as cars, windmills, planets, or rocks from a particular place;
- creatures such as animals of a particular species, or human beings, maybe with something in common such as geographical region; or plants of a particular kind;
- automatons having a particular application;
- software objects such as photos;
- abstract objects such as functions or graphs;
- the rolls of a particular die or the tosses of a particular coin;
- the weather conditions on several days.
These units are similar to one another in that they have some set of attributes common to all (the term “features” is frequently used in machine learning). These attributes can present themselves in a specific number of mutually-exclusive guises, which can be different from unit to unit. For instance, the attributes could be:
- “colour”, each unit being, say, green, blue, or yellow;
- “mass”, each unit having a mass between \(0.1\,\mathrm{kg}\) and \(10\,\mathrm{kg}\);
- “health condition”, each unit (an animal or human in this case) being healthy or ill; or maybe being affected by one of a specific set of diseases;
- containing something, for instance a particular chemical substance;
- “having a label”, each unit having one of the labels A, B, C;
- a complex combination of several simpler attributes like the ones above.
The units also have additional attributes (they must, otherwise we wouldn’t be able to distinguish each unit from all others), which we simply don’t consider or can’t measure. We’ll discuss this possibility later.
From this description it’s clear that the attributes of each unit are a (possibly joint) quantity, as defined in § 12.1.1. Once the units and their attributes are specified, we have a set of as many quantities as there are units. All these quantities have identical domains.
A quantity which is a collection of attributes of a set of units is called a variate. So when we speak about a variate it is understood that this is a quantity that appears, replicated, in some set of units.
We call a collection of units so defined a statistical population, or just population when there’s no ambiguity. The number of units is called the size of the population.
The notion of statistical population is extremely general and encompassing: many different things can be thought of as a population. In speaking of “data”, what is often meant is a particular statistical population. The specification of a population requires precision, especially when it is used to draw inferences, as we shall see later. A statistical population has not been properly specified until two things are precisely specified:
- a way to determine whether something is a unit or not: inclusion and exclusion criteria, means of collection, and so on
- a definition of the variate considered, its possible values, and how it is measured
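The two requirements above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class name `Population`, its fields, and the rock identifiers are hypothetical and do not come from the text. The point is that a population is pinned down by a unit criterion together with a variate definition (domain and measurement procedure).

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class Population:
    """Hypothetical container: a unit criterion plus one variate definition."""
    is_unit: Callable[[object], bool]      # inclusion/exclusion criterion
    domain: frozenset                      # possible values of the variate
    measure: Callable[[object], Hashable]  # how the variate is measured

    def collect(self, candidates):
        """Return {unit: variate value} for every candidate that is a unit."""
        values = {}
        for c in candidates:
            if not self.is_unit(c):
                continue  # not a unit: excluded from the population
            v = self.measure(c)
            if v not in self.domain:
                raise ValueError(f"value {v!r} outside the variate's domain")
            values[c] = v
        return values


# Made-up example in the spirit of the Mars-prospecting scenario:
# units are identifiers starting with "rock-"; the variate is "contains haematite".
rocks = Population(
    is_unit=lambda c: isinstance(c, str) and c.startswith("rock-"),
    domain=frozenset({"Y", "N"}),
    measure=lambda c: "Y" if c in {"rock-2"} else "N",  # stand-in for a real measurement
)

data = rocks.collect(["rock-1", "rock-2", "dust-1", "rock-3"])
size = len(data)  # the size of the population
```

Here `dust-1` fails the inclusion criterion, so it is not a unit and does not count towards the population size.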
20.3 Populations with joint variates
The definition of statistical population (§ 20.2) makes it clear that the quantity associated with each unit can be of arbitrary complexity. In particular it could be a joint quantity (§ 13.1), that is, a collection of quantities of a simpler type.
We saw an example at the beginning of this chapter, with a population relevant for glass forensics:
- units: glass fragments (collected at specific locations)
- variate: the joint variate \((\mathit{RI}, \mathit{Ca}, \mathit{Si}, \mathit{Type})\) consisting of four simple variates:
  - \(\mathit{R}\)efractive \(\mathit{I}\)ndex of the glass fragment (interval continuous variate), with domain from \(1\) (included) to \(+\infty\)
  - weight percent of \(\mathit{Ca}\)lcium in the fragment (interval discrete variate), with domain from \(0\) to \(100\) in steps of 0.01
  - weight percent of \(\mathit{Si}\)licon in the fragment (interval discrete variate), with domain from \(0\) to \(100\) in steps of 0.01
  - \(\mathit{Type}\) of glass fragment (nominal variate), with seven possible values: building_windows_float_processed, building_windows_non_float_processed, vehicle_windows_float_processed, vehicle_windows_non_float_processed, containers, tableware, headlamps
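As a rough sketch of what “domain of a joint variate” means in practice, the four domains listed above can be written as a membership check. The names `Glass`, `GLASS_TYPES`, and `in_domain` are hypothetical, and the numeric tolerance used for the 0.01 grid is an assumption made only to cope with floating-point rounding.

```python
import math
from typing import NamedTuple

# The seven possible values of the nominal variate Type, as listed in the text.
GLASS_TYPES = frozenset({
    "building_windows_float_processed",
    "building_windows_non_float_processed",
    "vehicle_windows_float_processed",
    "vehicle_windows_non_float_processed",
    "containers",
    "tableware",
    "headlamps",
})


class Glass(NamedTuple):
    """One value of the joint variate (RI, Ca, Si, Type)."""
    RI: float
    Ca: float
    Si: float
    Type: str


def on_percent_grid(x: float) -> bool:
    """True if x lies between 0 and 100 on a grid with step 0.01."""
    scaled = x * 100
    return 0 <= x <= 100 and math.isclose(scaled, round(scaled), abs_tol=1e-6)


def in_domain(v: Glass) -> bool:
    """True if every component lies in the domain of its simple variate."""
    return (
        v.RI >= 1 and math.isfinite(v.RI)  # from 1 (included) upwards
        and on_percent_grid(v.Ca)
        and on_percent_grid(v.Si)
        and v.Type in GLASS_TYPES
    )


# Made-up values, for illustration only.
ok = in_domain(Glass(1.52, 8.50, 72.00, "containers"))
bad_type = in_domain(Glass(1.52, 8.50, 72.00, "mirror"))
```

A joint value belongs to the joint domain only if each of its components belongs to the corresponding simple domain, which is why the check is a conjunction.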
Here is a table with the values of the joint variate \((\mathit{RI}, \mathit{Ca}, \mathit{Si}, \mathit{Type})\) for ten units:
unit | \(\mathit{RI}\) | \(\mathit{Ca}\) | \(\mathit{Si}\) | \(\mathit{Type}\) |
---|---|---|---|---|
1 | \(1.51888\) | \(9.95\) | \(72.50\) | tableware |
2 | \(1.51556\) | \(9.41\) | \(73.23\) | headlamps |
3 | \(1.51645\) | \(8.08\) | \(72.65\) | building_windows_non_float_processed |
4 | \(1.52247\) | \(9.76\) | \(70.26\) | headlamps |
5 | \(1.51909\) | \(8.78\) | \(71.81\) | building_windows_float_processed |
6 | \(1.51590\) | \(8.22\) | \(73.10\) | building_windows_non_float_processed |
7 | \(1.51610\) | \(8.32\) | \(72.69\) | vehicle_windows_float_processed |
8 | \(1.51673\) | \(8.03\) | \(72.53\) | building_windows_non_float_processed |
9 | \(1.51915\) | \(10.09\) | \(72.69\) | containers |
10 | \(1.51651\) | \(9.76\) | \(73.61\) | headlamps |
The joint-variate value for unit 4, for instance, is
\[ \mathit{RI}_{4} = 1.52247 \land \mathit{Ca}_{4} = 9.76 \land \mathit{Si}_{4} = 70.26 \land \mathit{Type}_{4} = {\small\verb;headlamps;} \]
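The table above can be held in code as a mapping from unit to joint-variate value, which makes “as many quantities as there are units” tangible. This is a minimal sketch: the variable names are hypothetical, and the dictionary layout is just one possible representation, not one prescribed by the text.

```python
from collections import Counter

# (unit, RI, Ca, Si, Type) rows copied from the table above.
ROWS = [
    (1, 1.51888, 9.95, 72.50, "tableware"),
    (2, 1.51556, 9.41, 73.23, "headlamps"),
    (3, 1.51645, 8.08, 72.65, "building_windows_non_float_processed"),
    (4, 1.52247, 9.76, 70.26, "headlamps"),
    (5, 1.51909, 8.78, 71.81, "building_windows_float_processed"),
    (6, 1.51590, 8.22, 73.10, "building_windows_non_float_processed"),
    (7, 1.51610, 8.32, 72.69, "vehicle_windows_float_processed"),
    (8, 1.51673, 8.03, 72.53, "building_windows_non_float_processed"),
    (9, 1.51915, 10.09, 72.69, "containers"),
    (10, 1.51651, 9.76, 73.61, "headlamps"),
]

# One joint-variate value per unit, keyed by the unit's number.
population = {
    unit: {"RI": ri, "Ca": ca, "Si": si, "Type": glass_type}
    for unit, ri, ca, si, glass_type in ROWS
}

unit4 = population[4]  # the joint-variate value for unit 4

# How often each value of the simple variate Type occurs among the ten units.
type_counts = Counter(value["Type"] for value in population.values())
```

Looking up `population[4]` returns the same four component values as the conjunction written out for unit 4.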