31  Implementing an OPM

Published: 2025-10-27

We now try to build up a real prototype AI agent from basic principles, using the formulae summarized in ch. 29 and in the previous chapter. By design, this agent is as close to optimal as theoretically possible; so let's call it an Optimal Predictor Machine, or OPM for short.

Before starting, let’s agree on some terminology so as not to get confused in the discussion below.


31.1 Desired characteristics of the OPM

We design our Optimal Predictor Machine with the following specific characteristics:

  • It handles variates of nominal type (§ 12.2).

  • It handles inferences and decisions about approximately infinite populations, and its beliefs about the population are exchangeable (ch.  25).

  • Its initial beliefs about the population frequencies are represented by a Dirichlet-mixture distribution (ch.  28).

  • Before deployment, it learns from a set of \(N\) units.

Example of what kind of agent we want

Let’s give an example of what we want our agent to be able to do. Suppose we have a population having three nominal variates  \(Z = (Y \mathbin{\mkern-0.5mu,\mkern-0.5mu}X \mathbin{\mkern-0.5mu,\mkern-0.5mu}W)\) (keep in mind that \(X\), \(Y\), \(W\) could each be a joint variate). Abbreviate the set of \(N\) training data as

\(\mathsfit{\color[RGB]{34,136,51}data}\coloneqq ( Z_{N}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z_{N} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\dotsb \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{2}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z_{2} \mathbin{\mkern-0.5mu,\mkern-0.5mu} Z_{1}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}z_{1} )\)

where \(z_N, \dotsc, z_2, z_1\) are specific values, stored in some training dataset; recall that \(Z\) denotes all (nominal) variates of the population. To simplify things, we assume that no values are missing.

We want an agent that can draw inferences like the following ones, as often as required:

  • \(P(X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\), \(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\), etc.: inference about a predictand variate, without knowledge of any predictors.

  • \(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): same but for any two predictand variates.

  • \(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): same but for all three variates.

  • \(P(X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): inference about any one predictand variate, given information about one predictor variate.

  • \(P(X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{} Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): same, but given information about any pair of predictors.

  • \(P(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}W\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso\nonscript\:\vert\nonscript\:\mathopen{}X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\dotso \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\): inference about any two predictand variates, given information about one predictor.

Note that we are not fixing beforehand which variates are predictands and which are predictors. Once the agent has learnt from the training data, we want to be able to change on the fly, at each new application, what the predictands are, and what the predictors are (if any).

Pause for a second and ponder the flexibility that we are requesting from our prototype agent! Virtually all present-day machine-learning algorithms work only one way: an algorithm designed to guess a label from some features cannot guess features from a label. Will we really manage to build an agent with the amazing versatility illustrated above?

31.2 Computations needed and computational challenges

The examples of requested inferences above show that the OPM agent must essentially use formulae (28.3)–(28.5) from § 28.3, which we repeat here:

 

\(\mathrm{P}(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y \nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{\color[RGB]{34,136,51}data}, \mathsfit{I}_{\textrm{d}}) = \frac{ \sum_{k= k_{\text{mi}}}^{k_{\text{ma}}} \Bigl(\tfrac{2^k}{M_Y} + \#y\Bigr) \cdot \operatorname{aux}(k) }{ \sum_{y}\sum_{k= k_{\text{mi}}}^{k_{\text{ma}}} \Bigl(\tfrac{2^k}{M_Y} + \#y\Bigr) \cdot \operatorname{aux}(k) }\)   (28.3)

\(\mathrm{P}(Y\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}y \nonscript\:\vert\nonscript\:\mathopen{} X\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}x \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}, \mathsfit{I}_{\textrm{d}}) = \frac{ \sum_{k= k_{\text{mi}}}^{k_{\text{ma}}} \Bigl(\tfrac{2^k}{M_Y \cdot M_X} + \#(y,x)\Bigr) \cdot \operatorname{aux}(k) }{ \sum_{y}\sum_{k= k_{\text{mi}}}^{k_{\text{ma}}} \Bigl(\tfrac{2^k}{M_Y \cdot M_X} + \#(y,x)\Bigr) \cdot \operatorname{aux}(k) }\)   (28.4)

with  \(\operatorname{aux}(k) \coloneqq \frac{ \prod_{x,y,w} \Bigl(\frac{2^{k}}{M} + \#(x,y,w) - 1\Bigr)! }{ \bigl(2^{k} + N\bigr)! } \cdot \frac{ \bigl(2^{k} -1 \bigr)! }{ {\Bigl(\frac{2^{k}}{M} - 1\Bigr)!}^M }\)   (28.5)

The values of \(\operatorname{aux}(k)\) can be calculated just once, when the OPM agent is built, and stored. Subsequently the agent will draw inferences by using (28.3) or (28.4) as needed. To use those formulae, the agent needs to store the counts \(\#(x,y,w,\dotsc)\), which it found in the training data, for all combinations of values \(x,y,w,\dotsc\).

This kind of storage and computation could be implemented in a straightforward way if we had unlimited storage and unlimited numerical precision. But in a real implementation we must face several difficulties. Here are the main difficulties and their solutions:

Finite precision
Owing to finite precision, the operations in the formulae may easily lead to overflow or underflow: large numbers are treated as infinity, and small non-zero numbers as 0. For instance this is what happens if we directly compute something like \((2^{10})! / (2^{10})!\), obviously equal to \(1\):
factorial(2^10) / factorial(2^10)
[1] NaN

One way to bypass this problem is by rewriting the formulae in ways that are mathematically equivalent but less prone to over- and under-flow. For example we can use identities like

\[ x / y = \exp\bigl(\ln x - \ln y\bigr)\ ,\quad x, y > 0 \ . \]

Now indeed it works; note that lfactorial() is log(factorial()) in R:

exp( lfactorial(2^10) - lfactorial(2^10) )
[1] 1

Another useful identity that avoids over- and under-flow, if \(\pmb{x}\) is a vector of positive numbers, is the following:

\[ \frac{\pmb{x}}{\operatorname{\texttt{sum}}(\pmb{x})} = \frac{ \operatorname{\texttt{exp}}\bigl(\operatorname{\texttt{log}}(\pmb{x}) - \operatorname{\texttt{max}}(\operatorname{\texttt{log}}(\pmb{x}))\bigr) }{\operatorname{\texttt{sum}}\bigl( \operatorname{\texttt{exp}}\bigl(\operatorname{\texttt{log}}(\pmb{x}) - \operatorname{\texttt{max}}(\operatorname{\texttt{log}}(\pmb{x}))\bigr) \bigr)} \]
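As a minimal sketch of how this identity is used, suppose we only have the logarithms of the components of \(\pmb{x}\) (the values below are made up), and that they are so negative that exponentiating them directly underflows to zero:

logx <- c(-1200, -1201, -1202)
exp(logx) / sum(exp(logx))
[1] NaN NaN NaN

Subtracting the maximum logarithm before exponentiating keeps all intermediate numbers within representable range, and the normalized vector is recovered correctly (rounded here to four decimals):

round( exp(logx - max(logx)) / sum(exp(logx - max(logx))) , 4)
[1] 0.6652 0.2447 0.0900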


Storage
With many variates and large domains, we may run out of memory in storing all possible counts \(\#(x,y,w,\dotsc)\). For instance, if we have twenty variates with four possible values each, we would need to store \(4^{20}\) integers, which would take more than 4 000 GB:
try( x <- integer(length = 4^20) )
Error : cannot allocate vector of size 4096.0 Gb

We can bypass this problem again by using smart mathematical manipulations. In the case of formulae (28.3)–(28.5), the product over all possible values \((x,y,w,\dotsc)\) can be rewritten as a product over the distinct values of the counts, which usually has far fewer factors. For example, if we have \(N=10000\) datapoints, and \(4^{20} - 2\) counts are equal to \(0\), while one count is equal to \(9000\) and another to \(1000\), then we only need to store these few distinct count values, together with their multiplicities, rather than \(4^{20}\) numbers!
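As a rough sketch of this idea (with made-up numbers; this is not the script's actual code), suppose we need the logarithm of the product \(\prod_{x,y,w} \bigl(\tfrac{2^k}{M} + \#(x,y,w) - 1\bigr)!\) appearing in (28.5), in the situation just described. We only store each distinct count value together with the number of cells having that value:

alpha <- 2^10                          # example value, stands for 2^k
M <- 4^20                              # total number of possible joint values
distinctcounts <- c(0, 1000, 9000)     # the distinct values taken by the counts
multiplicities <- c(M - 2, 1, 1)       # how many cells take each value
logproduct <- sum( multiplicities * lfactorial(alpha/M + distinctcounts - 1) )

The result is the same as summing lfactorial() over all \(4^{20}\) cells, but only a handful of numbers need to be stored.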


Speed
The formulae that the agent uses may involve sums over many terms, or repeated computations for many different variate values. Computation speed may therefore become an issue. There are two kinds of solutions to this problem. The first is, again, to use mathematical identities to rewrite the formulae in ways that require fewer computations. The second is to exploit computational features of the programming language used to code the agent, such as vectorization or parallel computing. In our case we shall use some R functions that perform computations in a vectorized way, that is, using fast underlying C or C++ implementations.
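As a small, generic illustration of vectorization (values made up, unrelated to the OPM formulae), here is the same sum computed with an explicit loop and with a single vectorized call; the second version delegates the looping to compiled code and is typically much faster:

x <- runif(1e6)
s <- 0
for (i in seq_along(x)) { s <- s + log1p(x[i]) }   # explicit loop: slow in R
s2 <- sum(log1p(x))                                # vectorized: fast
all.equal(s, s2)
[1] TRUE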


You see that mathematical “tricks” become very important when we must implement formulae with finite precision and limited memory. Unfortunately getting acquainted with such tricks requires a separate course.
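To see how these tricks come together, here is a rough sketch (not the actual code in OPM_nominal.R) of how the logarithm of \(\operatorname{aux}(k)\) in formula (28.5) could be computed with lfactorial(), so that no intermediate factorial overflows; since \(\operatorname{aux}(k)\) appears in both numerator and denominator of (28.3)–(28.4), only its relative values matter, and they can be handled with the normalization identity above:

logaux <- function(k, counts) {       # counts: array of #(x,y,w,...) from the data
    M <- length(counts)               # number of possible joint values
    N <- sum(counts)                  # number of training units
    sum( lfactorial(2^k / M + counts - 1) ) -
        lfactorial(2^k + N) +
        lfactorial(2^k - 1) -
        M * lfactorial(2^k / M - 1)
}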

For the extra curious

Note that the solutions just discussed are not approximations. Even though we use different mathematical formulae, they are still equivalent to the original ones. The internal logic of our OPM agent therefore remains fully correct. In other situations the mathematical or computational solutions above may not be enough, and then we may need to resort to approximations, as often happens with machine-learning algorithms.

Exercise 31.1

Try to prove some of the mathematical identities above.

31.3 The code

We implement the OPM agent and its inferences through three main functions. These and other helper functions are defined in the script OPM_nominal.R:

  • buildagent()
    creates an “agent” object that stores the built-in information and the information learned from the training data. The built-in information consists of the choice of the \(k_{\text{mi}}\) and \(k_{\text{ma}}\) parameters, the \(\operatorname{aux}(k)\) values, and the list of variates and of their domains (the agent knows what the possible values are before seeing any data). The learned information consists of the set of counts \(\#(x,y,\dotsc)\) for all joint values of the variates.

    We can use this function to build several agents, which differ in their background information or in the data they have learned from.

    We write this function with an option savememory for storing the learned information in a memory-efficient way, if needed.

  • infer()
    asks an agent to draw an inference for a new unit, specifying the desired predictands and the known predictors for the new unit, if any are available. It returns the agent’s degrees of belief about the predictands, using formulae (28.3)–(28.5).

    The formulae must be implemented in slightly different ways, depending on whether the learned information is stored in a memory-efficient way. For this reason we actually have two implementations of this function, called infer.agent() and infer.agentcompressed().

    This function can be used as often as we please, and with any agents we please; a short usage sketch is given after this list.

  • decide()
    asks an agent to make a decision, specifying the list of possible decisions, the predictands and their probabilities, and the utilities for the different decisions and outcomes. We shall discuss this function in more detail in ch. 37.
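Purely as an orientation, a typical workflow with the first two functions could look like the sketch below. The argument names and file names here are hypothetical, meant only to convey the idea; the actual interface is documented in the next chapter:

source('OPM_nominal.R')
## build an agent from (hypothetical) metadata and training files
agent <- buildagent(metadata = 'meta.csv', data = 'train.csv')
## ask for P(Y = ... | X = ..., data, I_d) for a new unit
infer(agent, predictand = 'Y', predictor = c(X = 'some value'))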
Exercise 31.2

Open the script OPM_nominal.R and locate the functions buildagent() and infer.agent(). In each, identify the lines of code that implement the various calculations discussed above. Note that alpha stands for \(2^k\), and counts stands for the array or list of counts \(\#(y,x,\dotsc)\).


Besides the three main functions above, we can write other functions that help us handle the inference task and calculate other quantities of interest (a brief usage sketch follows the list):

  • guessmetadata()
    takes a dataset and builds a preliminary metadata file, encoding the information about variates and domains guessed from the dataset. Typically we have to correct this preliminary file to include values that may be missing from the learning data.
  • rF()
    draws one or more possible full-population frequency distributions \(\boldsymbol{f}\), according to an agent's degree of belief \(\mathrm{p}(F\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}\boldsymbol{f}\nonscript\:\vert\nonscript\:\mathopen{} \mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\) updated after learning the data.
  • plotFsamples1D()
    plots, as a generalized scatter plot, the possible full-population marginal frequency distributions for a single (not joint) predictand variate. If required it also plots the final probability distribution obtained with infer().
  • mutualinfo()
    asks an agent to calculate the mutual information (§ 18.5) between any two sets of variates.
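Again purely as an indication (argument names here are guesses; see the next chapter for the actual documentation), these helpers could be used along these lines:

## build a preliminary metadata file from a (hypothetical) training dataset
guessmetadata(data = 'train.csv', file = 'meta.csv')
## draw 100 full-population frequency distributions believed possible by the agent
freqs <- rF(n = 100, agent = agent)
## mutual information between two sets of variates
mutualinfo(agent, A = 'X', B = c('Y', 'W'))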


In the next chapter we give some more documentation on these functions and on how to use them, and in ch.  33 we use them in a concrete task.