32 Prototype code and workflow
This chapter gives concise documentation of the R functions designed in ch. 31 and described in § 31.3, together with an example of how they are used in a task.
The functions are collected in https://github.com/pglpm/ADA511/blob/master/code/OPM_nominal.R
32.1 Function documentation
Optional arguments are written with = ..., which specifies their default values. Some additional optional arguments, mainly used for testing, are omitted in this documentation.
- guessmetadata(data, file = NULL)
  - Arguments:
    - data: either a string with the file name of a dataset in .csv format (with header line), or a dataset given as a data.frame object.
    - file: a string specifying the file name of the metadata file. If no file is given and data is a file name, then file will be the same name as data but with the prefix meta_. If no file is given and data is not a string, then the metadata are output to stdout.
  - Output:
    - either a .csv file containing the metadata, or a data.frame object as stdout.
- buildagent(metadata, data = NULL, kmi = 0, kma = 20, savememory = FALSE)
  - Arguments:
    - metadata: either a string with the name of a metadata file in .csv format, or metadata given as a data.frame.
    - data: either a string with the file name of a training dataset in .csv format (with header line), or a training dataset given as a data.frame.
    - kmi: the \(k_{\text{mi}}\) parameter.
    - kma: the \(k_{\text{ma}}\) parameter.
  - Output:
    - an object of class agent or agentcompressed, consisting of a list containing an array or list counts and three vectors alphas, auxalphas, palphas. The class agentcompressed saves the counts in a memory-efficient way.
- infer(agent, predictand, predictor = NULL)
  - Arguments:
    - agent: an agent object.
    - predictand: a vector of strings with the names of variates.
    - predictor: either a list of elements of the form variate = value, or a corresponding one-row data.frame.
  - Output:
    - the joint probability distribution \(\mathrm{P}(\mathit{predictand} \nonscript\:\vert\nonscript\:\mathopen{} \mathit{predictor}\mathclose{}\mathord{\nonscript\mkern 0mu\textrm{\small=}\nonscript\mkern 0mu}\mathopen{}{\small\verb;values;} \mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{\color[RGB]{34,136,51}data}\mathbin{\mkern-0.5mu,\mkern-0.5mu}\mathsfit{I}_{\textrm{d}})\) for all possible values of the predictands.
  - Notes:
    - If predictor is present, the agent is acting as a “supervised-learning” algorithm. Otherwise it is acting as an “unsupervised-learning” algorithm. The obtained probabilities could be used to generate a new unit similar to the ones observed.
    - The variate names in the predictand and predictor inputs must match some variate names known to the agent. Unknown variate names are discarded. The function gives an error if predictand and predictor have variates in common.
- decide(probs, utils = NULL)
  - Arguments:
    - probs: a probability distribution for one or more variates.
    - utils: a named matrix or array of utilities. The rows of the matrix correspond to the available decisions, the columns or remaining array dimensions correspond to the possible values of the predictand variates.
  - Output:
    - a list of elements EUs and optimal:
      - EUs is a vector containing the expected utilities of all decisions, sorted from highest to lowest
      - optimal is the decision having maximal expected utility; if several decisions share the maximum, one of them is selected with equal probability
  - Notes:
    - If utils is missing or NULL, an identity matrix of the form \(\begin{bsmallmatrix}1&0&\dotso\\0&1&\dotso\\\dotso&\dotso&\dotso\end{bsmallmatrix}\) is assumed (which corresponds to using accuracy as evaluation metric).
( Further documentation will be added )
- rF(n = 1, agent, predictand, predictor = NULL)
  - (generate population-frequency samples)
- mutualinfo(agent, A, B, base = 2)
  - (calculate mutual information)
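
The entry for mutualinfo gives only its purpose. As a minimal sketch, assuming the standard definition \(I(A;B) = \sum_{a,b} p(a,b)\,\log\frac{p(a,b)}{p(a)\,p(b)}\) and assuming that the joint distribution of A and B is already available as a matrix, the calculation could look as follows in base R. The function name mutualinfo_sketch and its joint argument are hypothetical; how the actual mutualinfo obtains the joint distribution from the agent is not shown here.

```r
# Hypothetical illustration; NOT the mutualinfo() function from OPM_nominal.R.
# 'joint' is a matrix of joint probabilities (or counts):
# rows = values of A, columns = values of B.
mutualinfo_sketch <- function(joint, base = 2) {
  joint <- joint / sum(joint)      # normalize, in case counts were given
  pA <- rowSums(joint)             # marginal distribution of A
  pB <- colSums(joint)             # marginal distribution of B
  indep <- outer(pA, pB)           # joint distribution if A and B were independent
  nz <- joint > 0                  # zero-probability cells contribute nothing
  sum(joint[nz] * log(joint[nz] / indep[nz], base = base))
}

# Two binary variates that always agree: knowing one fixes the other
joint_correlated <- diag(0.5, 2)
mi_correlated <- mutualinfo_sketch(joint_correlated)     # 1 bit

# Two independent variates: knowing one says nothing about the other
joint_independent <- outer(c(0.3, 0.7), c(0.4, 0.6))
mi_independent <- mutualinfo_sketch(joint_independent)   # approximately 0
```

With base = 2 the result is expressed in bits, which matches the default base = 2 in the documented signature.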
32.2 Typical workflow
The workflow discussed here is just a guideline and a reminder of important steps to take when applying an optimal agent to a given task. It cannot be more than a guideline, because each data-science and engineering problem is unique. Following some predefined, abstract workflow to the letter typically leads to sub-optimal results. Such results may be acceptable in unimportant problems, but not in important ones, such as medical problems, where people’s lives can be at stake.
We can roughly identify five main stages:
- Define the task
- In this stage we clarify what the task to be solved is – and why. Asking “why” often reveals the true needs and goals underlying the problem. If possible, the task is formalized. For example, the formal notions introduced in the parts Data I and Data II might be used: a specific statistical population is specified, with well-defined units and variates, and so on.
- Collect & prepare background info
- Background and metadata information, as well as auxiliary assumptions, are collected, examined, and prepared. Remember that this kind of information is required in order to make sense of the data (§ 24.3). In this stage we ask questions such as “Is our belief about the task exchangeable?”, “Can the statistical population be considered infinite?”, and similar questions that make clear which kinds of ready-made methods and approximations are acceptable. This stage also helps correct possible deficiencies in the training data collected in the next stage. For instance, some possible variate values might not appear in the training data, owing to their rarity in the statistical population.
In this stage it is especially important to specify:
- definition of units (what counts as “unit” and can be used as training data?)
- definition of variates and their domains
- initial probabilities
- possible decisions that may be required in the repeated task applications
- utilities associated with the decisions above
- Collect & prepare training data
- Units similar to the units of our future inferences, but about which we have more complete information, are collected and examined. These are the “training data”. They are used in the next stage to make the agent learn from examples. The problematic notion of similarity was discussed in § 20.2: what counts as “similar” is difficult to decide, and often we shall have to revise our decision. Sometimes no units satisfactorily similar to those of interest are available. In this case we must assess which of their informational relationships can be considered similar, which may lead us to use the agent in slightly different ways. We must also check whether training units with partially missing variates can be used by our agent or not.
- Prepare OPM agent
- The background information and training data (if any are available) are finally fed to the agent.
- Repeated application
- Inferences are drawn, and decisions made, for each new application instance. With our prototype agent, the inferences and the decisions can in principle be different from instance to instance.
Every new application can be broken down into several steps:
( Remaining steps to be added soon )
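
As an illustration of the decision part of an application instance, here is a minimal base-R sketch of expected-utility maximization, following the description of decide in § 32.1: rows of the utility matrix are decisions, columns are values of the predictand, and an identity matrix is assumed when no utilities are given. The function decide_sketch, the variate values, the decisions, and all numbers are hypothetical; in a real application the probabilities would come from infer, and this sketch handles only a single predictand variate.

```r
# Hypothetical illustration; NOT the decide() function from OPM_nominal.R.
# 'probs' is a named probability vector for one predictand variate.
# 'utils' is a named matrix: rows = decisions, columns = predictand values.
decide_sketch <- function(probs, utils = NULL) {
  if (is.null(utils)) {
    # Default: identity matrix, i.e. utility 1 for guessing the true value
    utils <- diag(length(probs))
    dimnames(utils) <- list(names(probs), names(probs))
  }
  # Expected utility of each decision: sum over values of utility * probability
  p <- probs[colnames(utils)]
  EUs <- setNames(as.vector(utils %*% p), rownames(utils))
  # Decisions with maximal expected utility; ties are broken uniformly at random
  best <- names(EUs)[EUs == max(EUs)]
  optimal <- best[sample.int(length(best), 1)]
  list(EUs = sort(EUs, decreasing = TRUE), optimal = optimal)
}

# Utilities chosen in the background-information stage (made-up numbers)
utils <- matrix(c(  0, -1,
                  -10,  0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("take_umbrella", "leave_umbrella"),
                                c("rain", "dry")))

# Two application instances with different inferred probabilities
res1 <- decide_sketch(c(rain = 0.30, dry = 0.70), utils)
res2 <- decide_sketch(c(rain = 0.05, dry = 0.95), utils)

# Without utilities, the most probable value is chosen
res3 <- decide_sketch(c(rain = 0.30, dry = 0.70))
```

In the first instance the expected utilities are -0.7 for take_umbrella and -3 for leave_umbrella, so take_umbrella is optimal. In the second they are -0.95 and -0.5, so leave_umbrella is optimal. The same utilities can thus lead to different decisions from instance to instance.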