MDL, pt 1

We start a series of posts introducing MDL: Minimum Description Length and the MDL Principle; a theory of inductive inference.

The Minimum Description Length (MDL) Principle is a theory of inductive inference applicable to general problems in statistics, machine learning, and pattern recognition. Roughly, it states: "the best explanation for a given set of data is provided by the shortest description of that data" [Grünwald 19]. Various phenomena have been examined from an MDL perspective, including aspects of deep learning's recent success. In this blog post, I begin my effort to understand MDL better by summarizing the first few sections of Grünwald's Minimum Description Length Revisited [Grünwald 19]. I will write further posts on this paper and other sources as time goes on.

MDL originated in data compression: it was introduced by Jorma Rissanen in his 1978 paper Modeling by Shortest Data Description, and the theory has undergone many developments in the decades since [Rissanen 78]. Grünwald presents the theory as an extension of both penalized-likelihood and Bayesian approaches to model selection, model averaging, and hypothesis testing [Grünwald 19].

MDL has historically been tricky to apply: it required a working knowledge of both statistics and information theory, and its methods seemed impractical for big data or arbitrary parameter spaces. Grünwald's overview highlights recent developments that help overcome these issues, and it presents the MDL Principle accessibly and, novelly, independent of information theory [Grünwald 19].

We will be interested in codes, which encode data losslessly. Naturally, we hope to compress the data into as short a description as possible.

Please read the section on Notation below before continuing.

Notation

A Statistical Model is a family of Probability Distributions; written out:

\[\mathcal{M} = \{ p_\theta : \theta \in \Theta\},\]

where each \(p_\theta\) represents a probability density function (pdf) or probability mass function defined on a sequence of arbitrary length. We may also label the corresponding probability distribution by \(p_\theta\).

We may even collect statistical models into Families of Statistical Models:

\[\{\mathcal{M}_\gamma : \gamma \in \Gamma\},\]

where each \(\mathcal{M}_\gamma\) is a statistical model with its own distributions \(p_\theta\), indexed by some \(\Theta_\gamma\).

Statistical models attempt to model data \(z^n := (z_1, ..., z_n) \in \mathcal{Z}^n\), where the data came from the outcome space \(\mathcal{Z}\).

If the components of a piece of data \(z^n\) are independent and identically distributed (i.i.d.) according to a distribution \(p_\theta\), then we may compute the probability assigned to the data component-wise:

\[p_\theta(z^n) = \prod_{i=1}^n p_{\theta}(z_i).\]
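As a quick sketch of this component-wise computation (a Bernoulli pmf is assumed here purely to make the example concrete; any per-component distribution would do):

```python
# Illustrative per-component pmf; Bernoulli(theta) is an assumption,
# chosen only to make the sketch concrete.
def component_pmf(z, theta):
    return theta if z == 1 else 1.0 - theta

def iid_probability(zn, theta):
    # p_theta(z^n) = product over i of p_theta(z_i)
    p = 1.0
    for z in zn:
        p *= component_pmf(z, theta)
    return p

zn = [1, 0, 1, 1]
prob = iid_probability(zn, 0.6)  # 0.6 * 0.4 * 0.6 * 0.6
```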

The maximum likelihood (ML) estimator given model \(\mathcal{M} = \{ p_\theta : \theta \in \Theta\}\) is denoted \(\hat{\theta}_{\text{ML}}\) (given it exists and is unique). When dealing with an indexed \(\mathcal{M}_{\gamma}\), the notation becomes \(\hat{\theta}_{\text{ML} \mid \gamma}\). Other estimators will be denoted by \(\breve{\theta}\), and the MDL estimator with luckiness function \(v\) by \(\hat{\theta}_v\).

Motivation

Given a model \(\mathcal{M} = \{ p_\theta : \theta \in \Theta\}\), let's fix a piece of data \(z^n\) that this model might try to fit. We start by defining the index of the optimal distribution within the model for describing the data (assuming existence and uniqueness):

\[\hat{\theta}_{\text{ML}}(z^n) := \argmax\limits_{\theta \in \Theta} p_\theta(z^n).\]

We call this choice of \(\theta\) the Maximum Likelihood (ML) estimator for \(\mathcal{M}\) and \(z^n\).

Plugging in this \(\hat{\theta}_{\text{ML}}(z^n)\) as an index to a distribution in \(\mathcal{M}\), we get a numerical measure of how "fit" the model is to the data:

\[F(z^n) := p_{\hat{\theta}_{\text{ML}}(z^n)}(z^n),\]

or the likelihood assigned to the data by the best-fitting distribution within the model.
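To make \(F\) concrete, here is a minimal sketch for the Bernoulli model (treated formally later in the post), where \(\hat{\theta}_{\text{ML}}(z^n) = n_1/n\) in closed form; the grid search is only a sanity check:

```python
def bernoulli_likelihood(zn, theta):
    n1 = sum(zn)
    n0 = len(zn) - n1
    return theta**n1 * (1.0 - theta)**n0

def ml_estimate(zn):
    # Closed-form ML estimator for the Bernoulli model: fraction of ones.
    return sum(zn) / len(zn)

def fitness(zn):
    # F(z^n): likelihood under the best-fitting distribution in the model.
    return bernoulli_likelihood(zn, ml_estimate(zn))

zn = [1, 1, 0, 1]
theta_hat = ml_estimate(zn)  # 0.75
F = fitness(zn)

# Sanity check: no theta on a fine grid over [0, 1] beats theta_hat.
grid_best = max(bernoulli_likelihood(zn, k / 1000) for k in range(1001))
```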

How does this measure of model fitness behave?

  • Enlarging the model to contain more candidate distributions gives the model more fitting power, which means \(F(z^n)\) can only increase.
  • One might enlarge the model so much that for each \(z^n\) in the outcome space, \(\mathcal{M}\) contains a distribution \(p\) for which \(p(z^n) = 1\). In this case, \(F(z^n) \equiv 1\) on all data.

Across a whole family \((\mathcal{M}_\gamma)_{\gamma \in \Gamma}\) of models with discrete data, denoting by \(\hat{\theta}_{\text{ML} \mid \gamma}\) their respective maximum likelihood estimators and by \(F_\gamma(z^n)\) their fits to the data, we might ask which model is the fittest for a given piece of data.

  • Overfitting (i.e., a model being particularly specialized to the given data yet unable to generalize its fit to further data) can occur when our rule for selecting \(\gamma\) simply maximizes \(F_\gamma(z^n)\) over all \(\gamma\).

Overfitting is avoided by taking into account the complexity of the model itself: not only do we want a model that explains the data well, but also one that does so in the simplest way possible. Perhaps the complexity of a model is measured by how many distributions it contains, or by some other metric.

This way of selecting a model most fit for a piece of data takes a maximum across all distributions found within each model. Universal Modeling, the subject of the next section, introduces a method of model comparison via associating to each model \(\mathcal{M}_\gamma\) some "universal" distribution \(\bar{p}_\gamma(z^n)\) which is computed from the distributions found within that model. This associated distribution is used as the representative of that model to attempt to explain arbitrary pieces of data. So rather than compare models' fitness to a piece of data by checking how all of their included distributions weight the piece of data, universal modeling compares models by checking how their associated universal distributions weight the piece of data. In symbols, we redefine

\[F_\gamma(z^n) := \bar{p}_\gamma(z^n).\]

In this way, we also see that the total probability mass on all potential outcomes \(z^n\) sums to \(1\), as \(\bar{p}\) is designed to be a probability distribution itself. This prevents assigning overly high fit \(F_\gamma (z^n)\) to overly many data sequences: no matter what our method for constructing \(\bar{p}_\gamma\), we have \(\sum_{z^n} F_\gamma(z^n) = 1\), so a good fit on some \(z^n\) necessarily implies a worse fit on other data. This means that we will not select a model simply because it contained some spurious distribution that fit our sampling of data particularly well. This shows the power of universal modeling's method of measuring fitness to prevent overfitting.

According to Grünwald, the argument for measuring a piece of data's fit relative to a model using a single distribution mimics the Bayesian Occam's Razor argument for using the Bayes marginal distribution (shown later in this post), but MDL does not restrict us to universal distributions of the Bayesian form [Rasmussen and Ghahramani 00] [Grünwald 19].

Universal Modeling

The following is an introduction to MDL through model comparison. For a countable family of statistical models \(\mathcal{M}_1, \mathcal{M}_2, \ldots\) indexed by \(\gamma \in \Gamma\), we wish to adopt a rule associating to each model a single distribution:

\[\text{\textbf{Association Rule}}: \mathcal{M}_\gamma \rightsquigarrow \bar{p}_\gamma\]

which we'd call the universal distribution related to \(\mathcal{M}_\gamma\). This \(\bar{p}_\gamma\) need not be a member of \(\mathcal{M}_\gamma\).

With this, we'd call \(-\log \bar{p}_\gamma (z^n)\) the code length of the tuple of data \(z^n \in \mathcal{Z}^n\) under this universal distribution/code. Assume for generality that there is a weight distribution \(\pi\) on the indices \(\Gamma = \{1, 2, ..., \gamma_{\text{max}}\}\). Note that \(\gamma_{\text{max}} = \infty\) is allowed.

The standard choice is to deem whichever model \(\mathcal{M}_\gamma\) minimizes the sum of two code lengths, that of the index \(\gamma\) under \(\pi\) and that of the data under the model's universal code, i.e.,

\[\mathcal{M}_{\argmin_{\gamma \in \Gamma} \left[-\log \pi(\gamma) - \log \bar{p}_\gamma(z^n)\right]},\]

the best explanation of the given data \(z^n\). (This is mathematically equivalent to maximizing \(\pi(\gamma) \cdot \bar{p}_\gamma(z^n)\)).
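A minimal sketch of this selection rule; the two universal-distribution values below are made-up numbers standing in for \(\bar{p}_\gamma(z^n)\):

```python
import math

# Hypothetical values of p-bar_gamma(z^n) for two candidate models (made
# up purely for illustration), with a uniform weight distribution pi.
p_bar = {1: 0.02, 2: 0.008}
pi = {1: 0.5, 2: 0.5}

def total_code_length(gamma):
    # -log pi(gamma) - log p-bar_gamma(z^n): cost of naming the model,
    # plus the code length of the data under that model's universal code.
    return -math.log(pi[gamma]) - math.log(p_bar[gamma])

best_gamma = min(pi, key=total_code_length)
```

With equal index weights, the model whose universal distribution assigns the data more probability wins, exactly as in the equivalent maximization of \(\pi(\gamma) \cdot \bar{p}_\gamma(z^n)\).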

We can explore what happens under various association rules. Fixing a particular association rule, one may compare different models on how well their associated universal distributions describe given data while under the constraint of minimizing code lengths.

Bayesian Universal Distribution

The following rule for associating to a statistical model its universal distribution is known as the Bayes Factor Method. Fix some probability density \(w_\gamma\) on the parameters in \(\Theta_\gamma\); this free choice is called a prior. Next, we define the Bayesian marginal distribution relative to this choice:

\[p^{\text{BAYES}}_\gamma(z^n) := p_{w_\gamma}^{\text{BAYES}}(z^n) = \int p_\theta(z^n) w_\gamma(\theta) d\theta\]

as our \(\bar{p}_\gamma\). Remember, the point of our association rule is to summarize a model \(\mathcal{M}_\gamma = \{ p_\theta : \theta \in \Theta_\gamma\}\) by a single, universal distribution \(\bar{p}_\gamma\). The above Bayesian marginal distribution technique produces a probability density/mass function which assigns density/mass to a data tuple by taking a weighted average (with respect to the prior \(w_\gamma\)) of how much each of the \(p_\theta\) from \(\mathcal{M}_\gamma\) assign to that data tuple. This is a natural rule that takes into account all of the pdfs present in the statistical model. In summary, this section's association rule looks as follows:

\[\mathcal{M}_\gamma \rightsquigarrow p_{w_{\gamma}}^{\text{BAYES}}.\]

Example: Here we describe the Bernoulli model under the Bayesian marginal distributions association rule:

Fix \(n \in \mathbb{N}\). Let \(\mathcal{M}_{\text{B}} := \{ p_\theta : \theta \in [0,1]\}\) represent the statistical Bernoulli model composed of all Bernoulli measures on \(n\)-tuples, extending the length-\(1\) measures by independence. In particular, for any \(z^n \in \{0, 1\}^n\), the Bernoulli \(\theta\)-measure assigns mass

\[p_\theta(z^n) = \theta^{n_1} \cdot (1 - \theta)^{n_0},\]

where \(n_i\) is the number of \(i\)'s found in the coordinates of \(z^n\).

Suppose we choose \(w(\theta) \propto \theta^\alpha \cdot (1 - \theta)^\beta\) for some \(\alpha, \beta \in \mathbb{R}\). This means our Bayesian marginal distribution takes the form:

\[p_{\text{B}, w}^{\text{BAYES}}(z^n) \propto \int \theta^{\alpha + n_1} \cdot (1 - \theta)^{\beta + n_0} d\theta,\]

which is not a member of \(\mathcal{M}_{\text{B}}\): it is a mixture of Bernoulli distributions, not a Bernoulli distribution itself.
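For priors of this Beta form the integral has a closed form in terms of the Beta function \(B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)\): with the prior normalized, \(p^{\text{BAYES}}(z^n) = B(\alpha + n_1 + 1, \beta + n_0 + 1) / B(\alpha + 1, \beta + 1)\). A sketch verifying this against a brute-force Riemann sum (assuming \(\alpha, \beta > -1\) so the prior is proper):

```python
import math

def log_beta(a, b):
    # log B(a, b) via log-gamma, for numerical stability
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bayes_marginal(n1, n0, alpha, beta):
    # Closed form: B(alpha + n1 + 1, beta + n0 + 1) / B(alpha + 1, beta + 1)
    return math.exp(log_beta(alpha + n1 + 1, beta + n0 + 1)
                    - log_beta(alpha + 1, beta + 1))

def bayes_marginal_numeric(n1, n0, alpha, beta, steps=20000):
    # Midpoint Riemann sum of the integral of theta^n1 (1-theta)^n0 w(theta),
    # with w(theta) proportional to theta^alpha (1-theta)^beta.
    h = 1.0 / steps
    norm = integral = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        w = t**alpha * (1.0 - t)**beta
        norm += w * h
        integral += t**n1 * (1.0 - t)**n0 * w * h
    return integral / norm

closed = bayes_marginal(3, 2, 0.0, 0.0)   # uniform prior: B(4, 3) = 1/60
numeric = bayes_marginal_numeric(3, 2, 0.0, 0.0)
```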

We skip the motivation of this prior, postponing it to a later blog post.

Example: Now we present the Gaussian location statistical model.

Fix \(n \in \mathbb{N}\) and variance \(\sigma^2\) over data domain \(\mathcal{Z} = \mathbb{R}\). Let \(\mathcal{M}_{\text{GAUSS}}\) be the Gaussian location statistical model, or the family of

\[p_\theta(z^n) \propto \exp \left(- \sum_{i = 1}^n \frac{(z_i - \theta)^2}{2 \sigma^2}\right).\]

Our association rule is again the Bayesian one from the previous example. It is typical here to take the prior to be uniform: \(w(\theta) \equiv 1\). Used directly, this would not lead to a convergent integral (i.e., the prior is improper), so we instead ask a weaker property of a statistical model \(\mathcal{M}_\gamma\) in order to define a useful universal distribution in the Bayesian way: namely, that after some initial number \(m\) of observations, the Bayesian posterior \(w_\gamma(\theta \mid z^m)\) is proper (integrable). If so, we may set \(\bar{p}\) for \(\mathcal{M}_{\gamma}\) to be:

\[p^{\text{BAYES}}_\gamma(z^n) := p^{\text{BAYES}}_\gamma(z_{m + 1}, ..., z_n \mid z^m) \cdot p_0(z^m) := p_0(z^m) \int p_\theta(z_{m + 1}, ..., z_n) w_\gamma(\theta \mid z^m) d\theta,\]

where \(p_0\) is an arbitrary distribution on \(\mathcal{Z}^m\). The Gaussian model \(\mathcal{M}_{\text{GAUSS}}\) is an example of a model for which this can be done.

If we wish to perform a model selection across many different \(\mathcal{M}_\gamma\), we ask that the same \(m\) work for all, and apply the same \(p_0\) to all. We note that the choice of \(p_0\) will not affect the outcome of the argmin for best estimation.

Normalized (Penalized) Maximum Likelihood

Also known as the Shtarkov distribution, the Normalized Maximum Likelihood (NML) distribution is a second example of an association rule taking models to single, universal distributions. Again, assume we have a statistical model \(\mathcal{M} = \{ p_\theta : \theta \in \Theta\}\).

We start by fixing a function \(v : \Theta \to \mathbb{R}^+_0\) called the luckiness function. We will see that there is one restriction on this function.

We define the NML distribution as follows:

\[p^{\text{NML}}_{v}(z^n) := \frac{\max\limits_{\theta \in \Theta} [p_\theta(z^n) \cdot v(\theta)] }{\int \max\limits_{\theta \in \Theta} [p_\theta(\zeta^n) \cdot v(\theta)] d\zeta^n}.\]

Actually, these maxima might not exist, so in general one should write suprema. The denominator's only role is to normalize \(p^{\text{NML}}_{v}\); let \(T\) denote the value of the integral in the denominator, which we require to be finite. This is the one restriction on \(v\) mentioned above. In particular, if \(v\) is constant, the NML universal distribution reduces to:

\[p^{\text{NML}}_{v}(z^n) = \frac{1}{T} \cdot p_{\hat{\theta}_{\text{ML}}(z^n)} (z^n),\]

where \(\hat{\theta}_{\text{ML}}(z^n)\) denotes the maximum-likelihood (ML) estimate for the given \(\mathcal{M}\) at the given data \(z^n\), which we assume to exist and be unique.

Taking a logarithm of the denominator, we obtain what is termed the model complexity of \(\mathcal{M}\):

\[\texttt{COMP}(\mathcal{M}; v) := \log T = \log \int \max\limits_{\theta \in \Theta} [p_\theta(\zeta^n) \cdot v(\theta)] d\zeta^n,\]

which reduces to \(\log \int p_{\hat{\theta}_{\text{ML}}(\zeta^n)}(\zeta^n) d\zeta^n\) whenever \(v\) is constant. Why do we refer to this as a measure of complexity? In the case that \(v \equiv 1\), the value of \(\texttt{COMP}(\mathcal{M}; 1)\) can increase like the logarithm of how many distributions are found in \(\mathcal{M}\). The luckiness function \(v\) simply biases the probabilities within the maximum term.

Because we only require that the luckiness function \(v\) be chosen so as to make \(T\) finite, \(v\) need not be integrable, and hence need not be a probability density function itself.

We further define the MDL estimator based on \(v\) as follows (assuming it exists):

\[\hat{\theta}_v(z^n) := \argmax\limits_{\theta \in \Theta} p_\theta(z^n) \cdot v(\theta) = \argmin\limits_{\theta \in \Theta} \left[ -\log p_\theta(z^n) - \log v(\theta) \right].\]

This is the distribution index \(\theta\) optimizing the cost, whether written in product or log form. We view the \(v\)-MDL estimator \(\hat{\theta}_v\) as a penalized ML estimator, maximizing \(p_\theta(z^n)\) subject to a cost of \(-\log v(\theta)\). If \(v\) is constantly equal to \(1\), we are in the "unpenalized" case, which was the original context in which Rissanen first studied MDL [Rissanen 96].
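A grid-search sketch of \(\hat{\theta}_v\) for the Bernoulli model; the particular luckiness function below is an arbitrary choice for illustration:

```python
import math

def neg_log(x):
    return float("inf") if x == 0.0 else -math.log(x)

def mdl_estimate(zn, v, grid=10000):
    # argmin over theta of [-log p_theta(z^n) - log v(theta)],
    # approximated on a grid over Theta = [0, 1]
    n1 = sum(zn)
    n0 = len(zn) - n1
    best_t, best_cost = None, float("inf")
    for k in range(grid + 1):
        t = k / grid
        cost = neg_log(t**n1 * (1.0 - t)**n0) + neg_log(v(t))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

zn = [1, 1, 1, 0]
# v constantly 1 recovers the unpenalized ML estimate n1/n = 0.75
theta_unpenalized = mdl_estimate(zn, lambda t: 1.0)
# a luckiness function favoring theta near 1/2 pulls the estimate inward
theta_penalized = mdl_estimate(zn, lambda t: math.exp(-abs(t - 0.5)))
```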

Despite how general our choice of luckiness function may be, it is common practice to choose a smooth \(v\), so that for large samples \(\hat{\theta}_v\) is essentially indistinguishable from the unpenalized \(\hat{\theta}_{\text{ML}}\). Note that while the integral defining \(T\) is likely to converge over bounded outcome spaces (i.e., when the domain of the data is finite in cardinality or size), it is tricky to keep bounded over unbounded outcome spaces. A non-uniform choice of \(v\) therefore helps extend our study to unbounded outcome spaces, while \(v \equiv 1\) is usually appropriate in the bounded case.

Now, turning to a family of statistical models, if we adopt the association rule provided by NML:

\[\mathcal{M}_\gamma \rightsquigarrow p^{\text{NML}}_{v_\gamma},\]

and assume a uniform distribution \(\pi\) over finite \(\Gamma\), the best explanation (see the definition above) for given data \(z^n\) would be the model \(\mathcal{M}_{\gamma_{\text{best}}}\) where

\[\begin{aligned} \gamma_{\text{best}} &:= \argmin\limits_{\gamma \in \Gamma} \left[ \underbrace{-\log \pi(\gamma)}_{\text{const}} - \log \left(p^{\text{NML}}_{v_\gamma}(z^n)\right) \right]\\ &= \argmin\limits_{\gamma \in \Gamma} \left[ \log T_\gamma + \min_{\theta \in \Theta_\gamma} \left[ - \log \left(p^\gamma_\theta(z^n) \right) - \log(v_\gamma(\theta)) \right] \right] \\ &= \argmin\limits_{\gamma \in \Gamma} \left[ \texttt{COMP}(\mathcal{M}_\gamma; v_\gamma) - \log \left(p^\gamma_{\hat{\theta}_{v_\gamma}(z^n)}(z^n) \right) - \log\left(v_\gamma\left(\hat{\theta}_{v_\gamma}(z^n) \right)\right) \right], \end{aligned}\]

where

\[\begin{aligned} \texttt{COMP}(\mathcal{M}_\gamma; v_\gamma) &= \log \int \max\limits_{\theta \in \Theta_\gamma} [p^\gamma_\theta(\zeta^n) \cdot v_\gamma(\theta)] d\zeta^n\\ &= \log \int p^\gamma_{\hat{\theta}_{v_\gamma}(\zeta^n)}(\zeta^n) \cdot v_\gamma\left(\hat{\theta}_{v_\gamma}(\zeta^n) \right) d\zeta^n. \end{aligned}\]

What are the trade-offs present in the final expression for \(\gamma_{\text{best}}\)? The MDL version of "best explanation" trades off goodness-of-fit (as measured by the probability assigned to the data, together with the luckiness value at the MDL estimator) against model complexity (as measured by \(\texttt{COMP}\)).

Another worry for computer scientists might be the computational cost of the \(n\)-fold integral inside \(\texttt{COMP}\). But recent research has established closed forms in many cases, given an appropriate choice of \(v\) [Grünwald 19].

Example: Continuing the analysis of the Bernoulli model \(\mathcal{M}_{\text{B}} := \{ p_\theta : \theta \in [0,1]\}\), we now follow not the Bayesian marginal distribution association rule, but the NML association rule.

In this case, one can show that for fixed \(n\) and with choice of luckiness function \(v \equiv 1\),

\[\hat{\theta}_{\text{ML}}(z^n) = \frac{n_1}{n}, \qquad \text{and} \qquad \texttt{COMP}(\mathcal{M}_{\text{B}}, 1) = \log \sum_{n_1 = 0}^n \binom{n}{n_1}\left(\frac{n_1}{n} \right)^{n_1} \left(\frac{n_0}{n} \right)^{n_0} = \frac{1}{2} \log n + O(1).\]
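This is easy to check numerically. A sketch that evaluates the sum in the log domain (with the log-sum-exp trick to avoid underflow, and \(0^0 := 1\)), comparing against \(\tfrac{1}{2}\log(n\pi/2)\), the sharpened asymptotic constant for this model (treat the exact constant as an assumption of this sketch):

```python
import math

def comp_bernoulli(n):
    # COMP(M_B; 1) = log sum over n1 of C(n, n1) (n1/n)^n1 (n0/n)^n0,
    # computed in the log domain (0^0 := 1)
    logs = []
    for n1 in range(n + 1):
        n0 = n - n1
        lt = math.lgamma(n + 1) - math.lgamma(n1 + 1) - math.lgamma(n0 + 1)
        if n1 > 0:
            lt += n1 * math.log(n1 / n)
        if n0 > 0:
            lt += n0 * math.log(n0 / n)
        logs.append(lt)
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logs))

comp = comp_bernoulli(1000)
half_log = 0.5 * math.log(1000 * math.pi / 2)
```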

One can further demonstrate that the resulting \(p^{\text{NML}}_v\) is asymptotically indistinguishable from \(p^{\text{BAYES}}_{w_{\text{JEFFREYS}}}\), where \(w_{\text{JEFFREYS}}\) is a particular prior known as Jeffreys' prior:

\[w_{\text{JEFFREYS}}(\theta) \propto \sqrt{\left|I(\theta)\right|} = \theta^{-1/2} (1 - \theta)^{-1/2},\]

where \(I(\theta)\) is known as the Fisher information at \(\theta\).

So it seems that the Bernoulli model exhibits a correspondence between the two association rules/universal distributions under particular choices for luckiness/prior functions.

Two-part (Sub-) Distribution

This section details a third association rule which is really a special case of Normalized Maximum Likelihood.

Given an uncountable \(\Theta\), first discretize it to a countable subset \(\ddot{\Theta}\), and equip that subset with a probability mass function \(w\). We will treat \(w\) as our \(v\) from NML, except that \(w\) has the extra property of summing to one. Define \(p^{\text{NML}}_w\) as in NML for the discretized \(\ddot{\Theta}\). Notice that

\[\begin{aligned} T &:= \int \max\limits_{\ddot{\theta} \in \ddot{\Theta}} p_{\ddot{\theta}}(z^n) w(\ddot{\theta}) dz^n\\&\leq \int \sum_{\ddot{\theta} \in \ddot{\Theta}} p_{\ddot{\theta}}(z^n) w(\ddot{\theta}) dz^n\\ &= \sum_{\ddot{\theta} \in \ddot{\Theta}} w(\ddot{\theta}) \left( \int p_{\ddot{\theta}}(z^n) dz^n \right) = 1. \end{aligned}\]

The idea now is to forget about the denominator: since we have shown it is bounded above by \(1\), we can circumvent computing \(p^{\text{NML}}_w\) directly by approximating it with the sub-distribution:

\[p^{\text{2-P}}_w(z^n) := \max\limits_{\ddot{\theta} \in \ddot{\Theta}} p_{\ddot{\theta}}(z^n) w(\ddot{\theta}).\]

\(p^{\text{2-P}}_w\) will likely sum to strictly less than \(1\) across all inputs, meaning it does not qualify as a probability distribution. But we may imagine it putting the remaining probability mass on some designated, forbidden outcome outside the outcome space. MDL is flexible enough to apply to sub-distributions (those "distributions" summing to less than \(1\)) as well as probability distributions, but never to those which sum to more than \(1\) [Grünwald 19].
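A sketch for the Bernoulli model with a small, arbitrary discretization \(\ddot{\Theta}\) and a uniform \(w\), confirming that the total mass of \(p^{\text{2-P}}_w\) indeed falls below \(1\):

```python
import math
from itertools import product

n = 6
theta_grid = [k / 10 for k in range(11)]             # a discretized Theta
w = {t: 1.0 / len(theta_grid) for t in theta_grid}   # uniform pmf on it

def p_theta(zn, t):
    return math.prod(t if z == 1 else 1.0 - t for z in zn)

def p_two_part(zn):
    # p^2-P_w(z^n) = max over discretized theta of p_theta(z^n) * w(theta)
    return max(p_theta(zn, t) * w[t] for t in theta_grid)

# Sum p^2-P over the entire outcome space {0, 1}^n.
total_mass = sum(p_two_part(zn) for zn in product([0, 1], repeat=n))
```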

Prequential Plug-in Distribution

This section details yet another association rule which can be defined from existing estimators.

Suppose we've already settled on a reasonable estimator \(\breve{\theta}\) for a given model \(\mathcal{M}\), i.e., a reasonable way to estimate which index \(\theta\) optimizes the model's ability to assign high probability to a given piece of data \(z^n\).

We could then define

\[p^{\text{PREQ}}_{\breve{\theta}}(z^n) := \prod_{i = 1}^n p_{\breve{\theta}(z^{i - 1})}(z_i \mid z^{i - 1}).\]

In particular:

  • If the components of the data are i.i.d., the probability inside the product simplifies to \(p_{\breve{\theta}(z^{i - 1})}(z_i)\).
  • For the Gaussian location statistical model \(\mathcal{M}_{\text{GAUSS}}\) defined above, a reasonable estimator is the ML estimator \(\breve{\theta}(z^m) := \hat{\theta}_{\text{ML}}(z^m) = \sum_{j = 1}^m \frac{z_j}{m}.\)
  • The ML estimator should be avoided with discrete data, as one of the factors in the product for \(p^{\text{PREQ}}_{\breve{\theta}}\) could then evaluate to zero, giving a zero product no matter how close to \(1\) the other factors might be, and hence a useless contribution to the selection of a best explanation of the data \(z^n\). To counteract this, the ML estimator may be smoothed in various ways.

Smoothing of the ML estimator can take many forms depending on the specific model. Grünwald proposes the following expression for a reasonable, smoothed estimator for the Bernoulli model \(\mathcal{M}_{\text{B}}\):

\[\breve{\theta}_{\text{smoothed}}(z^m) := \frac{\frac{1}{2} + \sum_{i = 1}^m z_i}{m + 1}.\]

With some manipulation, one can show that the prequential universal distribution \(p^{\text{PREQ}}_{\breve{\theta}_{\text{smoothed}}}\) is exactly identical to \(p^{\text{BAYES}}_{w_{\text{JEFFREYS}}}\). This exact correspondence is special to the Bernoulli (and each multinomial) model, but an asymptotic relationship of this kind holds for other models in general.
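The identity is easy to verify numerically: the smoothed plug-in prediction \((1/2 + n_1(z^m))/(m+1)\) is exactly the posterior predictive under the \(\text{Beta}(1/2, 1/2)\) Jeffreys prior, so the prequential product telescopes into the Bayes marginal \(B(n_1 + 1/2,\, n_0 + 1/2)/B(1/2, 1/2)\). A sketch:

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def p_preq_smoothed(zn):
    # Prequential plug-in with theta_smoothed(z^m) = (1/2 + sum z_i)/(m + 1).
    p = 1.0
    for m, z in enumerate(zn):
        theta = (0.5 + sum(zn[:m])) / (m + 1)
        p *= theta if z == 1 else 1.0 - theta
    return p

def p_bayes_jeffreys(zn):
    # Bayes marginal under Jeffreys' prior Beta(1/2, 1/2):
    # B(n1 + 1/2, n0 + 1/2) / B(1/2, 1/2)
    n1 = sum(zn)
    n0 = len(zn) - n1
    return math.exp(log_beta(n1 + 0.5, n0 + 0.5) - log_beta(0.5, 0.5))

seq = [1, 0, 1, 1, 0, 1]
```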

More Association Rules

Further work has been done in the past few decades on even more options for universal distributions, including

  • The Switch Distribution: a particular type of nested model selection, which arguably works better than the preceding options.
  • The Universal Distributions based on the Reverse Information Projection: offer improved error bounds and stopping behavior in hypothesis testing.

Selecting an Association Rule

To decide which \(\bar{p}\) to associate to a given \(\mathcal{M}\), let us first define the fitness ratio for given data \(z^n\) and a given luckiness function \(v: \Theta \to \mathbb{R}^+_0\):

\[\text{FR}_v(\bar{p}, z^n) := \frac{\bar{p}(z^n)}{\max\limits_{\theta \in \Theta} p_\theta(z^n) v(\theta)}.\]

In the case of \(v \equiv 1\), the fitness ratio formula reduces to

\[\text{FR}(\bar{p}, z^n) := \frac{\bar{p}(z^n)}{p_{\hat{\theta}_{\text{ML}}(z^n)}(z^n)}.\]

In this setting, the working postulate is that a better universal distribution is one whose fitness ratio tends to be as large as possible. Because we restrict ourselves to probability distributions, we need not worry about overfitting, and we focus on finding distributions whose fits are proportional to that of the best-fitting distribution in \(\mathcal{M}\). We can formalize our search by asking the fitness ratio to be large in the worst case: setting

\[\bar{p} := \argmax_{\bar{p}} \min_{z^n \in \mathcal{Z}^n} \text{FR}(\bar{p}, z^n).\]

One result is that this maximin problem has a solution iff the model complexity \(\texttt{COMP}(\mathcal{M}; v)\) is finite, and that the solution is exactly \(p^{\text{NML}}_v\). This gives special status to the NML distribution as the "most robust choice of universal \(\bar{p}\)" [Grünwald 19].
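Both halves of this result show up in a small experiment: for the Bernoulli model with \(v \equiv 1\) and tiny \(n\), NML's fitness ratio is constant at \(1/T = e^{-\texttt{COMP}}\), while a competing universal distribution (here the Bayes marginal under a uniform prior, an arbitrary choice of competitor) has a worst-case ratio no better:

```python
import math
from itertools import product

n = 4
seqs = list(product([0, 1], repeat=n))   # the whole outcome space {0,1}^n

def p_ml(zn):
    # likelihood under the best-fitting Bernoulli distribution (0^0 := 1)
    n1 = sum(zn)
    n0 = n - n1
    t = n1 / n
    return (t**n1 if n1 else 1.0) * ((1.0 - t)**n0 if n0 else 1.0)

T = sum(p_ml(zn) for zn in seqs)          # normalizer: e^COMP(M_B; 1)
p_nml = {zn: p_ml(zn) / T for zn in seqs}

def p_bayes_uniform(zn):
    # Bayes marginal with uniform prior: B(n1 + 1, n0 + 1) in closed form
    n1 = sum(zn)
    n0 = n - n1
    return math.exp(math.lgamma(n1 + 1) + math.lgamma(n0 + 1)
                    - math.lgamma(n + 2))

def worst_case_fr(pbar):
    # min over all data sequences of FR(p-bar, z^n)
    return min(pbar(zn) / p_ml(zn) for zn in seqs)

fr_nml = worst_case_fr(lambda zn: p_nml[zn])   # constant, equals 1/T
fr_bayes = worst_case_fr(p_bayes_uniform)
```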

Historically, for this reason, NML was considered the gold standard among association rules, the rest to be used only when computing NML was too expensive, i.e., for pragmatic reasons. Grünwald shows that all of the association rules defined above produce universal distributions whose code lengths are asymptotically within a constant (\(O(1)\)) of NML's, in the worst case or in some weaker, expected sense. So one may use whichever rule is reasonable to compute while remaining close to NML.

However, this is not the only desirable property for an association rule to possess; other rules can achieve even higher fitness ratios on subsets of the data (while doing more poorly overall), and offer other specialized properties [Grünwald 19].


Sources

[Grünwald 19] Peter Grünwald and Teemu Roos. Minimum Description Length Revisited. 2019.
[Rissanen 78] Jorma Rissanen. Modeling by Shortest Data Description. Automatica, 1978.
[Rissanen 96] Jorma Rissanen. Fisher Information and Stochastic Complexity. IEEE Transactions on Information Theory, 1996.
[Rasmussen and Ghahramani 00] Carl E. Rasmussen and Zoubin Ghahramani. Occam's Razor. NIPS, 2000.
