The 1-D Cluster Variation Method (CVM) – Simple Application

The 1-D Cluster Variation Method (CVM) – Simple Application

The 1-D Cluster Variation Method – Application to Text Mining and Data Mining

There are three particularly good reasons for us to look at the Cluster Variation Method (CVM) as an alternative means of understanding the information in a system:

  1. The CVM captures local pattern distributions (for an equilibrium state),
  2. When the system is made up of equal numbers of units in each of two states, and the enthalpy for each state is the same (the simple unit activation energy is zero), we can derive an analytic solution for the configuration (pattern) variables at the free energy minimum, and
  3. We can thus express pattern distributions in the equilibrium state as a function of a single interaction-energy parameter h. This can help characterize the patterns, and even characterize (more broadly) the context from which the data was derived.

This latter point is particularly important.

When we select a dataset with equal numbers of units in states A and B, then we can characterize the pattern distribution of these units by virtue of a single parameter.

This is a new information measure.

In particular, we get a descriptive information measure based on the fact that the system is at equilibrium.

We don’t need (Bayesian) conditional probabilities.

We do obtain a number that helps characterize the system, and actually positions it in a phase space.

Specifically, for equiprobable unit state distributions at equilibrium, we position the system on the smoothly continuous curve that gives the pattern distributions as a function of interaction energy.

This gives us a measure of context; a means of anticipating certain kinds of patterns, if not the exact patterns themselves.

In situations where it is difficult to determine Bayesian priors, this can be an essential clue for pattern recognition, data mining, text mining, and numerous other tasks.

The following presents a brief illustration of how this works.

A Brief Historical Background

Years ago, there was a fundamental split in how we thought about entropy.

Claude Shannon’s pivotal work – leading to the whole realm of information theory – led us to extract the entropy term out of the total free energy formalism and consider it in isolation.
The role of enthalpy (the energy term comprising both unit energy and interaction energies between units) became subsumed in conditional probabilities within information theory.

This led to numerous breakthroughs, and influenced generations of researchers.

What might have been lost during this time, however, was the whole notion of free energy minimization – of reaching an equilibrium state.

Shannon wrote his first pivotal work in 1948.

Some three years later, Ryochi Kikuchi developed an equally compelling – although far less well-known – approach to modeling entropy; the Cluster Variation Method (CVM).

The CVM helps us look at the entropy of a system as not just the distribution of units into singular energy states, but also to account for the local patterns of units.

The full 1-D and 2-D CVM equations are presented in two Technical Reports (see references at the end of this post).

Italian-renaissance-border-2-thin

A Simple Illustrative Application

Consider a text string: Please like me on Facebook thank you.

With spaces removed, this is a 30-character string with equal numbers of consonants (C) and vowels (V). In short, we have an exemplar data string that satisfies our condition for applying the analytic 1-D CVM solution: a data string with equiprobable distributions into states C and V.

To apply the 1-D CVM, we arrange these units in a single zigzag chain, as shown in the figure below.

1-D CVM application to a simple text string.
1-D CVM application to a simple text string.

For reference, and description of the fraction variables (the various z’s), as well as a summary of the 1-D CVM equations, see:

The figure above shows the exact distribution for each of the configuration patterns.

To obtain the fraction variables (the set of z’s) for this illustration, we divide each of these distributions by the total number of configurations. (Note: I have set this up – similar to the illustration in Visualizing Configuration Variables with the 1-D CVM – as a 1-D wrap-around. Thus, the number of configurations is equal to the number of units, which in this case is 30.)

This result – the set of actual z(i) variables – is shown in the following figure.

The fraction variables z(i) for a simple 1-D CVM text string illustration.
The fraction variables z(i) for a simple 1-D CVM text string illustration.

We notice that – even though the analytic solutions are carried out with the equivalence assumption that z(1) = z(6), z(2) = z(5), and z(3) = z(4), we do not have precisely this equality within our data. Instead, z(1) = 0.067 and z(6) = 0.033, and z(2) = 0.233 while z(5) = 0.267.

To call the analytic solution into play, we need equivalence.

Thus, we take the averages of z(1) and z(6), and of z(2) and z(5). We justify this on the rationale that we are using an extremely small dataset, and that a larger (and more realistic) dataset would likely result in the desired equivalences.

This gives us the averaged equivalences, shown in the following table.

The z(i) variables with averaging between z(1) and z(6), and between z(2) and z(5), for the simple 1-D CVM illustration.
The z(i) variables with averaging between z(1) and z(6), and between z(2) and z(5), for the simple 1-D CVM illustration.

We immediately notice that this is not the kind of distribution that we would get if the interaction energy (between units in states C and V) was zero.

In fact, the following table recapitulates (from CVM I: 1-D Single Zigzag Chain: Basic Theory) what the expected fraction variables would be if there were no interaction energy, versus what we actually find in this example. (Notice that this table only deals with fraction variables z(1), z(2), and z(3) – we’ve already assumed equivalence between z(1) and z(6), z(2) and z(5), and z(3) and z(4).)

Observed fraction variables z(i) versus those expected when the interaction enthalpy is zero.
Observed fraction variables z(i) versus those expected when the interaction enthalpy is zero.

Italian-renaissance-border-2-thin

Interpreting the Results

From the last table of the preceding section, we immediately understand that what we have is something like an anti-ferromagnetic system. Instead of like units being preferentially next to like, they are preferentially next to not-like.

This, when we think about the structure of the English language, makes perfect sense.

Our language tends to (preferentially) put vowels between consonants, rather than having run-on strings of consonants and run-on strings of vowels. In fact, it is somewhat difficult (in English) to obtain run-on strings of vowels, and only slightly less difficult to obtain run-on strings of consonants.

The question that we want to ask ourselves now is: where in the spectrum of ferromagnetic-like and anti-ferromagnetic-like systems does this particular example fall?

For the answer, we consult our graph of the analytic solution of the configuration cluster variables as a function of the interaction energy, previously presented in Analytic Single-Point Solution for 1-D Cluster Variation Method Variables. The graph is replicated below.

Cluster Variation Method - Analytic solution of single zigzag chain when x1=x2=0.5.
Cluster Variation Method – Analytic solution of single zigzag chain when x1=x2=0.5.

From a table that generated the results of this graph, I have:

  • When z(1) = 0.054, h = 0.7, and when z(1) = 0.035, h = 0.6, so that simple interpolation gives us a (localized fit) straight line equation of z(1) = 0.19(h) – 0.079, or h = 0.68 when z(1) = 0.5.
  • When z(3) = 0.225, h = 0.7, and when z(3) = 0.186, h = 0.8, so that simple interpolation gives us a (localized fit) straight line equation of z(3) = -0.39(h) + 0.497, or h = 0.76 when z(1) = 0.5.

There is a disparity, obviously. This comes because the actual data points are skewed; I averaged results across z(1) and z(6), z(2) and z(5), and z(3) and z(4).

But, just to carry out the illustration, let’s take the average of the two results for h, as h = 0.72. This would give us expected results for z(1) = 0.058 and z(3) = 0.217. The comparison for several values of the interaction enthalpy term h is given in the table below.

1-D CVM fraction variables z(1) and z(3) for interaction parameter h = 0.72, 074, and 0.76 - in the neighborhood of observed fraction variable results.
1-D CVM fraction variables z(1) and z(3) for interaction parameter h = 0.72, 074, and 0.76 – in the neighborhood of observed fraction variable results.

The table shows increasing values for z(1) (from 0.58 to 0.67) as h increases from 0.72 to 0.76, and decreasing values for z(3) (0.217 to 0.201) for the same extent of h.

We keep in mind that h is not the interaction energy itself, rather (see the graph above):

CVM_1-D_def-for-h_2014-06-25_crppd

As an approximate range for h, we thus have h = 0.72 – 0.76 for this particular illustration. This is clearly in the realm of anti-ferromagnetic-like behaviors.

What This Means

The example just presented does two things:

  1. A simple, easy-to-follow instance of how the 1-D CVM can be applied to analyzing a sequential bistate data series, when the number of items in each state is approximately the same,
  2. An illustration of how this method provides a single-parameter result (or in this case, a parameter range), typifying the overall pattern distribution of the data.

There are many instances of text and data analysis where this method can prove useful. One such application would be in entity determination in text mining, where this method provides an alternative to Bayesian prior probabilities for determining the likely entity-type for a given entity in the text sequence.

In such an instance, it offers an alternative to the well-known information-theoretic methods, such as Kullback-Leibler.

From a different point of view, and perhaps more important for future work, the potential usefulness of this method lies in providing a context for data mining.

We are very aware that correct parameter selection (Bayesian priors, or even feature vector element selection and weighting) is very much dependent on the context from which a data source was extracted. It is often difficult to measure context.

The interaction enthalpy parameter, reconfigured as the parameter h, presents a simple, one-parameter measure that can help characterize the context of different data sets. It can be useful as a mapping between a context vector and a simple parameter (range). (This will not necessarily be a one-to-one mapping.) Then, the parameter h can be further used to associate a feature vector element weighting scheme with the next analytic step. This can include selection of appropriate Bayesian priors.

Thus, this simple example shows how the Cluster Variation Method is both an alternative means of characterizing a sequentially-organized data item (a single text document), or even a collection of such data items. It could be used to help partition a collection into subgroups for later analysis. It provides not only a characteristic of and insight into the local pattern distribution, but also potentially a means of providing context, which can influence further analysis.

Italian-renaissance-border-2-thin

Further Reading

Yedidia, Jonathan S., Freeman, William T., and Weiss, Yair (2002). Constructing Free Energy Approximations and Generalized Belief Propagation Algorithms, TR-2002-35 August 2002, MITSUBISHI ELECTRIC RESEARCH LABORATORIES, http://www.merl.com.

Abstract:
Important inference problems in statistical physics, computer vision, error-correcting coding theory, and artificial intelligence can all be reformulated as the computation of marginal probabilities on factor graphs. The belief propagation (BP) algorithm is an efficient way to solve these problems that is exact when the factor graph is a tree, but only approximate when the factor graph has cycles.

We show that BP fixed points correspond to the stationary points of the Bethe approximation to the free energy for a factor graph. We explain how to obtain region-based free energy approximations that improve the Bethe approximation, and corresponding generalized belief propagation (GBP) algorithms.

(With permission, MERL.)

Related Posts and Technical Reports

More useful readings in statistical mechanics and information theory.

Leave a Reply

Your email address will not be published. Required fields are marked *