Wrapping Our Heads Around Entropy


Entropy – the Most Powerful Force in the ‘Verse:

 

Actually, that’s not quite true. The most powerful force in the ‘verse is free energy minimization. However, entropy is half of the free energy equation, and it’s usually the more complex half. So, if we understand entropy, then we can understand free energy minimization. If we understand free energy minimization, then we understand all the energy-based machine learning models, including the (restricted) Boltzmann machine and one of its most commonly used learning methods: contrastive divergence.

If we can wrap our heads around entropy, then we also clear our way to understanding more complex and sophisticated energy-based machine learning methods, beginning with expectation maximization, and moving on to variational Bayes. Variational Bayes is one of the leading methods for inference, and inference is what we need to deal with situations that we’ve never seen before. For example, inference is essential for autonomous vehicles, because they’ll inevitably encounter new situations and have to infer what they’re seeing and what they have to do based on previously-learned situations and responses. The difference is that inference goes beyond the previously-learned material.

In short, entropy is the gateway to all the advanced, interesting machine-learning methods.

For us, being able to internalize the entropy equation is like Ulysses getting through the whirlpool Charybdis. It’s like Luke Skywalker entering into the cave at Dagobah. It’s where we meet the dark side of the Force.

Now, we’re not going to try to learn inference methods today. We’re going to focus on the basics; on entropy.

Jedi Master Yoda, a fictional character presented in The Empire Strikes Back (https://en.wikipedia.org/wiki/The_Empire_Strikes_Back), produced by LucasFilm, Ltd. This low-resolution image is reproduced here for educational and discussion purposes, within the realm of the Fair Use Act.

Trying to learn inference before we have the basics (like entropy) would be like trying to lift an X-wing fighter out of Dagobah’s muck without first having learned to meditate and focus our minds.

(For a quick look back at the scene, see: Yoda lifting the X-wing fighter out of the muck at Dagobah.)

Practically speaking: many of us venturing into machine learning try to skim by the math and physics fundamentals underlying the energy-based models. That’s like trying to take on the Empire without having spent time mastering the Force.

So what we’re doing here and now is starting a period of Force-training, and we begin with the very basics. We start with entropy; what it is and how to use it.

 
 

Simple Entropy

 

Our basic entropy definition is:

\bar{S} =  - \sum\limits_{i=1}^{I} x_i \ln(x_i),

where I is the total number of possible energy states in the system. For a simple bistate system (e.g., where units or “neurons” can be either “on” or “off”), I = 2. Then, x1 refers to the fraction of units in the “on” state, and x2 refers to the fraction in the “off” state. Also, because we’re dealing with just two states, x1 + x2 = 1.0.

This is the entropy when all that we’re counting are the number of units in specific states. When we have just two states, we sum over just those two state-possibilities, which is why the index i in the above equation will run as i=[1..2] in the next equation.

If we were to write out the longer form for the simple entropy equation, it would look like

\bar{S} =  - [x_1 \ln (x_1) + x_2 \ln (x_2)].

Please note that we’re dealing with something called “reduced entropy” (and in other related discussions, with “reduced enthalpy” and “reduced free energy”); these are indicated by putting a bar over the term. (See the bar-S above.) This just means that we’ve divided the original entropy by both N, the total number of units in the system, and by k, Boltzmann’s constant. This just gives us a simpler, dimensionless equation with which to work. Neither N nor k is shown as we get to this equation; we’re sort of jumping into the middle.
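
If seeing the numbers helps, here is a minimal Python sketch of that reduced entropy for a bistate system. (The function name, reduced_entropy, and the use of NumPy are just my own choices for illustration; nothing in the equation dictates them.)

import numpy as np

def reduced_entropy(x1):
    """Reduced entropy S-bar = -[x1*ln(x1) + x2*ln(x2)] for a bistate system,
    where x2 = 1 - x1, and "reduced" means we have already divided by N*k."""
    x2 = 1.0 - x1
    # Take 0*ln(0) as 0 (its limiting value), so the endpoints do not blow up.
    return -sum(x * np.log(x) for x in (x1, x2) if x > 0.0)

print(reduced_entropy(0.5))   # 0.6931... = ln(2), the even split
print(reduced_entropy(0.1))   # 0.3251..., smaller as we head toward an extreme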

The logarithmic function; y = ln(x). Note that y is negative for any value of x between 0 and 1. The log of 1 is zero; ln(1) = 0. The log of 0 is undefined; it approaches negative infinity.

Notice that there is a minus sign at the beginning of the entropy equation. This minus sign is present at the beginning of every entropy equation, no matter how we write it.

When we think about it, this makes sense. The entropy equation involves taking the (natural) logarithm of fractional numbers. (Both x1 and x2 are less than 1; recall that x1 + x2 = 1.)

The logarithm of a fraction (any number between 0 and 1) is negative. (Remember that the logarithm of 1 is 0, the logarithm of 0 is undefined, but approaches minus infinity, and thus the log of any number between 0 and 1 is going to be negative.)

So, without the minus sign, we’d be dealing with the sum of (negative) terms. Each term would have a fraction (e.g., x1), multiplying the logarithm of that fraction, and the log would be negative. Thus, we’d be summing negative numbers.

The Second Law of Thermodynamics.

However, the Second Law of Thermodynamics states that the entropy of an (isolated) system can only increase; it can never decrease.

We increase entropy by spreading out the distribution of units among the available states. In our simple bi-state system, this means more like “half-and-half”, and less like going to extremes (an extreme would be mostly “on” and fewer “off,” or vice versa).

If we’re going to increase entropy, we need an equation that moves to a maximum as we make the distribution over states more even, that is, as x1 gets closer to x2. We do that by putting the minus sign in front.

See the figure below, which shows that entropy reaches a maximum when x1 = x2 = 0.5.
(Note that because x1 + x2 = 1, we can write “x” for “x1”, and “(1-x)” for “x2.”)

The entropy for a bistate system is at a maximum when x1 = x2 = 0.5.
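
If you would rather confirm that numerically than read it off the graph, here is a tiny sketch (again, just illustrative NumPy) that scans x1 across (0, 1) and reports where the bistate entropy peaks.

import numpy as np

# Scan x1 over (0, 1) in steps of 0.01 and find where the bistate entropy peaks.
xs = np.linspace(0.01, 0.99, 99)
s_bar = -(xs * np.log(xs) + (1.0 - xs) * np.log(1.0 - xs))
print(round(float(xs[np.argmax(s_bar)]), 2))   # 0.5: the even split maximizes the entropy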

However, we don’t just want to maximize entropy (important though that is).

What we really want is to minimize free energy.

Our free energy equation looks like

\bar{G} = G/NkT = \bar{H} - \bar{S}.

This tells us that our (reduced) free energy is a combination of two terms. The first is the (reduced) enthalpy, which comprises the energy contributions both from units being in their various activation states and from their interactions with each other. The second is the entropy.
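
To get a feel for how the two terms trade off, here is a sketch of the reduced free energy for the bistate system. The enthalpy in this sketch is a deliberately made-up stand-in (a flat per-unit activation cost, eps, for each “on” unit), just so that there is something for the entropy to balance against; the real enthalpy expression is the part we are deferring to another day.

import numpy as np

def reduced_free_energy(x1, eps=0.3):
    """G-bar = H-bar - S-bar for the bistate system.
    H-bar = eps * x1 is a hypothetical stand-in enthalpy, not the real thing:
    just a per-unit activation cost eps for each "on" unit."""
    x2 = 1.0 - x1
    s_bar = -(x1 * np.log(x1) + x2 * np.log(x2))   # reduced entropy, as before
    return eps * x1 - s_bar

# The minimum sits a bit below x1 = 0.5, because "on" units now cost energy.
xs = np.linspace(0.01, 0.99, 99)
print(round(float(xs[np.argmin([reduced_free_energy(x) for x in xs])]), 2))   # 0.43 for eps = 0.3

With eps set to zero, the minimum lands right back at x1 = 0.5, which is just the pure-entropy result from the figure above.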

We’ll deal with the enthalpy (the energy from unit activations and interactions) some other time, some other day.

Right now, we’ll focus in on the entropy.

If we want to minimize the free energy, and the free energy is a sum of “something” (enthalpy) plus something involving the entropy, then we need a negative sign in front of the entropy: that way, maximizing the entropy pulls the free energy down towards its minimum.

In short, for the entropy to contribute to free energy minimization, it needs to be bowl-shaped, with the bowl pointing up – as opposed to its normal formation, with the bowl pointing down.

Thus, by the time that we bring the entropy term into the free energy equation, we get two minus signs; the minus that we need for the free energy to move towards an equilibrium (minimum) state, and the minus sign intrinsic to the entropy itself. Multiplying each other, they cancel out – so those negative values that come from taking the logarithms actually show up in the final term values.

We call this the negative entropy, or negEntropy, or negS (where S stands for entropy).

Here’s the negEntropy diagram, again for a simple bistate system (so it’s just the reverse of the previous figure).

The negative entropy (negEntropy, or negS) for a simple bistate system.

Now, here’s the interesting thing from the whole graph – and the point of today’s discussion.

Any natural system, left to its own resources, will move towards a free energy minimum. Barring the enthalpy-influence (which we’ll discuss another day), it will move towards the lowest point in the negEntropy curve.

Energy-based machine learning algorithms also work by moving towards a free energy minimum, meaning that they’re moving towards the same central point – again, as modified by enthalpy.

Although the enthalpy term will be able to shift where the system winds up a little bit from side-to-side, one thing clearly stands out. In fact, this is a key role of the entropy, as far as we’re concerned.

At the extrema points (where x1 approaches either 0 or 1), the negEntropy approaches zero.

The entropy keeps the system from going to extremes.

What that means is simply this: as enthalpy influences the free energy, so that the final free energy minimum is somewhere near (but not precisely at) the negEntropy minimum, we can see that we might get a value of x1 = 0.4, or maybe x1 = 0.3. A value of x1 = 0.2 is really pushing it.

And a value of x1 = 0.1? (Or conversely, x1 = 0.9?)

Not so likely.

The reason is that if x1 = 0.1, the negEntropy value is really starting to go up the steep side of the negEntropy-diagram. It’s getting very far from the central area, where the minimum is to be found.

More to the point, as x1 approaches zero (or conversely, as x1 approaches one; the two sides are identical), then the negEntropy approaches zero. The following table illustrates this, without our doing mathematical proofs. (Those would be easy enough, but we just want to get a sense of what’s going on here.)

At the extrema points (where x1 approaches either 0 or 1), the negEntropy approaches zero.

To break it down just a little bit (using the details from the above table), it’s clear that as x1 approaches zero, its logarithm will become an increasingly large negative number. However, this log multiplies x1, and the result is very small; close to zero. At the other side, x2 = 1 – x1 approaches one, and its logarithm approaches zero. So we have a number that’s close to one multiplying a number that’s close to zero; again, we get a number close to zero. So we’re adding two numbers that are close to zero, and the result is thus also close to zero.
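
If you would like to generate those table values yourself rather than take them on faith, here is a small sketch that prints each term, and their sum (the negEntropy), as x1 heads toward zero. The sample points are an arbitrary choice.

import numpy as np

print(f"{'x1':>6} {'x1*ln(x1)':>11} {'x2*ln(x2)':>11} {'negS':>8}")
for x1 in (0.5, 0.1, 0.01, 0.001):
    x2 = 1.0 - x1
    term1 = x1 * np.log(x1)   # as x1 shrinks, a big negative log times a tiny x1
    term2 = x2 * np.log(x2)   # as x2 approaches 1, its log (and this term) approaches 0
    neg_s = term1 + term2     # negEntropy = -S-bar = the sum of the two terms
    print(f"{x1:>6} {term1:>11.4f} {term2:>11.4f} {neg_s:>8.4f}")
# Both terms head toward zero as x1 -> 0 (or, symmetrically, as x1 -> 1),
# so the negEntropy approaches zero at the extremes and is most negative at 0.5.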

The summary story is: our negEntropy will be a bowl-shaped curve. The negEntropy values near the extrema of x1 approaching either zero or one will be close to zero. The middle value for the negEntropy is fixed; it’s what we get when x1 = x2 = 0.5.

In short, there’s not much mystery to this entropy equation. It is a fixed-value equation; the entropy (and likewise, the negEntropy) does not depend on any parameters.

The free energy, which does depend on enthalpy parameters (activation and interaction energy coefficients), can vary; it will shift as a function of those parameters.

We’ll want to study how the enthalpy terms in the free energy equation work; that’s a topic for another day. For now, though, we’ve got a good handle on the entropy.

 
 

Immediate Practical Implication for Neural Networks

 

A neural network that uses an energy-based model for achieving good connection weights will seek to have at least one – but not all – of the hidden layer nodes in the “on” state for each pattern that it learns. Preferentially, it will move towards having about half of its hidden layer nodes “on” for any given pattern – allowing (of course) for the influence of the enthalpy parameters (which can push it towards a not-even split of “on” and “off” hidden layer nodes).

From the preceding discussion, we know that any neural network that relies on an energy-based approach to getting its connection weights will be trying to find the free energy minimum. Setting the enthalpy aside, the entropy term in the free energy equation pushes the network toward something in the range of x1 = x2 = 0.5; that is, it will try to maximize its entropy (and so help minimize its free energy) by getting half the hidden layer nodes into the “on” state, and half of them “off.”

Of course, this will all get perturbed by the enthalpy parameters. It’s not very likely at all that we’ll get a 50:50 split in our “on” and “off” states.

However, we know that the network will try to have at least one hidden layer node that is “on,” and conversely, it will also try to have at least one hidden layer node that is “off.” It will try to avoid the extremes. Our enthalpy parameters may push it to an extreme, but the entropy will try to keep it to the literal “middle-of-the-road.”
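
As a rough illustration of that pull toward the middle (a toy calculation only, not any particular network’s training code), here is a sketch that computes the entropy contribution for a small hidden layer at each possible count of “on” nodes. The layer size of ten is an arbitrary choice.

import numpy as np

n_hidden = 10   # arbitrary hidden-layer size, just for illustration
for n_on in range(n_hidden + 1):
    x1 = n_on / n_hidden   # fraction of hidden nodes in the "on" state
    # Reduced entropy for this on/off split (0*ln(0) taken as 0 at the ends).
    s_bar = sum(-x * np.log(x) for x in (x1, 1.0 - x1) if x > 0.0)
    print(f"{n_on:>2} nodes on: S-bar = {s_bar:.3f}")
# S-bar is zero with all nodes "off" (or all "on"), and largest at the 5/5 split,
# so, other things being equal, free energy minimization favors the middle.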

 
 

Relevance to the Rest of Machine Learning

 

So now, we have a handle on entropy.

It’s useful right now to step back and see what this has just bought us.

Entropy is one of the key concepts for machine learning, as shown in the following figure.

Seven equations from statistical mechanics and Bayesian probability theory that you need to know, including the Kullback-Leibler divergence and variational Bayes. We’ve just addressed the third important equation; entropy. Notice that in the figure, the enthalpy is represented as U, and in the previous text, as H. Both notations for enthalpy are common.

We have an important component for understanding the Ising equation, which is the theoretical jumping-off point for the Hopfield neural network and the original Boltzmann machine. We’re in a very good place to start studying energy-based methods, including expectation maximization, as that relies on free energy minimization.

There are a couple of important topics that we haven’t covered. One is how we get the notion of entropy in the first place; this is represented by points (1) and (2) in the preceding figure. The key thing to understand there is that entropy – representing distribution over all possible states – is highly related to the notion of microstates (which tells us how many states there are to distribute over). I haven’t read many (ok, I haven’t read ANY) good, easy-to-follow discussions of microstates.

That’s why I wrote my own, and if you haven’t completed a graduate-level course in statistical mechanics, you might want to sign up for my machine learning mailing list. It starts off with microstates, and takes you through the whole concept, over an eight-day period.

You can sign up using the form just below.

There are a few other things that we need to consider – such as how the enthalpy (activation and interaction energies) work, and how we can derive the free energy equation. (From the previous figure, you can see that it shows up in two distinct forms; we’d like to get one from the other).

The first step, though, is to really understand microstates. If you haven’t done so yet, get on top of this important foundation by using the Opt-In form below.

 

 

Opt-In to get your Precis and Bonus Slidedeck – Right HERE:






Opt-in to access Seven Essential Machine Learning Equations


 

 

 

Live free or die, my friend –

AJ Maren

Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War

 
 

Reward for Getting This Far

 

OK, this joke is hugely sexist. However, 90% of you, my dear Readers, are guys. You will totally get this, and by now, you can probably understand the punch line. Keep in mind: endothermic means absorbing heat. Exothermic means producing heat. And oh yeah, you probably also need to review freshman chemistry on Boyle’s law.

Having said that, the joke is: The Situation in Hell.

 
 

