What We Really Need to Know about Entropy


There’s This Funny Little “Gotcha” Secret about Entropy:

Nobody mentions this secret. (At least in polite society.) But here’s the thing – entropy shows up in all sorts of information theory and machine learning algorithms. And it shows up ALONE, as though it sprang – pure and holy – from the head of the famed Ludwig Boltzmann. What’s wrong with this picture is that entropy never lives alone, in isolation.

In the real world, entropy exists – always – hand-in-hand with another great force, called enthalpy. These two, combined together, give us free energy – that which is minimized throughout nature. (Yes, free energy minimization is THE FORCE.)
Here’s our basic free energy equation once again:

F = H - TS.

This equation states that the free energy (F) is equal to the enthalpy (H) minus the temperature (T) times the entropy (S).

The really odd thing is that in information theory and other machine learning applications of entropy, we often present the entropy as though it exists in isolation – as though the (reduced) entropy term has been “cut off” from the whole (reduced) free energy equation. (“Reduced” is indicated by a bar over the terms. It’s also why the “T” goes missing; it’s been “reduced” out of the equation. We’ll discuss that in the next two sections.)
As an aside: you may have come across the free energy equation in the form that is more often used in machine learning circles, where the authors introduce the “energy” of the system. Then they say that the free energy is found by taking the logarithm of the partition function, and will give an equation that looks like the one in the following figure.

The free energy, F, is given as the negative of Boltzmann’s constant (k) times temperature (T) times the logarithm of the partition function, here called Z, and sometimes called W or the Greek Omega. The partition function involves the sum over all possible microstates of the energies of the units inhabiting that particular microstate.
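Written out in symbols (this is just the standard statistical-mechanics form of that statement, spelled out here for reference):

F = - k_B T \ln Z , \qquad Z = \sum\limits_{i} e^{- E_i / (k_B T)} ,

where the sum runs over all possible microstates i, and E_i is the total energy of microstate i.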

There’s a derivation from the equation in the above figure that leads to the first equation in this post; we’ll reserve that for another time and place. (Most likely, as a section in a chapter of my book-in-progress, Statistical Mechanics, Neural Networks, and Machine Learning.) It’s just that … if you’ve seen the partition function version of the free energy equation, don’t take it amiss that we’re working with a different way of expressing free energy now. They really are the same thing.

Now bear with me for a little, ok? Let’s go through some preliminaries, because we want to simplify this equation just a bit.


Getting Clear on Units and Notation

Just to make things perfectly obscure: depending on the author, the free energy may also be denoted by G (for Gibbs free energy) or A (for Helmholtz free energy); the distinctions between Gibbs and Helmholtz free energies have to do with keeping constant certain physical system values such as temperature, volume, and pressure. Since we’re thinking about free energy in a more abstract sense, we drop any terms that differentiate Gibbs and/or Helmholtz free energies from the simple equation above.

Also, the enthalpy H may also be denoted (and even described) as the system energy E, or as U, the internal energy. This means you might read an author introducing the enthalpy-part of the free energy and describing it as the “energy of the system.” That’s ok; the author is indeed identifying the energy (really, the enthalpy) – just don’t get it mixed up with the free energy; these are very different beasts. Or, more appropriately, energy (enthalpy) is part of the free energy equation, just as entropy is. Also, sometimes authors refer to the energy (enthalpy) as the “Hamiltonian” of the system. All correct, and it involves more physics than we want at this moment.

Also, sometimes – especially in machine learning circles – certain authors will use H for the entropy, instead of S.

Now that we’ve got the notation all muddled up, let’s talk about units, as in units-of-measurement.

This is terribly important, because the equation above is in the units of energy, e.g., joules.

Makes sense, right? After all, it is an energy equation.

However, for machine learning applications, we’re not interested in those aspects of the physical world that contribute to real, physical energy. We’re doing something more abstract.

For machine learning applications, we want a dimensionless free energy equation.

Thus, we want to get rid of any distracting real-physical-world elements.

To do this – and this is especially important in machine learning circles – we often divide the entire free energy equation through by the temperature T (which refers to absolute temperature, in kelvin, if you really wanted to know).

Actually, we usually divide through by NkT, where N is the number of units in the system, k is Boltzmann’s constant, and T is (of course) the temperature. (Side note: k, often written with the subscript B, for Boltzmann, has units of joules/kelvin, where joules is an energy measure and kelvin is a temperature measure; so when we multiply k*T, we get a units multiplication of (joules/kelvin)*(kelvin), meaning that the result is an energy measure, in joules.)
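As a quick dimensional check (N is a pure count, so it carries no units): F/(NkT) has units of joules / ( (joules/kelvin)*(kelvin) ) = joules/joules; that is, no units at all, which is exactly what we want.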

You didn’t see the constant k in the equation above; that’s because it was part and parcel of the true entropy equation, given as

S = - k_B \sum\limits_{i} P_i \ln P_i,

where P_i is the probability associated with the total number of units occurring in a given microstate. (The sum is over microstates, not numbers of units or energy levels. If this seems at all obscure, please go to the Themesis “About” page, scroll all the way down to the BOTTOM of the page, and use the Opt-In form; you will soon (if not right away) get an invitation to opt in again (differently) to receive a new PDF, Top Ten Terms in Statistical Mechanics, which will be coupled with a series of autoresponder emails. They’ll give you a lot of information about microstates and the partition function, leading towards entropy. Lots of diagrams. The EASIEST introduction to microstates and the partition function, ANYWHERE, EVER. Promise!)

We can extract N, the total number of units in the system, using the relation

P_i = N p_i,

where p_i refers to the probability – as a fraction of the total – of units appearing in a given configuration. (This really means appearing in a given microstate, which deals with the numbers of units in different energy levels.)

This gives us

S = - k_B N \sum\limits_{i} p_i \ln p_i .

We insert this entropy equation into the total free energy equation above, and divide through by NkT. The result is that we get the whole equation down to something that is dimensionless – and has the total number of units removed, as well. This makes life so much simpler. It’s all now just about fractions of units in different energy states.

We traditionally put a “bar” over these new dimensionless terms, and call this the reduced free energy equation.
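Spelled out for the entropy term (just substituting the entropy equation above and dividing the T*S piece of the free energy by NkT), we get the reduced entropy:

\bar{S} = \frac{S}{N k_B} = - \sum\limits_{i} p_i \ln p_i ,

which is dimensionless and free of both k and N.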

Here’s an example of a reduced free energy equation; I give some specifics for the reduced enthalpy (it’s the two terms involving the epsilons), and note that the Greek Omega in the following equation refers to the total number of microstates in the system. Here’s a nice little online summary of some key thermodynamic equations. If you’re up for a little more, here’s a Coursera presentation on entropy and the partition function. Also, just for simplicity, I’ve made it a bistate system. (The same reasoning would apply if we wanted to have more than two possible energy states.)

(Also, I slip into using A for the free energy F … I wrote this a little while ago, in a different context.)

The classic Ising equation for a bistate system gives the (reduced) free energy, bar-A, as the enthalpy minus the entropy.
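For concreteness, a reduced bistate free energy of this general shape can be written out as follows (this is a sketch using the p-ln-p form of the entropy, rather than the ln-Omega form shown in the figure; x_1 and x_2 are the fractions of units in the two energy states, and the bar-epsilons are the reduced, per-unit energies):

\bar{A} = \bar{H} - \bar{S} = (x_1 \bar{\epsilon}_1 + x_2 \bar{\epsilon}_2) + (x_1 \ln x_1 + x_2 \ln x_2), \qquad x_1 + x_2 = 1 .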

Understanding now what we mean by these “bar” terms, let’s go back to our original equation. A “bar” over anything means:

  1. It is dimensionless in terms of energy, temperature, and other unit-measures often found in physics equations, and
  2. It has the dimensions of total numbers of units removed as well; we’re now just talking in terms of fractions of units in various possible energy states.

Here’s our final, ultra-simple result:

\bar{F} = \bar{H} - \bar{S}.
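To make this concrete, here is a minimal Python sketch (my own illustration; the function name and the example numbers are made up, and the energies are assumed to already be divided by kT) that computes the reduced free energy of a simple bistate system from the fractions of units in each state:

import numpy as np

# Reduced free energy F_bar = H_bar - S_bar for a small system.
# eps_bar holds per-unit energies already divided by kT, so everything
# here is dimensionless.
def reduced_free_energy(x, eps_bar):
    x = np.asarray(x, dtype=float)          # fractions of units per state; must sum to 1
    eps_bar = np.asarray(eps_bar, dtype=float)
    h_bar = np.sum(x * eps_bar)             # reduced enthalpy
    s_bar = -np.sum(x * np.log(x))          # reduced entropy, -sum p ln p
    return h_bar - s_bar

# Example: half the units in each of two states, with assumed reduced
# energies of 0 and 1. Result: 0.5 - ln 2, or about -0.19.
print(reduced_free_energy([0.5, 0.5], [0.0, 1.0]))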


What This Really Means

What the previous equation just told us, if we put it into words (and understand that we’re talking about totally dimensionless and system-size-free values), is this:

The (reduced) free energy of a system is its (reduced) enthalpy minus its (reduced) entropy.

Gosh, that’s awful simple, isn’t it?

And there’s a LOT of meaning packed into that very basic little equation.

As I’ve mentioned before (many times), free energy minimization is one of the great driving forces of the universe. It’s right up there with gravity.

We use free energy minimization not only to understand natural processes, but also to play certain more advanced mind-games in machine learning.

Thus, maybe we need to ask ourselves: what’s going on when we see entropy all by its lonesome self?

In fact, how can entropy possibly occur on its own? Doesn’t there have to be some enthalpy term, someplace?

Well, yes, there does.

What happens is a bit sneaky.

I mean, brilliant. But also just a bit sneaky.

To understand this, let’s look at one of the most famous instances of entropy-seemingly-on-its-own: Claude Shannon’s work on information theory.


Entropy and Information Theory

Claude Shannon, the founder of information theory, made the brilliant and intuitive leap from thermodynamics to representing information likelihoods.

Here’s the essence: Shannon developed a unique insight into determining the likelihood of which letter (or code unit) would occur next in a sequence of letters, based on its likelihood of occurrence given what had just previously occurred (Bayesian probability).

Ignoring for now all the mathematical derivations, what is important is: if we’re going to make an error-free identification of some element in a sequence, we need to know what led up to that element.

Shannon’s work was a brilliant fusion of both statistical mechanics (e.g., entropy theory) and also Bayesian probability thinking. (See my previous post, A Tale of Two Probabilities, for a discussion on probabilities from two different perspectives; statistical mechanics and Bayesian.) These are enormously different animals, and Shannon was the first to bring them together into a meaningful whole.

"Shannon's

The notion of fusing these two different schools of thought into a single formalism required huge mental agility; it meant creating an intellectual hybrid – sort of like breeding a wolfdog, only more unusual.

Now, the thing that we need to pull out of all the complex mathematics is that Shannon worked with the probability that a certain letter would occur, given a prior sequence. For example, if the prior sequence was “gh,” then one likely next letter would be “t,” as we’d find in words like “night” and “weight.”

However, another possible next-letter would be a blank space, given that we also have the English words “nigh” and “weigh.”

Implementing Shannon’s ideas – which were crucially important as they were developed during WWII – took a lot of manual counting of letter combinations; bigrams, trigrams, and more.

At the end, there was a probability for every single N-gram of interest; these probabilities would help with the “noise reduction” process, the first main application of Shannon’s thinking.

Think about it. We’d have to use that entropy equation that we had above – the one involving p_i ln(p_i).

Suppose that we just worked with trigrams, and let’s say that our set of base elements was 26 letters plus a “space,” for 27 possible units. Then there would be 27x27x27 distinct possible combinations (each of which needed a probability assignment), for a total of 19,683 possibilities.

Not all of these would be likely, so we’d have to pad the numbers a bit. (Remember, we couldn’t assign a probability of zero, because then we’d have to take the log of zero, which is undefined. So we’d have to give the non-occurring combinations some very, VERY small chance, just so as not to blow up the mathematics.)
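To make that bookkeeping concrete, here is a minimal Python sketch (my own illustration, not Shannon’s actual procedure; the alphabet handling, the padding constant, and the sample sentence are all made up for the example). It counts trigrams over the 27-symbol set, pads unseen trigrams with a tiny probability, and computes the p ln p sum:

import math
from collections import Counter
from itertools import product

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # 26 letters plus a space
TINY = 1e-12                               # assumed floor probability for never-seen trigrams

def trigram_entropy(text):
    text = "".join(ch for ch in text.lower() if ch in ALPHABET)
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    entropy = 0.0
    # Loop over all 27**3 = 19,683 possible trigrams.
    for tri in map("".join, product(ALPHABET, repeat=3)):
        p = counts.get(tri, 0) / total if total else 0.0
        p = max(p, TINY)                   # crude padding; keeps ln(p) finite
        entropy -= p * math.log(p)         # -sum p ln p, in nats
    return entropy

print(trigram_entropy("the night we weighed the weight of the night"))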

The question really is: if all these different possible combinations exist (or are mathematically unlikely, as is the case for many), how did these probabilities get established in the first place?

The answer is: the free energy equation already got minimized in the course of humans evolving a particular language.

Now, isn’t that a weird thought?

But if we ponder for a moment, it makes sense.

The native language speakers for each language evolved some sort of implicit enthalpy function; it was enough to skew the probabilistic likelihood for each bigram (and trigram, and quadgram, … ) away from an even distribution (which would have maximized entropy) and towards a very language-specific set of probabilities.
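As a rough point of reference for that “even distribution” remark: a perfectly uniform distribution over the 27 symbols would give the maximum per-symbol entropy,

- \sum\limits_{i} p_i \ln p_i = \ln 27 \approx 3.30 ,

or about 3.30 nats (roughly 4.75 bits) per symbol; any language-specific skew in the probabilities pulls the entropy below that ceiling.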

When we come across an entropy term that seems to exist apart from the whole free energy equation, it means that the free energy DOES exist; it’s already been minimized, and we’re reverse-engineering to find the entropy.

There was indeed an enthalpy function at work; it was just so hugely complex that we’d NEVER be able to write it down.

Instead, for each language, what we’ve got is a uniquely characteristic set of N-grams. These come about from a language-system that is already free-energy-minimized.

What we’re doing is reverse-engineering to find the entropy, or the specific set of probabilities that are associated with the free energy minimized state.


Summing It Up

Those of us doing work in machine learning and/or information theory come across an entropy-like formalism in all sorts of situations. We typically have one of two possible situations:

  1. There IS a complete free energy equation, or a model of a free energy equation, and free energy minimization is part-and-parcel of what we’re doing. This is an element of energy-based machine learning models, of adaptive (approximate) Bayesian reasoning (otherwise called variational Bayes), and related areas.
  2. Any mention of free energy (and entropy’s sister function, enthalpy) is surprisingly missing from the conversation. In this case, there STILL IS a free energy – it’s just already been minimized. We don’t write down the enthalpy function, because we don’t know it and it’s likely too complex. Instead, we’re reverse-engineering to get at the entropy, which we use as a measurement tool in various ways.

Just getting our heads around this much is a significant aspect of mastering the Force.

Just because we don’t see something, that doesn’t mean it isn’t there. It can be implicit, lurking in the background.

What we’re after is building a solid, sustainable foundation for working with the advanced machine learning methods – the sort that we do after we’ve played with all the simpler methods, including things like convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) networks. In short, anything that we can train using backpropagation.

We’re at that point of understanding how entropy connects with probability theory. We’re also ready to handle free energy just a bit more.

This is an important milestone in moving deeper into energy-based machine learning.

Live free or die, my friend –

AJ Maren

Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War


A Few Resources

  • Nice little information entropy video by Khan Academy; easy and only seven minutes!
  • A good description of information theory is in the Wiki on information theory.
  • My own reading list: Readings: Statistical Physics and Information Theory – not everything is filled in, and some of it is VERY advanced (not good as intro reads). However, it contains the all-important links to Shannon’s major paper, as well as his “Prediction and entropy of printed English,” the original Kullback-Leibler paper, and several other good ones. Bon appetit!

