The Surprising (Hidden) “Gotcha” in This Energy Equation:
A couple of days ago, I was doing one of my regular weekly online “Synch” sessions with my Deep Learning students. In a sort of “Beware, here there be dragons!” moment, I showed them this energy equation (the restricted Boltzmann machine energy function, Eqn. 6) from the Hinton et al. (2012) IEEE Signal Processing Magazine review paper on acoustic modeling for speech recognition.
One of my students pointed out, “That equation looks kind of simple.”
Well, he’s right.
And I kind of bungled the answer, because – yes. Taken just by itself, that equation really doesn’t look so bad.
Where the Real Challenge Lies
The challenge is not that equation. It is also not in using it for learning in the restricted Boltzmann machine (RBM).
The challenge is in the connection between energy-based machine learning and the real statistical mechanics.
There are similarities.
There are also differences.
Energy equations are used as a metaphor in machine learning – as a very strong metaphor.
Still, in machine learning, these equations really are just metaphors. Thus, it helps us if we can distinguish between the two realms: machine learning and statistical physics.
Energy and Free Energy – Not the Same Thing!
The first thing that we need to grasp is an important distinction in terms: there is energy, and then there is free energy.
They are not the same thing.
Sometimes, you’ll see authors write an equation like the one in the first figure, which I’ll reproduce here for convenience. This is the energy equation for the restricted Boltzmann machine.
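In case the figure is not in front of you, the restricted Boltzmann machine energy function is usually written in the following standard form. The symbols a(i) and b(j) for the unit biases and w(i,j) for the connection weights follow the common convention; treat the notation as an assumption rather than a copy of the figure.

```latex
E(\mathbf{v}, \mathbf{h}) = -\sum_{i} a_i v_i \;-\; \sum_{j} b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij}
```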
This gives the energy-per-unit (the two sums that have either v(i) or h(j) in them, respectively) and also the energy-per-interaction (the sum that has a pairwise combination of both v(i) and h(j)).
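To make the three sums concrete, here is a minimal Python sketch that evaluates the standard restricted Boltzmann machine energy, E(v, h) = -sum a(i) v(i) - sum b(j) h(j) - sum v(i) h(j) w(i,j), for one configuration. The function name, the bias names a and b, the weight matrix w, and all the numbers are hypothetical choices for illustration only.

```python
def rbm_energy(v, h, a, b, w):
    """Energy of one joint configuration (v, h) of a small RBM.

    v, h: lists of 0/1 unit states; a, b: per-unit bias terms;
    w[i][j]: weight between visible unit i and hidden unit j.
    All names and values here are illustrative assumptions.
    """
    # Energy-per-unit terms: one sum over visible units, one over hidden units.
    visible_term = -sum(a[i] * v[i] for i in range(len(v)))
    hidden_term = -sum(b[j] * h[j] for j in range(len(h)))
    # Energy-per-interaction term: every visible/hidden pair.
    interaction_term = -sum(
        v[i] * h[j] * w[i][j] for i in range(len(v)) for j in range(len(h))
    )
    return visible_term + hidden_term + interaction_term


# Hypothetical toy parameters: 2 visible units, 2 hidden units.
a = [0.5, -0.5]
b = [1.0, 0.0]
w = [[1.0, -1.0],
     [0.5, 2.0]]

energy_on = rbm_energy([1, 0], [1, 1], a, b, w)
energy_off = rbm_energy([0, 0], [0, 0], a, b, w)
print(energy_on, energy_off)
```

Notice that the function takes one specific on/off configuration and returns one number. Nothing in it says how likely that configuration is, which is exactly why there is no entropy here.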
You’ll notice that there is no entropy in this formulation.
This is because entropy (a statement about the probabilistic distribution of units) is not a part of the energy equation.
Entropy is a part of free energy, and so is the energy (enthalpy) of a system.
Two Different Ways to Express the Free Energy
The free energy can be viewed in two different ways.
You can see both the energy (enthalpy) and entropy components in each expression. At the same time, they show up very differently.
The free energy expression that is most common in machine learning circles gives the free energy in terms of the partition function, as shown in the following equation, where Z is the partition function, and E(i) is the energy of the i-th microstate (one complete configuration) of the system.
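Written out in the standard textbook form, that expression is as follows. Here k is Boltzmann’s constant and T is the temperature; machine learning papers commonly set kT = 1, so you will often see just F = -ln Z.

```latex
F = -kT \ln Z, \qquad Z = \sum_{i} \exp\!\left(-\frac{E_i}{kT}\right)
```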
This is a classic, fundamental equation. You can see it in all statistical mechanics books and teaching materials. (R. Salzman has an excellent online tutorial.) You will see the very same equation in all introductions and tutorials on energy-based machine learning, such as this tutorial by Bengio, and also in this Deep Learning introduction to Restricted Boltzmann Machines (RBM).
The very same free energy can also be expressed as a linear combination of enthalpy (which is the physical chemist’s term for energy) and entropy, as shown in the following equation.
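In its usual textbook form, that linear combination reads:

```latex
F = E - TS
```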
In this equation, F is still the same free energy, E is the total energy (enthalpy) of the system (which is NOT the same thing as the free energy, as we’ve been discussing all along), S is the entropy, and T is the temperature.
Very often, when physicists or physical chemists use this equation to model systems, we’ll divide through by T, N, and k (Boltzmann’s constant, which is a part of the entropy term, but not shown in the equation above). (N is not shown either; it is subsumed into the F and E terms.) This gives us something called a reduced expression, which is both dimensionless and independent of the number of units involved.
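As a sketch of what the reduced expression looks like, dividing F = E - TS through by NkT gives the following. The bar notation is an illustrative shorthand, not something taken from the figures.

```latex
\bar{F} \equiv \frac{F}{NkT} = \frac{E}{NkT} - \frac{S}{Nk} = \bar{E} - \bar{S}
```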
The two equations are related, of course. Just not in a simple way. We’ll go through that derivation in the book, but not here.
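Even without the derivation, you can check numerically that the two forms agree. The sketch below assumes that the E in F = E - TS is the average energy under the Boltzmann distribution, p(i) = exp(-E(i)/kT) / Z, and that S is the Gibbs entropy, S = -k sum p(i) ln p(i). The energy values and function names are made up for illustration.

```python
import math


def free_energy_from_partition(energies, kT=1.0):
    """F = -kT ln Z, with Z summed over the listed microstate energies."""
    z = sum(math.exp(-e / kT) for e in energies)
    return -kT * math.log(z)


def free_energy_from_energy_and_entropy(energies, kT=1.0):
    """F = <E> - T S, using Boltzmann probabilities for each microstate."""
    z = sum(math.exp(-e / kT) for e in energies)
    probs = [math.exp(-e / kT) / z for e in energies]
    mean_energy = sum(p * e for p, e in zip(probs, energies))
    entropy_over_k = -sum(p * math.log(p) for p in probs)  # this is S / k
    return mean_energy - kT * entropy_over_k


# Hypothetical microstate energies, in units where kT = 1.
energies = [0.0, 0.5, 1.0, 2.0]
f_partition = free_energy_from_partition(energies)
f_linear = free_energy_from_energy_and_entropy(energies)
print(f_partition, f_linear)
```

The two printed numbers should match to within floating-point rounding, for any list of energies and any positive kT.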
Why the Free Energy is So Gosh-Darned Important!
The free energy is important to both physicists specializing in statistical mechanics and to machine learning people. The reason is simple.
Free energy minimization means that a system has come to equilibrium; it is in a stable state.
Natural systems seek out free energy minima all the time. If you perturb a system that is already at equilibrium, it will adjust itself until it reaches equilibrium once again. The equilibrium place may be different (because the system’s conditions have changed), but it is still an equilibrium.
This is the foundation for the metaphor that machine learning specialists use.
In machine learning, we construct an artificial system, made out of nodes that each have a specific energy, and sometimes also an energy for interacting with other nodes. Then we cause that system to seek a free energy minimum. This is the idea behind gradient-based training of energy-based models: each update nudges the system toward a minimum, much as a physical system relaxes toward equilibrium.
The key difference is that in a statistical mechanics-governed system, we are talking about something that is real; it exists in nature.
In machine learning, we’re dealing with something artificial; we’ve created this system and we’re causing it to follow nature-like laws.
There’s a difference in scale, also. In natural systems, the number of units involved in a statistical mechanics system is typically HUGE. In a machine learning system, the number of units may indeed be large – thousands or even more – but we can apply the algorithm just as well to a system composed of hundreds or even tens of units.
The Sneaky Way in which Entropy Shows Up in the Partition Function-based Free Energy Equation
In the equation given above, where free energy is given as a linear combination of energy (enthalpy) and temperature times entropy, it is easy to find the entropy. It’s staring us right in the face.
So … where is the entropy in the partition function version of the free energy equation?
Right … HERE:
Notice that little summation index, i?
It’s counting up over microstates.
Microstates are the big contribution to the entropy term. In fact, microstates are what entropy is all about.
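Here is a small Python sketch of that counting. It enumerates every on/off microstate of a hypothetical four-unit system and sums over them to get Z. To isolate the entropy, the energy of every microstate is set to zero; that is an assumption made purely for illustration.

```python
import itertools
import math


def partition_function(n_units, energy_fn, kT=1.0):
    """Sum exp(-E/kT) over every on/off microstate of n_units binary units."""
    microstates = list(itertools.product([0, 1], repeat=n_units))
    z = sum(math.exp(-energy_fn(state) / kT) for state in microstates)
    return z, len(microstates)


# Every microstate has zero energy, so only the counting survives.
z_flat, n_states = partition_function(4, lambda state: 0.0)
f_flat = -1.0 * math.log(z_flat)  # F = -kT ln Z with kT = 1
print(n_states, z_flat, f_flat)
```

With all energies at zero, Z is simply the number of microstates, and F = -kT ln Z reduces to -T times k ln(number of microstates). That is the entropy term, arriving entirely through the summation index.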
I have a big, long, eight-day tutorial about microstates that is essential if you’re studying energy-based models. You can access it by going to the book.
The BOOK. RIGHT HERE.
(If you’ve already signed up for this series, either on the BOOK page or on the Précis page, then you’re good. You don’t need to do this again.)
If you are doing this, then FOLLOW THROUGH; be sure that you move the daily emails that I’ll be sending you over to your primary folder. There’s a crucial one that you get as soon as you sign up, and then (starting about Day 3), a series of emails that take you to either very RICH Content Pages OR to full-on tutorial slidedecks.
At the end of this, you WILL understand microstates, and also the partition function. And you’ll be very close to understanding entropy.
What Free Energy Minimization Means (in Practical Terms)
The short answer is: when we minimize the free energy of a system, we’re finding a balance – a trade-off – between maximizing the entropy and minimizing the enthalpy (energy).
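A toy illustration of that trade-off, as a sketch only: take units that can each be either “off” (energy 0) or “on” (energy eps), and let p be the fraction that are on. The energy alone is lowest at p = 0, the entropy alone is highest at p = 0.5, and the free energy minimum lands in between. The values of eps and kT, and the function name, are assumptions for illustration.

```python
import math


def two_state_free_energy(p, eps=1.0, kT=1.0):
    """Free energy per unit when a fraction p of units is in the 'on' state."""
    energy = p * eps
    entropy_over_k = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    return energy - kT * entropy_over_k


# Scan p over a fine grid (endpoints excluded, since log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
p_best = min(grid, key=two_state_free_energy)

# The Boltzmann prediction for eps = kT = 1: exp(-1) / (1 + exp(-1)).
p_boltzmann = 1.0 / (1.0 + math.exp(1.0))
print(p_best, p_boltzmann)
```

The grid search and the Boltzmann formula should agree to within the grid spacing, which is one way to see that free energy minimization is what produces the equilibrium distribution.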
This is something that I’ll be covering in depth in the book.
So just as a quick note: in machine learning, we’re going to adjust the on/off-ness of the various nodes. We’ll do this with the goal of accomplishing two objectives, simultaneously:
- Free energy minimization – we get a stable, equilibrium system, AND
- Node connections give us desired (trained) results.
If you’re at all familiar with training the Boltzmann machine, and have heard of the Contrastive Divergence training algorithm, then you know that it works in two alternating parts. (So do other expectation maximization / (free) energy minimization algorithms.) That’s because the algorithm is achieving two goals at once.
What it also means is: we need to do more than simply minimize the energy equation. We’re minimizing free energy. That means that we’re in this delicate, complex, subtle, arcane, esoteric dance involving both entropy and energy.
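For those who like to see the two parts in code, here is a minimal sketch of a single-step Contrastive Divergence (CD-1) update in plain Python. One common way to read the “two parts” is as a data-driven phase (visible units clamped to a training vector) followed by a model-driven reconstruction phase; that reading, the toy data, the learning rate, and every function name below are assumptions for illustration, not a reference implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    """Draw a 0/1 state for each unit from its 'on' probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, b, w):
    return [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, a, w):
    return [sigmoid(a[i] + sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(a))]


def cd1_update(v0, a, b, w, lr, rng):
    """One CD-1 step on a single binary training vector v0 (updates in place)."""
    # Part 1 (data-driven): hidden activity with visible units clamped to data.
    ph0 = hidden_probs(v0, b, w)
    h0 = sample(ph0, rng)
    # Part 2 (model-driven): one reconstruction step away from the data.
    v1 = sample(visible_probs(h0, a, w), rng)
    ph1 = hidden_probs(v1, b, w)
    # Move parameters toward the data statistics and away from the model's.
    for i in range(len(a)):
        a[i] += lr * (v0[i] - v1[i])
        for j in range(len(b)):
            w[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])


rng = random.Random(0)
n_visible, n_hidden = 4, 2
a = [0.0] * n_visible
b = [0.0] * n_hidden
w = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]

# Hypothetical training set: two complementary binary patterns.
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(200):
    for v in data:
        cd1_update(v, a, b, w, lr=0.1, rng=rng)
```

Notice that the energy equation never gets minimized directly. The update compares statistics gathered from the data with statistics gathered from the model’s own samples, and the sampling is where the probabilistic (entropy) side of the story enters.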
What This Means to You
It may have been clear that we were doing some real statistical mechanics in this blog post.
It’s probably also been obvious that this has been a very fairy-kissed approach to the whole thing.
We have some crucial equations, but no derivations. No code. A lovely overview, but not deep.
(Darlings, I’m a girl. I do get to sprinkle in some fairy-kisses from time to time! And feather boas, and sparkly glitter. Just for fun.)
So … what this means to you.
At some point, you know that you’ll have to tuck into some serious equations. Preferably, equations + code, where you can see the correspondence.
Ideally, you’d be taking a full quarter-length course in energy-based learning methods with me.
Ideally, I’d have the book written, and all sorts of lovely slidedecks, YouTube vids, code, and other learning aids.
Give it a year, and I will. (You’re signed up for the BASIC email Opt-In list, aren’t you? That’s the one on the Right-Hand Sidebar. That will let you know about each weekly blog post, and ALSO significant new releases.)
Give it a year (or a little more), and we WILL have that energy-based models course in place at Northwestern.
However … this may not solve your immediate problem.
The (Ahem!) Call to Action
If you’ve gotten this far, then it’s more than casual curiosity. You’re clear that you need to go deeper.
This means that you’re at the point where you need an oracle, a guide.
If you were one of the great heroes of ancient Greece, you would have consulted the Oracle at Delphi before going off to do battle.
Lycurgus of Sparta asked for guidance from the Oracle. According to history, “The Oracle told Lycurgus that his prayers had been heard and that the state which observed the laws of Lycurgus would become the most famous in the world.”
If you’re a great Star Wars fan, it’s not just the fabulous cinematics and special effects.
It’s that you resonate with being a hero.
You ARE Luke Skywalker.
And at some point, before Luke goes off to face the Death Star and his father, Darth Vader, he consults with his oracles – he gets guidance. Partly, from his sister Leia. Partly, from Obi-Wan Kenobi and from Yoda.
The very last thing you would have done would have been to rush into battle without guidance, and without first studying from the best.
So, if you are embarking on a (long, lonely, arduous, challenging) journey into machine learning, you may need a bit of time with your own oracle.
- Private Office Hours,
- Private (Small Group) Short Course,
- Corporate (Your Site) Short Course.
If you think that any of these options is what you need, email me.
We’ll figure out what’s right for you.
Use firstname [at] domainname [dot] com,
where firstname = alianna, and domainname = aliannajmaren.
Live free or die, my friend –
Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War
P.S. – St. Crispin’s Day
St. Crispin’s Day (Oct. 25) was yesterday.
Back in August, in the Statistical Mechanics of Machine Learning blogpost, I told you that we’d start to open up some options, beginning on this day.
Well, my dear, the day has come.
I would tell you that you don’t need to email me right away.
The truth is, though, that maybe you do.
I can only work with a few people at a time.
Partly, it’s the time involved.
More than anything else, it’s that if you ask me a question, I’m not just going to fluff something off the top of my head.
You wouldn’t just drop in on the famed Oracle of Delphi, would you?
You’d know that if she was going to answer your question, she’d have to go pretty deep into a state of knowing and intuiting what you needed, and access some pretty deep wisdom.
This isn’t just a coffee date.
Very likely, whatever you ask me will require that I do some very deep digging, and putting together materials that will help us go over things together.
I don’t charge for the prep or the materials. Just for my time.
At the same time, it’s this background work that means that I can only work with a very small number of individuals at any one time.
If you think that you want to be – that you need to be – one of those very few, then email me now.
We can figure out if this is what you need, and start scheduling time for you.
Earlier being better than later, my friend.
References – Deep Neural Networks (Energy-Based Models)
- Hinton, G., et al. Deep neural networks for acoustic modeling in speech recognition. (Four research groups share their views.) IEEE Signal Processing Magazine 29 (6), 82–97 (November 2012). doi:10.1109/MSP.2012.2205597. Dr. A.J.’s Comments: This is by no means an easy read; while at a summary / review level, it presupposes that the reader already understands the field very well. However, for someone who is learning neural networks and deep learning, it is worth reading and rereading this paper several times. Take note, please: Eqn. 5 is a partition function, and Eqn. 6 is the energy function for a restricted Boltzmann machine (RBM); both of these require a bit of statistical mechanics. I give people a starting path for learning that in Seven Essential Machine Learning Equations: A Précis. You’d have to Opt-In and follow the procedures; that leads you into an eight-day tutorial sequence on microstates, which are foundational to partition functions, which are foundational to energy-based models.
- Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2 (1), 1–127 (2009). Very valuable resource.
Good Online Tutorials
- Deep Learning 0.1 Documentation Restricted Boltzmann Machines (RBM)
Some Useful Background Reading on Statistical Mechanics
- Salzman, R. Notes on Statistical Thermodynamics – Partition Functions, in Course Materials for Chemistry 480B – Physical Chemistry, 2004. Statistical Mechanics (chapter). Online book chapter. This is one of the best online resources for statistical mechanics; I’ve found it to be very useful and lucid.
Previous Related Posts
- Seven Essential Machine Learning Equations: A Cribsheet (Really, the Précis),
- Statistical Mechanics of Machine Learning blogpost – the great “St. Crispin’s Day” introduction of statistical mechanics and machine learning.