Machine Learning: Multistage Boost Process

Three Stages to Orbital Altitude in Machine Learning

Antares rocket explodes on launch, October 28, 2014. NASA image.

Several years ago, Regina Dugan (then Director of DARPA) gave a talk in which she showed a clip of epic NASA launch fails. Not just one, but many fails. The theme was that we had to risk failure in order to succeed with innovation.

This YouTube vid of rocket launch failures isn’t the exact clip that she showed (the “action” doesn’t kick in for about a minute), but it’s pretty close.

For some reason, that vid came to mind a few days ago – along with the realization that successfully getting anything into orbit requires not just one major blast-off event, but a series of events. We need our initial lift-off, and then a booster, and maybe even one more booster.

Getting to a functional level in machine learning is much like getting to orbital speed with a rocket launch. We need multiple boosters.

The challenge is not so much orbital altitude. It’s obtaining sufficient orbital speed.

As applied to those of us who are learning machine learning, “orbital speed” means that we have sufficient momentum so that we are not hugely bogged down (as in, nearly brought to a standstill) when we encounter the next equation.

The Saturn V moon rocket used three stages to get into position. Similarly, we need three stages to boost our machine learning understanding.



Statistical Mechanics for Machine Learning


What might be both surprising and comforting is that we don’t need to know all of statistical mechanics; we don’t need a year-long course in the subject. We just need some very specific concepts.

Building from microstates to free energy minimization gives you a solid foundation for understanding energy-based models in machine learning.

Here’s how the key concepts lead into one another:

  1. Microstates: This is the fundamental notion; it deals with counting the distinct configurations of a system that share the same total energy. (Energy, in this context, is a specified value that each individual node or unit has.) The very important thing to learn is how to count the microstates, whose number grows nonlinearly as units are added.
  2. Probabilities and the Partition Function: Both of these depend on energy and on microstate-counting. The partition function sums a Boltzmann factor over all of the microstates, and the probability for any given configuration (any one specific microstate) is that configuration’s Boltzmann factor normalized by the partition function. If we understand microstates, these two concepts come about very naturally and easily.
  3. Free energy: Free energy is a function of the partition function; if we have the partition function, then we have free energy. It helps to spend more time understanding how free energy can decompose into energy (enthalpy) and entropy terms. There are also some very interesting games to play with decomposing the free energy function in a different way; this leads directly into things like the Contrastive Divergence algorithm.
  4. Free energy minimization: It’s worthwhile spending time knowing how the free energy (in a couple of standard forms) behaves as a function of various energy parameters. It’s good to see how the actual gradient works out, as several algorithms work towards free energy minimization.
  5. Free energy application to neural networks: Once we have the free energy notion, we can understand the Hopfield, Boltzmann machine, and restricted Boltzmann machine neural networks. We can also work with energy-based models that use free energy minimization.
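
To make the chain from microstates to free energy concrete, here is a minimal sketch in Python. It is not taken from the slidedecks or the book; the toy system (four independent bi-state units, an "on" energy of `eps`, and units where Boltzmann’s constant is 1) and all of the function and variable names are assumptions chosen just for illustration. It counts microstates, builds the partition function two ways, and checks that the free energy from the partition function matches the energy-minus-entropy decomposition.

```python
import itertools
import math


def enumerate_system(n_units=4, eps=1.0, temperature=1.0):
    """Brute-force a toy system of n_units independent bi-state units.

    Assumptions for illustration only: each unit is either off (energy 0)
    or on (energy eps), and Boltzmann's constant is set to 1.
    """
    beta = 1.0 / temperature
    configs = list(itertools.product((0, 1), repeat=n_units))
    energies = [eps * sum(config) for config in configs]

    # Partition function: sum of Boltzmann factors over every configuration.
    z = sum(math.exp(-beta * e) for e in energies)

    # Probability of each specific configuration, normalized by Z.
    probs = [math.exp(-beta * e) / z for e in energies]

    # Free energy straight from the partition function.
    free_energy = -temperature * math.log(z)

    # The same free energy, decomposed into energy and entropy terms.
    mean_energy = sum(p * e for p, e in zip(probs, energies))
    entropy = -sum(p * math.log(p) for p in probs)

    return z, probs, free_energy, mean_energy, entropy


n_units, eps, temperature = 4, 1.0, 1.0
z, probs, free_energy, mean_energy, entropy = enumerate_system(
    n_units, eps, temperature
)

# Microstate counting: how many configurations have exactly n units "on".
microstate_counts = [math.comb(n_units, n) for n in range(n_units + 1)]
print("microstates per energy level:", microstate_counts)

# Grouping configurations by energy level gives the same partition function.
z_grouped = sum(
    count * math.exp(-n * eps / temperature)
    for n, count in enumerate(microstate_counts)
)
print("Z (enumerated):", z, "  Z (grouped by microstate count):", z_grouped)
print("F = -T ln Z:", free_energy)
print("U - T*S    :", mean_energy - temperature * entropy)
```

The brute-force enumeration only works for a handful of units (the number of configurations doubles with each unit added), which is exactly why the microstate-counting shortcut matters.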



Time Needed for the Second Boost Stage: Understanding Free Energy


If you’re coming into machine learning and do not have the benefit of a solid year of statistical mechanics behind you (i.e., you did not do graduate studies in either physics or physical chemistry), then you probably have a realistic concern about how long it will take to fill the gap.

Realistically, about three weeks. Three weeks, if you have the right teaching materials. Maybe two weeks, if you can get a few days off work and don’t have a social life.

These concepts are pretty abstract. They involve getting our brains into some very altered states in order to properly appreciate and internalize them.

That said, the amount of statistical mechanics that you really need to learn and internalize is very bounded. There are a lot of statistical mechanics concepts and applications that really are not at all important for doing energy-based models in machine learning.

So, while this box of new things to be learned is a box of some strange, awesome, and wondrous stuff – it is a pretty bounded box.

With the right teaching materials, it’s very doable.

The big advantage to you in learning this is that you achieve a certain velocity. That means, when you encounter an equation (say, the partition function or an energy equation) in the course of working your way through different machine learning methods, you can keep your speed up. You’re not going to lose your momentum and start a nosedive towards planet earth. You can keep going after your target.



Good Statistical Mechanics Teaching Materials


There are really two important things here:

  1. Microstates, and
  2. Everything else.

Really, I’m not trying to be cute or facetious with you. Microstates are a challenge because they involve understanding combinations of things; they’re a good application of what we learned in high school as “combinations and permutations.” By the time that you’re done with microstates, you’ll have gone from visualizing units in different activation states through a series of rather intimidating equations.

There is one really huge saving grace in all of this: once you’ve learned microstates, and have worked your way through to the final equations, they are rock-solid and will not change. For the rest of your life.

Especially when we work with systems where the units can be in only one of two distinct energy states (a bi-state system), we get microstates that are very much pre-defined. This means that our entropy is pre-defined. Our entropy equation (for a bi-state system) looks like an upside-down “U.” Free energy, when expressed in terms of energy (enthalpy) and entropy, has a negative sign in front of the entropy. That means that we’re looking at a normal “U” – one that has a distinct minimum. (This is something that really, really helps when we’re trying to minimize the free energy of a system; a useful free energy equation will have an energy expression that lets the minimum on the bowl-shaped negative-entropy term be easy to find.)
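
If you’d like to see the upside-down “U” for yourself, the short sketch below tabulates the per-unit entropy of a bi-state system as a function of the fraction `x` of units in the “on” state. The function name, the choice of Boltzmann’s constant equal to 1, and the grid of eleven points are my assumptions for this illustration.

```python
import math


def entropy_per_unit(x):
    """Entropy per unit (k_B = 1) when a fraction x of bi-state units is on."""
    if x <= 0.0 or x >= 1.0:
        return 0.0  # x ln x goes to 0 as x goes to 0
    return -(x * math.log(x) + (1.0 - x) * math.log(1.0 - x))


fractions = [i / 10 for i in range(11)]
entropies = [entropy_per_unit(x) for x in fractions]

# The S column rises and then falls (upside-down U); the -S column is the bowl.
for x, s in zip(fractions, entropies):
    print(f"x = {x:.1f}   S = {s:.4f}   -S = {-s:.4f}")

# The entropy peaks when half the units are on and half are off.
peak = max(fractions, key=entropy_per_unit)
print("entropy is largest at x =", peak)
```

The negative-entropy column is the bowl-shaped term; on its own, its minimum sits at the half-and-half point, and it is the energy term that moves the minimum somewhere more interesting.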

Thus, once we’ve learned microstates, and thus learned entropy (because entropy depends on microstates), we have our entropy term for the system that we’re working with, and it doesn’t change.
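
For readers who want to see why the entropy follows directly from microstate-counting, here is a compact version of the standard derivation for a bi-state system. The symbols are my notation for this sketch: $N$ is the total number of units, $n$ the number in the “on” state, $x = n/N$, and $k$ is Boltzmann’s constant. The only approximation is Stirling’s, which holds for large $N$.

```latex
\begin{align}
\Omega(n) &= \frac{N!}{n!\,(N-n)!}
  && \text{(number of microstates with $n$ units on)} \\
S &= k \ln \Omega
  && \text{(entropy from microstate count)} \\
\ln N! &\approx N \ln N - N
  && \text{(Stirling's approximation)} \\
\ln \Omega &\approx N \ln N - n \ln n - (N-n)\ln(N-n) \\
S &\approx -N k \left[\, x \ln x + (1-x)\ln(1-x) \,\right],
  \qquad x = \frac{n}{N}
\end{align}
```

Nothing in the last line depends on any energy parameter, which is why the entropy term stays fixed once the microstates are understood.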

In contrast, the energy term does have one or more parameters that we can play with; that’s what makes it interesting and useful.

By the time that we’re thinking about free energy minimization, we’re thinking about the interplay between energy (enthalpy) and entropy, and it’s nice to know that part of this whole matter is stable. It’s the parameters in the energy (enthalpy) term that make life interesting.
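
Here is a small sketch of that interplay. It assumes the simplest possible energy term, `eps * x` (a single energy parameter `eps` times the fraction `x` of units that are on), with Boltzmann’s constant set to 1; real energy-based models use richer energy expressions, so treat the names and the functional form as illustrative assumptions. The code finds the free energy minimum on a grid and compares it with the closed-form answer for this particular energy term.

```python
import math


def free_energy_per_unit(x, eps, temperature=1.0):
    """Free energy per unit: assumed linear energy term minus T times entropy."""
    entropy = -(x * math.log(x) + (1.0 - x) * math.log(1.0 - x))
    return eps * x - temperature * entropy


def minimizing_fraction(eps, temperature=1.0, steps=9999):
    """Grid search over 0 < x < 1 for the fraction that minimizes free energy."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda x: free_energy_per_unit(x, eps, temperature))


results = []
for eps in (-1.0, 0.0, 1.0, 2.0):
    x_grid = minimizing_fraction(eps)
    # Setting dF/dx = 0 for this energy term gives x = 1 / (1 + exp(eps / T)).
    x_exact = 1.0 / (1.0 + math.exp(eps))  # temperature = 1
    results.append((eps, x_grid, x_exact))
    print(f"eps = {eps:+.1f}   grid minimum x = {x_grid:.4f}   "
          f"closed form x = {x_exact:.4f}")
```

The entropy part of the function never changes from one run to the next; only `eps` does, and that alone is enough to slide the minimum back and forth along the bowl.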

That said, we need some teaching materials.

As I got more clear about the relationship between statistical mechanics and energy-based models in machine learning, I started looking around for teaching materials. This was mostly because I needed to re-educate myself.

Bluntly put, I didn’t find much that was helpful in the way of microstates. Most of the discussions were way too brief. If I was having a slow time re-teaching myself – in a subject where I already had taken graduate-level classes – how much worse was it going to be for someone who didn’t have that background?

That was one of my big motivations to put all my previous writing projects on hold, and focus on developing a single book that would make the connection between statistical mechanics and machine learning much more clear.

The book is not ready. Not by a long shot. Right now, it’s barely embryonic.

However, as a lead-in (while developing the Précis), I also put together a set of teaching materials on microstates. I’ve got them distributed across an eight-day tutorial series; this is a series of follow-up emails that you receive when you Opt-In on my book page. You get the same Opt-In on the blog page that introduces the Précis; Seven Essential Machine Learning Equations: A Cribsheet (Really, the Précis). The emails take you to Content Pages. Three of those Content Pages have links to slidedecks on microstates. (You get access to the first one with the very first follow-up email; the one that you get as soon as you Opt-In. That takes you to the same Content Page that has the link for the Précis.)

I don’t have all of the final equations in for microstates, but I give you a very solid, gently-ramped approach. The equations would just encapsulate what you’d have learned via the slidedecks, and lead to a more formal statement. The microstates slidedecks also take you through probabilities and the partition function, for each of several examples. This means that in just a week and a day, you’ve got the microstates, partition function, and probability equations nailed.

Actually, I need to write just a bit more … so when I’ve got that done, you’ll need about a week and a half.

The rest is connecting all of this with free energy, and playing with free energy until the connection to classic and more recent neural networks and energy-based models becomes clear.

If we had the right teaching materials … about another week and a half.

Sadly, those materials are not ready yet. (If I’m lucky, I’ll spend a lot of Christmas vacation writing them, and will come back and update this blog post in January, 2018.)

Until then, a solid regular statistical mechanics text can help you bridge the gap.

I recently came across a text by Prof. Claudine Hermann, Statistical Physics – Including Applications to Condensed Matter. The parts that are readable, are very readable. The parts that are not very readable (including most of the equations) would be just downright depressing if you’re not a physicist.

If you were using this text to self-study, I’d recommend reading:

  • Chapter 1: Statistical Description of Large Systems
    • Sect. 1.2: Classical Probability Density ; Quantum Density Operator (read for the historical sections, don’t worry about the equations)
    • Sect. 1.3: Statistical Postulates; Equiprobability
    • Sect. 1.4: General Properties of the Statistical Entropy
  • Chapter 2: The Different Statistical Ensembles. General Methods
    • Sect. 2.4: System in Contact with a Heat Reservoir, “Canonical Ensemble”

I list two more sources at the end of this blog; the one by David Tong is similar (in terms of approach and equations) to the one by Claudine Hermann. It also has some sections that give context, and even a bit of humor, and is certainly worth a look. (I used Tong’s descriptions of microstates while re-developing my own understanding.) I always find the very succinct presentation by R. Salzman to be very helpful, and refer to that often.

I also wrote a Technical Report summarizing some basic statistical mechanics; Statistical Thermodynamics: Basic Theory and Equations. It was an exercise in self-education at the time; I was reading two very good statistical mechanics textbooks, and cross-correlating. However, this again is probably too abstract for people trying to self-educate.

All in all, I still think that these texts are way too much in the realm of “physicists writing for physicists.” They’re just hard to understand if you’re not particularly interested in the physical chemistry approach that underlies these books / monographs.

This is why I’m writing the book. It’s not to dumb things down. The core equations that we really need have to be there.

I just think that there are simpler ways to get at this material, and I’m working on them.



Live free or die, my friend –

AJ Maren

Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War


References – Deep Neural Networks (Energy-Based Models)


  • Bengio, Y. Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2 (1), 1–127 (2009). Very valuable resource.



Some Useful Background Reading on Statistical Mechanics


  • Hermann, C. Statistical Physics – Including Applications to Condensed Matter (New York: Springer Science+Business Media), 2005. Very well-written; however, for someone who is NOT a physicist or physical chemist, the approach may be too obscure.
  • Maren, A.J. Statistical Thermodynamics: Basic Theory and Equations, THM TR2013-001(ajm) (Dec., 2013).
  • Salzman, R. Notes on Statistical Thermodynamics – Partition Functions, in Course Materials for Chemistry 480B – Physical Chemistry, 2004. Online book chapter. This is one of the best online resources for statistical mechanics; I’ve found it to be very useful and lucid.
  • Tong, D. Chapter 1: Fundamentals of Statistical Mechanics, in Lectures on Statistical Physics (University of Cambridge Part II Mathematical Tripos), Preprint (2011).

