A “Hidden Layer” Guiding Principle – What We Minimally Need
Putting It Into Practice:
If we’re going to move our neural network-type architectures into a new, more powerful realm of AI capability, we need to bust out of the “sausage-making” mentality that has governed them thus far, as we discussed last week. To do this, we need to give our hidden layer(s) something to do besides respond to input stimulus.
It’s very realistic that this “something” should be free energy minimization, because that’s one of the strongest principles in the universe (along with the law of gravity, the law of attraction, etc.).
The notion that we’re playing with, then, is that we’re adding another processing capability to the current two, for a total of three major operational modes:
- Pattern recognition training: This, or any other form of training, will use some gradient descent algorithm; backpropagation, energy-based minima finding, or any of the other algorithms popular in deep learning circles these days,
- Pattern recognition: This is where the network does what it’s been trained to do; this can be a classification task, autoencoding, or anything else that requires response to an input stimulus, and
- Free association: This is actually a new operational realm, which may include many things besides free association; the key is that the hidden layer is no longer locked into responding to input stimulus.
From last week’s post on How Getting to a Free Energy Bottom Helps Us Get to the Top, we’re recollecting a figure that helps describe what we’re thinking about.
This figure shows what seems to be a classic neural network architecture, embedded in the broader notion of external and internal states (those are the ones at the top and bottom, denoted by the Greek letter psi, which looks like a pitchfork), using notation and a world-view advanced by Karl Friston (see How to Read Karl Friston, in the Original Greek for links to some of his core papers). Further, the input (bottom) and output (top) layers have the notation of being Markov blankets, with the outputs being the action layer and the inputs being the stimulus layer. For our purposes, we’ll just think of these Markov blankets as separations between the internal world (which is where we’re constructing a model) and the external world (which is the subject that we’re modeling).
In addition, the hidden layer is shown as being two dimensional, which is not that new or different – we can configure it to be in any topography that we want; same as we can do with the input and output layers. The hidden layer is composed of units in either state A or state B; this is also nothing new.
What is new and different about this hidden layer is the specific arrangement of the nodes in a sort of overlapping-brickwork kind of pattern. This is different from the usual (more expected) square pattern – the kind that we’d envision if we were mapping an input layer of image pixels. Thus, we’ve got this “skew” going on in our 2-D grid.
This is important; it’s not just an oddity in the graphics. However, we’ll defer detailed discussion of why this arrangement is useful; we’ll pick up on this next week.
Right now, let’s just imagine that we’ve put together a network with this basic configuration and trained it to do a task; classification would be a standard task to use in this example. This means that we’d have done processes (1) and (2) in the list above; we’d have (1) trained the network, and then (2) done some classification.
Let’s further assume that we’ve been using an energy-based approach (e.g., simulated annealing, Contrastive Divergence, that sort of thing) to train the connection weights between the input and hidden layer, and between the hidden and the output layer.
So far, this is very much the kind of thing we’ve been doing for a long time. Nothing at all new.
Now, we’ll try the new step, in a sort of Gedankenexperiment (a German word for “thought experiment”).
Gedankenexperiment: Letting the Network Think for Itself
We’re going to propose, and play with, a very rudimentary mechanism. We’re going to say that the hidden layer has done its job, and is now free to wander about. It is subject to the law of free energy minimization, and we’re going to give it little “burps” of stimulus. Really, just perturbations – nothing big, nothing tremendously exciting. Nothing that makes it say, “Oh, I’ve got an incoming input pattern that I need to recognize.”
Just little burps; enough for it to flip some of its nodes; one might go from A to B, and another from B to A.
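To make the “burp” concrete, here is a minimal sketch of such a perturbation. It assumes the hidden layer is stored as a flat list of 0/1 values (1 for state A, 0 for state B); the function name `perturb` and the choice of two flips are hypothetical, picked only for illustration.

```python
import random

def perturb(hidden, n_flips, rng):
    """Return a copy of `hidden` with `n_flips` randomly chosen units flipped."""
    flipped = list(hidden)
    for idx in rng.sample(range(len(flipped)), n_flips):
        flipped[idx] = 1 - flipped[idx]  # A (1) -> B (0), or B (0) -> A (1)
    return flipped

rng = random.Random(0)  # seeded so the sketch is repeatable
hidden = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
burped = perturb(hidden, n_flips=2, rng=rng)

# Indices of the units that changed state.
changed = [i for i, (old, new) in enumerate(zip(hidden, burped)) if old != new]
print(changed)
```

Nothing here knows about input patterns; the flips are pure noise, which is exactly the point of the thought experiment.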
Now, to think through what happens next, we really need to be looking at the free energy equation. However, we really haven’t defined the free energy for the hidden layer, operating in the absence of input stimulus.
In fact, the only free energy equation that we would likely have so far would be the classic one for training a (restricted) Boltzmann machine, which I first discussed with you in Neg-Log-Sum-Exponent-Neg-Energy – That’s the Easy Part!.
It’s worth our while to re-familiarize ourselves with this equation. First, in all the write-ups that you’ll see about energy-based methods, they will show this (kind of) equation, and say that it’s the energy of the system.
Actually, it’s what physicists and physical chemists call the enthalpy. It’s definitely an energy; it’s the energy contributed to the system by each node that is in an “on” (A) state, or “off” (B) state, with the idea that if a node is in state A, then it contributes a bit of energy to the system; this little bit of energy is a (if the node is in the input or output – the “visible” – layers), or b (if the node is in the hidden layer). Typically, the values for a and b are some positive number for the “on” (A) state nodes, and zero for those in the “off” (B) state; this is the classic Ising model. Also, there is an “energy” contributed by the connection weights w(i,j), connecting the input-to-hidden and hidden-to-output layers.
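As a rough sketch of what this enthalpy looks like in code, the function below uses the form most commonly written in the restricted Boltzmann machine literature, E = -sum a(i)v(i) - sum b(j)h(j) - sum v(i)w(i,j)h(j). The overall sign convention is an assumption here and may differ from the figure; the sample numbers are made up for illustration.

```python
def rbm_energy(v, h, a, b, w):
    """Energy (enthalpy) of one visible/hidden configuration, standard RBM form."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(
        v[i] * w[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -(visible_term + hidden_term + interaction)

v = [1, 0, 1]                      # visible units: 1 = "on" (A), 0 = "off" (B)
h = [1, 0]                         # hidden units
a = [0.5, 0.5, 0.5]                # per-unit parameters, visible layer
b = [0.2, 0.2]                     # per-unit parameters, hidden layer
w = [[1.0, -1.0], [0.5, 0.5], [-0.5, 2.0]]  # visible-to-hidden weights

energy = rbm_energy(v, h, a, b, w)
print(energy)
```

Only units that are “on” contribute: a configuration with every unit off has zero energy, whatever the parameters are.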
A lot of confusion can creep in for those not classically trained in statistical mechanics, and who haven’t grown up (more or less) with this equation. The reason is that (physicist) writers may use this equation, and say that it’s the “energy” of the system, but it’s not the free energy.
What we minimize in an energy-based model is the free energy, not just the enthalpy. The free energy is composed of two terms; enthalpy and entropy. They play together; one influences the other.
If you have not read my earlier post on this, and you have not taken at least a full semester of graduate-level statistical mechanics, please go back to my earlier post and read it; it will help clear up confusion – or at least help you become aware of the areas in which you’ve got some (latent, under-the-surface) less than crystal-clear understanding – so you’ll know what you need to learn. Here it is: Neg-Log-Sum-Exponent-Neg-Energy – That’s the Easy Part!.
Now, trusting that you’ve got some insight into this free energy = enthalpy – entropy game, let’s pretend that we’re going to apply this to just the hidden layer in the network. We’re not going to train for good connection weights; let’s assume that’s already done. What we’re after is some process that could happen AFTER we’ve trained connection weights and AFTER we’ve let the network respond to input stimulus, and now it’s meandering around, wondering what to do next.
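If the free energy = enthalpy – entropy relation still feels abstract, a tiny numerical check can help. The sketch below enumerates every state of a four-unit toy system, in units where kT = 1, and compares the neg-log-sum-exponent-neg-energy form of the free energy with (average energy) minus (entropy). The toy energy function and its parameters `eps1` and `eps2` are hypothetical.

```python
import itertools
import math

def energy(state, eps1=1.0, eps2=-0.5):
    """Hypothetical toy energy: per-unit cost plus a pairwise term among active units."""
    n_on = sum(state)
    return eps1 * n_on + eps2 * n_on * (n_on - 1) / 2

# All 2**4 microstates of a four-unit bistate system.
states = list(itertools.product([0, 1], repeat=4))
weights = [math.exp(-energy(s)) for s in states]
Z = sum(weights)                        # partition function
probs = [wt / Z for wt in weights]      # Boltzmann probabilities

enthalpy = sum(p * energy(s) for p, s in zip(probs, states))
entropy = -sum(p * math.log(p) for p in probs)
free_energy = -math.log(Z)              # neg-log-sum-exp-neg-energy

print(free_energy, enthalpy - entropy)
```

The two printed numbers agree to floating-point precision, which is the identity in miniature: at equilibrium, the log of the partition function already contains both the enthalpy and the entropy.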
We want to give the network (specifically, the hidden layer, because the input and output layers are presumably inactive at this moment) something useful to do, and this “something useful” should be free energy minimization.
We would use the classic Ising equation. This would not be very new to us; the equation above is a variant on this theme.
A Small Digression: The Classic Ising Equation
The figure below gives the classic Ising equation, which is a (complete) free energy equation for a bistate system. That is what we have: each unit can be in one of two states, A or B (“on” or “off”).
The classic Ising equation for a bistate system gives the (reduced) free energy, bar-A, as the enthalpy minus the entropy.
If you’ll look at this equation, and the one in the figure above it, you’ll see strong similarities – because they’re essentially the same equation. The equation above this one (for the restricted Boltzmann machine) doesn’t show the entropy term; it just gives us the enthalpy. So we want to compare the three sums in that equation against the first two terms in the equation just above; one involving x, and the other involving x-squared.
The x in the equation above refers to the fraction of units in the “on” (A) state. Because we have a bistate system, if x is the fraction in A, then 1-x is the fraction in B. What the equation above is telling us is that only those units in A have an energy associated with them; this means that the parameters a and b in the first equation correspond to the parameter epsilon1 in the equation just above (for the case where units are in state A), and that a and b are both set equal to zero for the non-active nodes (the case where units are in state B). The important thing is to draw back and realize that the per-unit activation energy is a linear function of the fraction of active nodes, x, in both cases.
Similarly, there is an x-squared term in both equations. In the one immediately above, we see that there is a single parameter, epsilon2, multiplying the x-squared term. What this means is that in addition to the energy-per-active-node, we get an additional energy that is associated with pairwise interactions between active nodes.
Now, this doesn’t mean that we think that any one active node, let’s call it A(1), is interacting equally strongly with a nearest neighbor A, let’s call it A(2), as it would be with another A node (e.g., A(3)) that is some distance away. The parameter epsilon2 has a lot behind the curtain here; it represents (usually) a sort of “mean-field” response of the entire sea of active A nodes. That said, this second enthalpy term scales proportionately with x-squared in the equation above, and with v(i)h(j) in the prior equation. This latter term is also proportional to the square of the number of active nodes, if we think that the interaction enthalpy parameter w(i,j) is nonzero for those cases where both nodes are active.
In short, these two enthalpy expressions essentially represent the same thing. The differences amount, more or less, to subtleties of interpretation. They’re as similar as the same phrase in French and Italian.
The entropy term in the equation above is the absolute simplest bare-bones entropy. It is very classic; you have likely seen this in several contexts already.
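Here is a minimal sketch of the reduced free energy as a function of x, assuming the common form: enthalpy epsilon1·x + epsilon2·x², and entropy -[x ln(x) + (1-x) ln(1-x)]. The exact constants and scaling in the figure may differ, so treat the function below as an illustration of the shape rather than a transcription. A simple grid search finds the minimizing x.

```python
import math

def reduced_free_energy(x, eps1, eps2):
    """Reduced free energy = enthalpy - entropy, for a fraction x of units in state A."""
    enthalpy = eps1 * x + eps2 * x * x
    entropy = -(x * math.log(x) + (1 - x) * math.log(1 - x))
    return enthalpy - entropy

# Grid over the open interval (0, 1); the logs are undefined at the endpoints.
grid = [i / 1000 for i in range(1, 1000)]

# No enthalpy at all: entropy alone decides, and it prefers an even split.
x_flat = min(grid, key=lambda x: reduced_free_energy(x, eps1=0.0, eps2=0.0))

# A per-unit energy cost for being "on" pushes the minimum toward fewer A units.
x_cost = min(grid, key=lambda x: reduced_free_energy(x, eps1=1.0, eps2=0.0))

print(x_flat, x_cost)
```

This is the “playing together” of the two terms: entropy pulls x toward one half, and the enthalpy parameters pull it away again.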
As we compare these two equations, we’re not going to let the omission of the entropy term in the first equation bother us. It’s there implicitly. And once again; if you go back to my blogpost on Neg-Log-Sum-Exponent-Neg-Energy – That’s the Easy Part!, I give at least a gentle, fairy-kissed, pink-feather-duster overview of where that entropy comes in. And if you want to learn more, and haven’t done so yet, you’ll get a MUCH more detailed introduction to microstates, the partition function, and related topics if you sign up for the Machine Learning Opt-In list; this is on the BOOK page.
So we’ve established something very essential before we move forward; we’ve made the connection between the energy term used in the (restricted) Boltzmann machine energy-based learning model with the classic (and precursor) Ising equation.
Another Small Digression: Connecting the Hopfield Neural Network to the Ising Model, and the Boltzmann Machine to the Hopfield
I’d love to press on. However, we’ll all benefit if we take a moment to put together a few more jigsaw-puzzle pieces before moving into new territory.
The thing that we want to recognize is that we’re re-introducing the notion of free energy minimization into neural network behavior, but the first (and most significant) step was made by John Hopfield, back in 1982.
Hopfield’s work was pure, radical, breath-taking genius. We tend to forget what a huge step it was, because there have been so many advances since then. However, he was the first person to link up statistical mechanics with neural networks, which at that time didn’t have a single well-known learning rule. (I don’t mean to diminish the other major breakthrough insights by researchers such as Shun-Ichi Amari and others of that era; this is a thumbnail sketch.)
Hopfield created one of the first working neural networks, and of course we all call it the Hopfield network today. Here are his original equations, in the figure below.
Notice the strong similarity between this equation and the one above. This particular equation has ONLY the interaction-enthalpy (pairwise interaction energy) term. It doesn’t have the per-unit energy term(s) that we saw in the previous equations. Once again, though – much the same thing. It’s as though we’re now looking at the same equation (more or less) in French, Spanish, and Italian. We can see the resemblance; we know that they’re talking about the same thing.
In the Hopfield neural network, the above equation (and yes, with an implicit entropy term, just as with the restricted Boltzmann machine) was the free energy that got minimized. However, the Hopfield neural network had problems. Yes, it was a brilliant, genius-level conceptual breakthrough – but it didn’t store patterns very well. It had trouble discerning patterns; the memory capacity was low.
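For readers who have never played with one, here is a small Hopfield-style sketch. It makes several assumptions not taken from the figure: units take values +1/-1 (a common convention, rather than on/off), the weights come from a simple Hebbian rule, and the energy is the pairwise form E = -1/2 · sum w(i,j)s(i)s(j). The function names are hypothetical.

```python
def train_hebbian(patterns):
    """Hebbian weights: w[i][j] accumulates p[i]*p[j]/n, with no self-connections."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def hopfield_energy(w, s):
    """Pairwise interaction energy only; no per-unit term."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def recall(w, s, sweeps=5):
    """Asynchronous updates: each unit aligns with its local field, one at a time."""
    s = list(s)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s

stored = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hebbian([stored])

noisy = list(stored)
noisy[0] = -noisy[0]   # corrupt two units
noisy[3] = -noisy[3]

restored = recall(w, noisy)
print(restored == stored, hopfield_energy(w, noisy), hopfield_energy(w, restored))
```

With a single stored pattern, recall works nicely and the energy drops as the network settles. The capacity trouble shows up when many patterns share the same small set of weights.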
The equally-genius breakthrough that Ackley, Hinton, and Sejnowski made just three years later – in 1985 – was to recognize that we didn’t necessarily have to connect up all the nodes in the network. They suggested that we use the Perceptron architecture (which at that time was NOT at all well-known). Their invention was even more remarkable because there was not another well-known training method for neural networks at that time. The work by Werbos inventing back-propagation didn’t really come to light until a year or so later, although Werbos made his original invention back in the 1970s. One of the major insights that Ackley, Hinton, and Sejnowski further suggested was that we NOT have connections between nodes within a given layer; all the connections should be just between layers. This was in accordance with some early efforts at neural network architectures; Bernie Widrow had his ADALINE/MADALINE networks by then, but still – there was no good, working, pattern-recognizing neural network yet. This was all brand-new.
Thus, breakthrough stuff – using statistical mechanics principles to train neural networks. They called it “simulated annealing,” and the idea was to heavily perturb the network (as in, “heating” it), then let it settle, and repeat this perturb/settle process many times until the network finally resolved into a set of working weights. (Yes, I’m simplifying that considerably.)
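The perturb/settle idea can be sketched in a few lines. What follows is a generic Metropolis-style annealing loop over unit states on a toy energy, not the Boltzmann machine learning procedure itself; the ring-shaped `toy_energy`, the cooling schedule, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random

def toy_energy(s):
    """Hypothetical energy: neighbouring units on a ring prefer to agree."""
    n = len(s)
    return -sum(s[i] * s[(i + 1) % n] for i in range(n))

def anneal(s, energy_fn, t_start=3.0, t_end=0.05, steps=2000, seed=0):
    """Flip one unit at a time; accept uphill moves less often as it cools."""
    rng = random.Random(seed)
    s = list(s)
    e = energy_fn(s)
    best, best_e = list(s), e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(len(s))
        s[i] = -s[i]                      # perturb: flip one unit
        e_new = energy_fn(s)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            e = e_new                     # accept the move
            if e < best_e:
                best, best_e = list(s), e
        else:
            s[i] = -s[i]                  # reject: undo the flip
    return best, best_e

start = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]   # worst case: every neighbour disagrees
best, best_e = anneal(start, toy_energy)
print(toy_energy(start), best_e)
```

At high temperature almost any flip is accepted (the “heating”); as the temperature falls, the system increasingly keeps only the flips that lower the energy (the “settling”).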
The following figure shows how this line of thinking evolved.
(I introduced this connection in one of my previous posts; Brain-based Computing: Foundation for Deep Learning. This post has a couple of links to more reading material, but not much extra text, so you don’t really need to go back to this source unless you really want to.)
Getting Back to the Gedankenexperiment: Finally Ready to Play
So, let’s return to our original objective. We’re going to think about what would happen if we did a free energy minimization across just the hidden layer of the neural network. We’re not going to worry about connecting hidden layer nodes to input and output nodes; that has already been trained. Instead, we’re asking ourselves what would happen if we were to perturb a few of those hidden layer nodes; causing them to shift from A to B and vice versa.
The answer is, in brief: nothing that we’d really like.
Damn! Too bad!
But really. If we did this, we’d be treating the hidden layer as something like a Hopfield neural network, and we know that it has memory problems. And we know that it has problems restoring itself to a known pattern once perturbed by noise … so it seems that we wouldn’t gain anything after all.
The problem is that if we were to perturb this hidden layer, there is nothing that would bring it back to a known stable state. I’m not saying that it couldn’t get back to some sort of stable state; we’d trust the free energy minimization process to do its best, and get us to some sort of minimum. But it wouldn’t necessarily be a minimum that we know and love. It might be a very shallow minimum. It might be just useless.
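One way to see the problem, under the assumption that the hidden layer’s free energy is the simple Ising form that depends only on the fraction x of units in state A: two very different patterns with the same x are indistinguishable to the minimization. The sketch below is an illustration of that assumption, with hypothetical patterns and parameter defaults.

```python
import math

def ising_free_energy(state, eps1=0.0, eps2=0.0):
    """Simple Ising (reduced) free energy; it sees only the fraction of units in A."""
    x = sum(state) / len(state)
    enthalpy = eps1 * x + eps2 * x * x
    # Skip zero fractions so that an all-A or all-B layer does not hit log(0).
    entropy = -sum(p * math.log(p) for p in (x, 1 - x) if p > 0)
    return enthalpy - entropy

trained = [1, 1, 0, 0, 1, 0, 1, 0]     # pretend this is a pattern we care about
scrambled = [0, 1, 1, 0, 0, 1, 0, 1]   # same number of A units, different arrangement

print(ising_free_energy(trained), ising_free_energy(scrambled))
```

Both calls print the same value, so nothing in this free energy can pull the layer back toward the arrangement it had before the perturbation.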
And with no control on this thing other than letting free energy minimization run its course, we’re not guaranteed any useful output at all.
Further, the only way that we could get the network back to its original state would be to introduce some sort of memory-stickiness.
This doesn’t feel good, does it?
We know in our gut that free energy minimization is the natural order of the universe.
We can also see that bringing it into the hidden layer – using just the simple, classic Ising equation approach discussed above – is not going to help us at all.
Let’s treat this as a bit of a cliff-hanger, shall we?
Until next time –
Live free or die, my friend –
AJ Maren
Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War
Some Useful Background Reading on Statistical Mechanics
- Hermann, C. Statistical Physics – Including Applications to Condensed Matter, in Course Materials for Chemistry 480B – Physical Chemistry (New York: Springer Science+Business Media), 2005. pdf. Very well-written, however, for someone who is NOT a physicist or physical chemist, the approach may be too obscure.
- Maren, A.J. Statistical Thermodynamics: Basic Theory and Equations, THM TR2013-001(ajm) (Dec., 2013).
- Salzman, R. Notes on Statistical Thermodynamics – Partition Functions, in Course Materials for Chemistry 480B – Physical Chemistry, 2004. Statistical Mechanics (chapter). Online book chapter. This is one of the best online resources for statistical mechanics; I’ve found it to be very useful and lucid.
- Tong, D. Chapter 1: Fundamentals of Statistical Mechanics, in Lectures on Statistical Physics (University of Cambridge Part II Mathematical Tripos), Preprint (2011). pdf.
Previous Related Posts
- How Getting to a Free Energy Bottom Helps Us Get to the Top
- What’s Next for AI: Beyond Deep Learning
- Third Stage Boost – Part 2: Implications of Neuromorphic Computing
- Third Stage Boost: Statistical Mechanics and Neuromorphic Computing
- Neg-Log-Sum-Exponent-Neg-Energy – That’s the Easy Part!
- 2025 and Beyond
- Deep Learning: The Fast Evolution of Artificial Intelligence
- Statistical Mechanics of Machine Learning blogpost – the great “St. Crispin’s Day” introduction of statistical mechanics and machine learning.
- Brain-based Computing: Foundation for Deep Learning.