Generative vs. Discriminative – Where It All Began

Working Through Salakhutdinov and Hinton’s “An Efficient Learning Procedure for Deep Boltzmann Machines”

 
We can accomplish a lot using multiple layers trained with backpropagation. However (as we all know), there are limits to how many layers we can train at once if we're relying strictly on backpropagation (or any other gradient-descent learning rule).

This is what stalled out the neural networks community, from the mid-1990s to the mid-2000s.

The breakthrough came from Hinton and his group, with a crucial paper published in 2006, and a more detailed paper in 2012.

Here are the crucial links:

  • Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. pdf (Accessed July 1, 2018.)
  • Salakhutdinov, R. and Hinton, G. (2012, August). An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, 24(8), 1967-2006. pdf (Accessed July 1, 2018.)

 

Note: These links are given again at the tail-end of this blog, along with other useful reads, including (but not limited to):

  • Good (easy) pieces that give historical context,
  • These crucial reads (together with abstracts),
  • Related important refereed articles,
  • Useful technical blogs, and
  • A selection of my own most-related and useful blogs on this subject.

 

The papers given above are the breakthrough papers. They encapsulate what transformed neural networks into “deep learning” (as Hinton coined the phrase).

The challenge is that these papers express a worldview that is not only encapsulated within statistical mechanics, but is further framed within the realm of Hinton's own work.

This means: unless you come to these papers with at least a reading-level understanding of statistical mechanics, and you understand how Hinton and his colleagues have developed that particular kind of neural network thinking, it will be very hard to read these papers. Almost impossibly hard, if you don’t know stat mech as a solid starting point.

The corollary problem is that it is also very hard to learn statistical mechanics on your own. (I’ve written about this, a lot. This is what has inspired me to start writing The Book, because the books out there are all written by physicists for physicists, not for lay-people who want just the essential equations so they can go on to artificial intelligence and (there’s that Hinton-phrase again) deep learning.)

However.

It will be a while before I can get that book done.

In the meantime, I'll be teaching a new course this fall, in Northwestern University's Master of Science in Data Science Program; it will be MSDS 458: Artificial Intelligence and Deep Learning.

The people coming into this course will be a lot like you: smart, driven, willing to throw themselves at the wall. But if that wall is a sheer ice cliff, then no amount of throwing oneself at it will suffice.

What’s needed is a way to learn the requisite materials, and that’s the purpose of this blog post. (This will take several posts, as the key 2012 article is a terse but still lengthy forty pages.)

So what I’m planning here is a series of tutorial articles. I’ll link back to where I’ve discussed some of the crucial concepts earlier.

 
 

Neural Network Architectures: Multilayer Perceptrons and (Restricted or Not) Boltzmann Machines

 

Our first step is to gain some clarity on the architectures. If we read a paper by Hinton, or any of his students and/or colleagues, we'll see the phrase “restricted Boltzmann machine,” or RBM. Architecturally, a restricted Boltzmann machine looks just like a Multilayer Perceptron (MLP).

Neural network architectures: Multilayer Perceptron (MLP) and the Restricted Boltzmann Machine (RBM) have identical architectures. Figure from A.J. Maren, “Statistical Mechanics, Neural Networks, and Artificial Intelligence” (in progress). Used with permission.
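To make the “identical architecture” point concrete, here is a minimal sketch in plain NumPy (my own illustration, with made-up layer sizes; none of this code is from either paper). It shows that the input-to-hidden step of an MLP and the visible-to-hidden step of an RBM have exactly the same affine-plus-sigmoid form. Only the interpretation differs: in the RBM, the sigmoid output is read as the probability that a hidden node turns on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes, purely for illustration.
n_visible, n_hidden, n_output = 6, 4, 3

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# MLP: input -> hidden -> output; each step is an affine map followed by a squashing function.
W1 = rng.normal(0.0, 0.1, (n_visible, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_output))
x = rng.random(n_visible)
mlp_hidden = sigmoid(x @ W1)
mlp_output = sigmoid(mlp_hidden @ W2)

# RBM: visible -> hidden uses exactly the same affine-plus-sigmoid form;
# here, though, the sigmoid is read as P(h_j = 1 | v) rather than as an activation value.
W_rbm = rng.normal(0.0, 0.1, (n_visible, n_hidden))
b_hidden = np.zeros(n_hidden)
p_hidden_given_visible = sigmoid(x @ W_rbm + b_hidden)

print(mlp_hidden.shape, p_hidden_given_visible.shape)  # same shape, same functional form
```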

 

The difference between an MLP and an RBM is not simply semantics, although yes, Hinton has certainly been marking his territory by enforcing a certain terminology. Nor is it just about asserting a certain lineage, going back to his original invention of the Boltzmann machine in the 1985/1986 timeframe. (In academic circles, pedigree is important, and academics will cling to the pedigree of their concepts as firmly as barnacles to a rock.)

The real distinction is conceptual, and it is deeply important. It is that the Boltzmann machine is a direct descendant of John Hopfield's original network, which we all now refer to as a “Hopfield network.” This was a fully connected network, with the sole exception that there were no self-connections. A Hopfield network is shown in the following figure.

Before we go into the deep-dive of how the Boltzmann machine (both restricted and general) originates from the Hopfield neural network, let's note something very important. In the figure above, the directional arrows (for both the MLP and the RBM) refer to the network when it is performing a task. When actually running a task, both of those networks operate in feedforward mode.

When they are being trained, there is a significant difference. Training for a classic MLP works by taking the difference between the actual output values and the desired output values (the error term), and using it to adjust the connection weights – both hidden-to-output, and input-to-hidden. The Boltzmann machine trains in a very different way, which is why you will NOT see directional arrows in figures produced by Hinton and colleagues, when describing any variant of the Boltzmann machine.

Neural network architectures: the Multilayer Perceptron (MLP) and the Restricted Boltzmann Machine (RBM) structure their layers in the same way, but the way in which they train the connection weights is very different; MLPs feed error-correcting information back from the top to the bottom weight sets, while the RBM trains using energy minimization. Figure from A.J. Maren, “Statistical Mechanics, Neural Networks, and Artificial Intelligence” (in progress). Used with permission.
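To see how different the two training procedures described above really are, here is a highly simplified sketch (again plain NumPy, biases omitted, made-up sizes; my own illustration). The first half is a single error-driven gradient step, the essence of MLP training; the second half is a single contrastive-divergence (CD-1) step, which is one widely used practical approximation of the Boltzmann machine's “data statistics minus model statistics” weight update. Note that the 2012 paper itself uses a more sophisticated variational / persistent-chain procedure; CD-1 is shown here only because it is the simplest concrete example of the idea.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
n_in, n_hid, n_out = 6, 4, 3
lr = 0.1

# --- MLP-style training step: push the output error back through both weight sets. ---
W1 = rng.normal(0.0, 0.1, (n_in, n_hid))      # input-to-hidden weights
W2 = rng.normal(0.0, 0.1, (n_hid, n_out))     # hidden-to-output weights
x, target = rng.random(n_in), rng.random(n_out)
h = sigmoid(x @ W1)
y = sigmoid(h @ W2)
delta_out = (y - target) * y * (1 - y)        # error term at the output layer
delta_hid = (delta_out @ W2.T) * h * (1 - h)  # same error, propagated back to the hidden layer
W2 -= lr * np.outer(h, delta_out)             # adjust hidden-to-output weights
W1 -= lr * np.outer(x, delta_hid)             # adjust input-to-hidden weights

# --- RBM-style training step: compare data-driven and model-driven statistics (CD-1). ---
W = rng.normal(0.0, 0.1, (n_in, n_hid))       # one set of symmetric visible<->hidden weights
v0 = (rng.random(n_in) < 0.5).astype(float)   # a visible ("clamped") data pattern
p_h0 = sigmoid(v0 @ W)                        # positive phase: hidden probabilities given the data
h0 = (rng.random(n_hid) < p_h0).astype(float)
p_v1 = sigmoid(h0 @ W.T)                      # reconstruct the visible layer from the hidden sample
p_h1 = sigmoid(p_v1 @ W)                      # negative phase: hidden probabilities given the reconstruction
W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))  # <v h>_data minus <v h>_model
```

The key point to notice: the MLP update needs a desired target value, while the Boltzmann-machine update needs only the data pattern itself and the model's own reconstruction of it.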

Even though the (restricted) Boltzmann machine and the Multilayer Perceptron nominally look the same during operation, they are entirely different animals. The difference is a bit deeper than what was shown in the previous figure. The REAL difference is that MLPs are conceptually derived from the simple notion of Perceptrons, as introduced by Rosenblatt in 1958.

The original Perceptron architecture, from Frank Rosenblatt’s classic 1958 paper, “The Perceptron: A probabilistic model for information storage and organization in the brain.” This original architecture did not have distinct and separate layers.

The original Perceptron, as introduced in 1958 by Frank Rosenblatt, did not have differentiated layers. The notion of layers started with Bernie Widrow and colleagues, who worked with a simple two-layer network that they called ADALINE (or MADALINE, depending on the network). Although middle layers were possible, there was no way to train their connection weights so that they took on useful values. That breakthrough came from Paul Werbos, who invented the backpropagation learning method for his Harvard Ph.D. dissertation in 1974.

In contrast, the Boltzmann machine (as originally presented) is really a variation of the Hopfield neural network. Thinking of it as having separate “layers” is sort of an artifact; just a matter of drawing it in a certain way. What is really true is that Hinton et al. separated the nodes of a Hopfield neural network into two distinct groups, which they called “clamped” and “unclamped” or “not-clamped.”

While the (restricted) Boltzmann Machine (RBM) and the Multilayer Perceptron (MLP) have identical architectures during actual operation, the Boltzmann machine is derived from the Hopfield network. Its primary difference is that during training, some nodes are given values (“clamped”), while others are allowed to find their own values (“not-clamped”). These “not-clamped” or hidden nodes are trained so that they take on values that best allow the “clamped” nodes to recover desired values when given a partial “clamped” pattern. We can re-arrange these nodes into layers, where the “clamped” nodes are further divided into a set of input nodes and a set of output nodes, while the “unclamped” nodes become the layer of hidden nodes. Figure from A.J. Maren, “Statistical Mechanics, Neural Networks, and Artificial Intelligence” (in progress). Used with permission.
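Since the Boltzmann machine's roots are in the Hopfield network, it helps to see the energy function they share and what “clamping” means operationally. Below is a toy sketch (my own illustration, not code from the papers): a small fully connected, symmetric network with no self-connections, the energy function E(s) = -1/2 s^T W s - b^T s, and one sweep of stochastic updates in which the clamped nodes keep their assigned values while the un-clamped (hidden) nodes are free to settle.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n = 8                                    # total nodes, Hopfield-style: all-to-all connectivity
W = rng.normal(0.0, 0.1, (n, n))
W = (W + W.T) / 2.0                      # symmetric weights
np.fill_diagonal(W, 0.0)                 # no self-connections
b = np.zeros(n)

def energy(s):
    """Hopfield / Boltzmann energy: E(s) = -1/2 * s^T W s - b^T s."""
    return -0.5 * s @ W @ s - b @ s

s = (rng.random(n) < 0.5).astype(float)
clamped = np.zeros(n, dtype=bool)
clamped[:4] = True                       # first four nodes are "clamped" (values held fixed)

# One sweep of stochastic (Gibbs) updates: only the un-clamped ("hidden") nodes
# are allowed to change, settling toward lower-energy / higher-probability states.
for i in range(n):
    if clamped[i]:
        continue
    p_on = sigmoid(W[i] @ s + b[i])      # P(s_i = 1 | all other nodes)
    s[i] = float(rng.random() < p_on)

print("energy after one sweep:", energy(s))
```

Lower energy corresponds to higher probability under the Boltzmann distribution, so letting the hidden nodes settle this way is what “energy minimization” means in the figure caption above.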

All of this puts us in a position to start with the first paragraph of Salakhutdinov and Hinton’s 2012 paper, as shown in the figure below.

Screenshot of the first paragraph of Salakhutdinov and Hinton’s classic (2012) paper, An Efficient Learning Procedure for Deep Boltzmann Machines.

This exposition will have to wait until next week.

 
 

Live free or die, my friend –

AJ Maren

Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War

 
 

References & Resources

 

Fluffy, But Gives Good Historical Context

  • Allen, K. (2015). How a Toronto professor’s research revolutionized artificial intelligence. The Star (Fri., April 17, 2015). Online article. (Accessed July 1, 2018.)

 

Most Crucial To-Reads (Journal and arXiv)

  • Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. pdf (Accessed July 1, 2018.)
  • Salakhutdinov, R. and Hinton, G. (2012, August). An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, 24(8), 1967-2006. pdf (Accessed July 1, 2018.)

 

Abstract: We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer pretraining phase that initializes the weights sensibly. The pretraining also allows the variational inference to be initialized sensibly with a single bottom-up pass. We present results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects. We also show that the features discovered by deep Boltzmann machines are a very effective way to initialize the hidden layers of feedforward neural nets, which are then discriminatively fine-tuned.

 

Academic Articles and Books – Supporting Materials

 

The Boltzmann Machine

  • Hinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA, MIT Press. pdf (Accessed July 10, 2018.)
  • Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147-169. pdf (Accessed July 10, 2018.)

 

The Hopfield Neural Network

  • Folli, V., Leonetti, M. and Ruocco, G. (2016). On the Maximum Storage Capacity of the Hopfield Model. Front Comput Neurosci, 10, 144. Published online 2017 Jan 10. doi: 10.3389/fncom.2016.00144. PMCID: PMC5222833; PMID: 28119595. Online access (Accessed July 10, 2018.)
  • Hopfield J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. U.S.A. 79, 2554, doi:10.1073/pnas.79.8.2554. Online access (Accessed July 10, 2018.)

 

The Perceptron – Original and Multilayer (Including the Backpropagation Learning Method)

  • Rosenblatt, F. (1958). The Perceptron: a probabilistic model for information storage and organization in the brain. Psych Rev, 65(6), 386-408. doi:10.1037/h0042519. pdf (Accessed July 10, 2018.)
  • Werbos, P.J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. (Doctoral dissertation). Retrieved from ResearchGate. pdf (Accessed July 10, 2018.)
  • Werbos, P.J. (1994). The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting (Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control) 1st Edition. Hoboken, NJ: Wiley-Interscience. Available in hardcover through Amazon.com (Accessed July 10, 2018.)
  • For more on the backpropagation learning rule, see the Useful Blogs and YouTube Vids, below.

 

Useful Blogs and YouTube Vids

  • Schmidhuber, J. (2014, updated 2015). Who Invented Backpropagation? Blogpost. Online access (Accessed July 10, 2018.)
  • Werbos, P.J. (2017). IJCNN 2017 Plenary Talk: Backpropagation in the Brain Pt. 1. YouTube vid. (Accessed July 10, 2018.)
  • Werbos, P.J. (2017). IJCNN 2017 Plenary Talk: Backpropagation in the Brain Pt. 2. YouTube vid. (Accessed July 10, 2018.)
  • Werbos, P.J. (2017). IJCNN 2017 Plenary Talk: Backpropagation in the Brain Pt. 3. YouTube vid. (Accessed July 10, 2018.)

 
 

Previous Related Posts

 

  • Blog links to be inserted here

 
 
