Figuring Out the Number of Hidden Nodes: Then and Now
One of the most demanding questions in developing neural networks (of any size or complexity) is determining the architecture: number of layers, nodes-per-layer, and other factors. This was an important question in the late 1980’s and early 1990’s, when neural networks first came into widespread practical use.
Deciding on the network architecture details is even more challenging today.
In this post, we’re going to look at some strategies for deciding on the number of hidden nodes to use.
We’re going to situate this in a historical context. We’ll start with a very classic work by Gorman and Sejnowski (late 1980’s). Then, we’ll jump over thirty years, and look at some very recent and good work – essentially dealing with the same issue.
By the end of this post, we’ll have a sense for:
- How guidelines have evolved over thirty years (although some old guidelines are as good as ever), and
- How deep learning architectures actually give us simplicity within complexity (nice to know!).
Initial Network Configuration Strategies
One of the earliest practical neural network applications was when Gorman and Sejnowski (late 1980’s) used neural networks for passive sonar modeling of acoustic signatures. They presented examples using whales for the signature base. As they were Navy-funded, we all knew that the subject of interest was something underwater; something definitely NOT a whale.
This is a really good paper, and worth reading now – some thirty years later – for three reasons:
- It addresses a real-world problem and shows very satisfactory results,
- It walks through a study of trying different numbers of hidden nodes, correlating performance with numbers of hidden nodes (including a network where there are NO hidden nodes; just input & output layers), and
- They show us how to do a “deep dive” into analyzing just what those hidden nodes are doing.
Because of these reasons, this paper became an “instant classic.”
The above work led to other useful results. For example, David Moore’s 1991 thesis, Passive Sonar Target Recognition, addressed essentially the same problem, with a very similar approach.
The value in looking at this 1991 thesis now is that the rules-of-thumb, or heuristics, generated then still hold true today.
The following extract illustrates this:
“… that as the ratio of input elements to hidden elements increases, the better the neural network becomes at generalization. Therefore, we had to choose more input elements in relation to hidden elements.” (pp. 30 – 31).
The author follows by saying
“… we looked at two rules-of-thumb which were commonly used by neural network designers. The first rule-of-thumb states that “the more complex the relationship between the input data and the desired output, the more PEs (processing elements) are normally required in the hidden layer” [Ref.10]. The second rule-of-thumb concerning the number of hidden layer units, h, can be best expressed in the formula” [Ref. 11, in Moore Thesis, 1991]
Moore used the following equation, which is still current (link is to a CrossValidated: StackExchange Q&A).
In this equation, the numerator, K, represents the number of training vectors available to train the network, and the variables m and n represent the number of processing elements occupying the input and output layers, respectively. Note that Cf represents the data complexity factor.
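As a rough illustration of how this rule-of-thumb behaves, here is a minimal Python sketch. The equation itself is not reproduced in the text above, so the form h = K / (Cf * (m + n)) is an assumption, inferred from the variable definitions just given and from the CrossValidated answer that this post says it matches. The function name and the sample numbers are hypothetical; they are not taken from Moore's thesis.

```python
def hidden_nodes_rule_of_thumb(k_training_vectors, m_inputs, n_outputs, complexity_factor):
    """Assumed form of the rule-of-thumb: h = K / (Cf * (m + n))."""
    if complexity_factor <= 0 or (m_inputs + n_outputs) <= 0:
        raise ValueError("Cf and (m + n) must be positive")
    return k_training_vectors / (complexity_factor * (m_inputs + n_outputs))


# Hypothetical numbers, chosen only to show how the estimate scales with K.
for k in (1_200, 120_000):
    h = hidden_nodes_rule_of_thumb(k, m_inputs=30, n_outputs=2, complexity_factor=5)
    print(f"K = {k:>7}: suggested number of hidden nodes ~ {h:.1f}")
```

Because the estimate grows linearly with K, a very large training set produces a very large suggested hidden layer, so the result is probably best read as a loose guide rather than a target.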
These two works are representative of the kind of work that was being done in the late 1980’s and early 1990’s, when people were learning how to do neural network designs and configurations for the first time. They are not the only sources available; there was an absolute mushrooming of similar studies.
Dr. A.J.’s Comment: I would personally take the value for K in the numerator (the number of training vectors) with a grain of salt. Today, we often train neural networks on a very large number of training examples; this does not imply that we need a correspondingly large number of hidden nodes.
What To Do with Just a Single Layer
Deep neural network architectures, with multiple hidden node layers, are more complex than those with just a single hidden layer. This means that there are more architectural decisions: how many layers, how many nodes per layer, etc.
In the deep learning course that I’ve been teaching at Northwestern University, students experiment with increasingly complex neural networks. In the simplest possible configurations, they only deal with a few parameters:
- Eta – the learning rate (eta > 1 increases the rate so much that the network can “overshoot” a desired minimum; eta = 0.5 – 0.75 seems to work well),
- Alpha – the slope of the transfer function (alpha > 1 increases the slope, alpha = 1 is a decent default),
- Epsilon – the “closeness to desired result” value that signals a stop to network training,
- Types of transfer functions – we’ve been using the logistic (sigmoid), but will likely switch to hyperbolic tangent for the hidden nodes and sigmoid for the output, in order to increase convergence rates and lower the number of training steps (iterations),
- Number of hidden nodes – experimenting to find how many hidden nodes provide convergence (within a reasonable number of training steps) and yet do not have “wasted” nodes; ones that are not really trained to give useful distinctions, or which replicate the role of other nodes.
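To make the first few parameters in this list concrete, here is a small Python sketch. It is a toy, not the course code: `logistic`, `logistic_slope`, and `descend` are hypothetical names, and the descent runs on a one-weight quadratic error E(w) = 0.5 * w^2 rather than on a real network, simply to show how eta and epsilon interact.

```python
import math


def logistic(x, alpha=1.0):
    """Logistic (sigmoid) transfer function; alpha scales the slope."""
    return 1.0 / (1.0 + math.exp(-alpha * x))


def logistic_slope(x, alpha=1.0):
    """Derivative of the logistic with respect to x: alpha * y * (1 - y)."""
    y = logistic(x, alpha)
    return alpha * y * (1.0 - y)


def descend(eta, epsilon=1e-3, w=1.0, max_steps=100):
    """Gradient descent on the toy error E(w) = 0.5 * w**2 (gradient = w).

    Stops once |w| < epsilon, i.e. once w is "close enough" to the minimum at 0.
    Returns (number of updates performed, final w).
    """
    for step in range(max_steps):
        if abs(w) < epsilon:
            return step, w
        w -= eta * w  # weight update: w <- w - eta * dE/dw
    return max_steps, w


print("slope at x = 0:", logistic_slope(0.0, alpha=1.0), "vs", logistic_slope(0.0, alpha=2.0))
for eta in (0.5, 1.5, 2.5):
    steps, w = descend(eta)
    print(f"eta = {eta}: {steps} updates, final w = {w:.3g}")
```

On this toy error surface, eta = 0.5 approaches the minimum steadily, eta = 1.5 jumps past the minimum on every update (the sign of w flips) yet still converges, and eta = 2.5 diverges. The exact thresholds depend on the error surface, so for a real network they will differ.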
In particular, we want to be sure that we force the network to find features that usefully distinguish key characteristics of the different output classes. We want to be sure that these features encompass different possible variants of the same thing, or different input variants that belong to the same class.
This need to force the network to learn features means that we don’t want to let the network “cheat,” that is, have each hidden node learn a specific distinct input, and then pool those inputs towards the appropriate output class. This is why we typically want the number of hidden nodes to be limited.
Ultimately, we have to set up various configurations. Then we need to train, test, and then analyze the responses of the hidden nodes to the different training data and output combinations to see what the hidden nodes are doing. This is a trial-and-error process.
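One way to organize the trial-and-error sweep is sketched below in Python. This is only an outline under stated assumptions: `smallest_adequate_hidden_count` is a hypothetical helper, `train_and_score` stands in for whatever train-and-test routine you actually use, and the scores in the example are made up. It covers the sweep over configurations only, not the follow-up analysis of what each hidden node has learned.

```python
def smallest_adequate_hidden_count(candidates, train_and_score, target_score):
    """Try hidden-node counts from smallest to largest.

    Returns (hidden_count, score) for the first count whose test score
    reaches target_score, or None if no candidate gets there.
    """
    for hidden_count in sorted(candidates):
        score = train_and_score(hidden_count)
        if score >= target_score:
            return hidden_count, score
    return None


# Made-up test-set scores, standing in for real training runs.
pretend_scores = {1: 0.62, 2: 0.81, 4: 0.93, 8: 0.94, 16: 0.94}

result = smallest_adequate_hidden_count(
    candidates=pretend_scores.keys(),
    train_and_score=pretend_scores.get,
    target_score=0.90,
)
print(result)
```

Preferring the smallest count that meets the target reflects the point above about keeping the hidden layer limited, so that nodes are pushed toward learning shared features rather than memorizing individual inputs.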
So, the question might be: if it is this complicated with just a single layer, how can we possibly find any simplicity or guiding rules when we make more complex networks?
Deep Learning Networks: Complexity … and (Surprising!) Simplicity
Clearly, if we can put more hidden layers into a network, we dramatically boost the network’s complexity.
What then, can we say about simplicity? Have we just created an exponentially more difficult problem, or do we have some insights?
Surprisingly, useful insights have emerged over the past decade. A review paper by LeCun, Bengio, and Hinton (in Nature, 2015) summarizes these very well:
- Better understanding of the gradient landscape,
- Learning the implicit feature model in the layer immediately below, and
- Pre-training to get a good rough starting set of weights.
Again, somewhat surprisingly, it is these last two points – taken together – that reduce the complexity of earlier neural network models. While there may be more layers involved, what deep networks are essentially doing is breaking a classification problem down into smaller, simpler classification processes, and then assembling those decision layers.
Deep learning architectures introduce simplicity by having each hidden layer model features in the preceding layer, breaking down the classification problem into a series of smaller (and simpler) modeling and classification tasks.
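The idea that each hidden layer works only on the outputs of the layer before it can be sketched in a few lines of Python. This is a minimal illustration, not a training procedure: the function names, the layer sizes, and the weight values are all hypothetical, and the choice of tanh for hidden layers with a logistic output simply follows the transfer functions mentioned earlier in this post.

```python
import math


def logistic(x):
    """Logistic (sigmoid) transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases, activation):
    """One layer: each node takes a weighted sum of the previous layer's outputs."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed each layer's output into the next; return every layer's activations."""
    activations = [inputs]
    for weights, biases, activation in layers:
        activations.append(layer_forward(activations[-1], weights, biases, activation))
    return activations


# Hypothetical 3 -> 2 -> 2 -> 1 network with arbitrary weights.
example_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], math.tanh),
    ([[1.0, -1.0], [0.4, 0.6]], [0.0, 0.0], math.tanh),
    ([[0.7, -0.3]], [0.2], logistic),
]

for depth, values in enumerate(network_forward([1.0, 0.0, -1.0], example_layers)):
    print(f"layer {depth}: {values}")
```

Each call to `layer_forward` sees only the activations of the layer immediately below, which is the sense in which a deep network can be read as a chain of smaller modeling steps.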
The LeCun et al. paper is worth a read. Not so much for getting exact formulae on what to do where, but more for an overall sense of design philosophy.
It’s kind of amazing, given the complexity of current deep learning models, to realize that there are some simple and pure principles underlying such complex constructions.
Just one more note: interpreting acoustic signals (with speech understanding now a very important application) has evolved hugely over the past three decades. I give links to two more recent papers in the References – Deep Neural Networks section below. You might notice that it’s not so much the architectures that are different. (Yes, they are, but that’s secondary.) The learning rules now follow more of an energy-based model. This approach was certainly around some thirty years ago, but has grown in prominence over the last ten years. The whole notion of energy-based modeling rests on statistical mechanics, and is significantly different from simple gradient descent using backpropagation.
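For readers who have not met energy-based models, a minimal Python sketch of a binary restricted Boltzmann machine (RBM) energy function and its partition function may help. It uses the standard textbook form E(v, h) = -sum(a_i v_i) - sum(b_j h_j) - sum(v_i w_ij h_j); the notation in the papers cited in this post may differ. The parameter names (`a`, `b`, `w`), the tiny model size, and the weight values are illustrative assumptions. The partition function is computed by brute force over every microstate, which is only feasible for very small models.

```python
import math
from itertools import product


def rbm_energy(v, h, a, b, w):
    """Energy of one joint state (v, h) of a binary RBM."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(
        v[i] * w[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -(visible_term + hidden_term + interaction)


def partition_function(a, b, w):
    """Z = sum of exp(-E) over every binary microstate (v, h)."""
    return sum(
        math.exp(-rbm_energy(v, h, a, b, w))
        for v in product((0, 1), repeat=len(a))
        for h in product((0, 1), repeat=len(b))
    )


def joint_probability(v, h, a, b, w):
    """p(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-rbm_energy(v, h, a, b, w)) / partition_function(a, b, w)


# Hypothetical model: 2 visible units, 1 hidden unit, arbitrary parameters.
a, b, w = [0.1, -0.2], [0.3], [[0.5], [-0.4]]
print("Z =", partition_function(a, b, w))
print("p(v=(1,0), h=(1,)) =", joint_probability((1, 0), (1,), a, b, w))
```

Lower-energy states receive higher probability, and the partition function is what normalizes those probabilities so that they sum to one over all microstates.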
Live free or die, my friend –
Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War
Useful Online Technical Discussions
- “How do I decide the number of nodes in a hidden layer of a neural network? I will be using a three layer model” (Quora). Dr. A.J.’s Comments: A good short answer.
- “How to choose the number of hidden layers and nodes in a feedforward neural network” (CrossValidated: StackExchange). Dr. A.J.’s Comments: A good longer answer. Note that the equation given near the bottom of this Q&A (see @doug’s answer by Hobs) is identical to the one shown in the blog above; some things don’t change. This StackExchange Q&A also provides a link to an excellent resource; see the next bullet (below).
- “Section – How many hidden units should I use?” (comp.ai.neural-nets FAQ, Part 3 of 7: Generalization). Dr. A.J.’s Comments: A much more complete and detailed answer; you can get to this post by going to AI-FAQ (Preamble), scrolling down to find the Table of Contents of various topics, identifying the topic of interest, and doing a copy-paste into their search box, which will take you to a selection of relevant pages. The depth and breadth of answers here is truly mind-boggling. Very worth checking out.
References – Early Neural Networks
- Gorman, R.P., and Sejnowski, T.J. Learned classification of sonar targets using a massively parallel network. IEEE Trans. on Acoustics, Speech, and Signal Processing, 36(7), 1135-1140 (July, 1988). pdf.
- Moore, David F. Passive Sonar Target Recognition Using a Back-Propagating Neural Network. Master’s Thesis (U.S. Naval Postgraduate School) (1991). pdf.
- My book chapter (online access via Google Books).
References – Deep Neural Networks
- LeCun, Y., Bengio, Y., and Hinton, G. Deep learning, Nature, 521. (28 May 2015). pdf. Dr. A.J.’s Comments: This isn’t a particularly easy start, but it covers a lot of ground (albeit quickly). It’s one of those papers that needs to be read many times, getting more depth and understanding each time. (Don’t worry if large portions are not clear the first several times that you read it; after multiple reads and after reading much of the other deep learning literature, more of it will sink in.) It covers straightforward, layered networks trained with stochastic gradient descent (SGD) using the backpropagation algorithm, convolutional neural networks, recurrent neural networks, and other topics. Overall, it is a great high-level discussion of deep learning up through the Fort Laramie point on the Oregon Trail of Machine Learning. That means, it does NOT discuss energy-based models. (The paper given below, however, does.)
- Hinton, G., et al. Deep neural networks for acoustic modeling in speech recognition. (Four research groups share their views.) IEEE Signal Processing Magazine, 29(6), 82-97 (November, 2012). doi:10.1109/MSP.2012.2205597. pdf-proof. Dr. A.J.’s Comments: This is by no means an easy read; while at a summary / review level, it presupposes that the reader already understands the field very well. However, for someone who is learning neural networks and deep learning, it is worth reading and rereading this paper several times. Take note, please: Eqn. 5 is a partition function, and Eqn. 6 is the energy function for a restricted Boltzmann machine (RBM); both of these require a bit of statistical mechanics. I give people a starting path for learning that in Seven Essential Machine Learning Equations: A Précis; you’d have to Opt-In and follow the procedures, which lead you into an eight-day tutorial sequence on microstates, which are foundational to partition functions, which are foundational to energy-based models.
- Rebai, I. et al. Improving speech recognition using data augmentation and acoustic model fusion. Procedia Computer Science, 112, 2017, 316-322. doi:10.1016/j.procs.2017.08.003. pdf. Dr. A.J.’s Comments: This usefully discusses some methods for working with speech and other acoustic data. The previous two articles are much more general in scope; this is specific.
Going beyond Fort Laramie: Start with the basics. Microstates => Partition functions => Energy (free energy) => Energy models, such as used in the restricted Boltzmann machine (RBM) and other energy-based neural networks and machine learning methods. Check out the book in progress.
Previous Related Posts
- Selecting a Neural Network Transfer Function: Classic vs. Current – transfer functions are an architectural decision at the microstructure level (inside the node); this blogpost discusses architectural decisions at the mesostructure level – the building blocks for a network. These two are complementary discussions.
- Backpropagation: Not Dead, Not Yet – why backpropagation is important – not just for basic and deep networks, but also for understanding new learning algorithms.
- Deep Learning: The First Layer – an in-depth walk-through / talk-through of the sigmoid transfer function; how it works, how it influences the weight changes in the backpropagation algorithm; similar arguments would apply to the tanh function.
- Getting Started in Deep Learning – a very nice introductory discussion of (semi-)deep architectures and the credit assignment problem; backpropagation-oriented.