Selecting a Neural Network Transfer Function: Classic vs. Current

Neural Network Transfer Functions: Sigmoid, Tanh, and ReLU

 

Making it or breaking it with neural networks: how to make smart choices.

 
 

Why We Weren’t Getting Convergence

 

This last week, working with a very simple, straightforward XOR neural network, a lot of my students were having convergence problems.

The most likely reason?

Very likely, my choice of transfer function.

Neural network for the X-OR problem, showing the credit assignment ("backpropagation") path. The bias nodes (included in the actual network) are not shown here.

I had given them a very simple network. (Lots of them are still coming up to speed in Python, so simple = good.)

The network diagram above shows the flow of credit assignment from the final summed-squared-error (SSE) back to the input nodes, and particularly to the various connection weights. (The bias terms for both the hidden and output nodes are included in the code, but are not shown in the figure.)

My students were getting convergence sometimes, but not all the time.

Since they were (mostly) newbies to neural networks, they were doing things like running their network for many, MANY iterations – hoping that it would finally converge.

This sometimes-converging and sometimes-not was frustrating for them. Not to mention, it was taking too much of their time.

It’s ok for all of us to experiment, but we don’t have infinite time for go-nowhere experimentation, so I stepped back to think about root cause.

The most likely trip-up? It wasn't our scaling parameters: the alpha inside the transfer function, or the eta that controlled our learning rate.

Most likely, the problem was that I was using the same sigmoid (logistic) transfer function in BOTH the hidden and output layers.

For a simple classification network, a sigmoid transfer function for the output nodes makes sense. 1 or 0, yes or no.

Using that same sigmoid in the hidden layer, though, tends to make things tricky.

Here’s the equation that’s impacted; it is the change to a specific hidden-to-output connection weight, v(h,o).
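In standard backpropagation notation, with the summed-squared error and a sigmoid transfer function at the output node (your own variable names may differ), that update takes the form

\[
\Delta v_{h,o} \;=\; \eta \, \delta_o \, H_h,
\qquad
\delta_o \;=\; \alpha \, (T_o - O_o)\, O_o \, (1 - O_o),
\]

where T_o is the target value for output node o, O_o is that node's actual output, H_h is the output of hidden node h, eta is the learning rate, and alpha is the scaling parameter inside the sigmoid.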

If the net input to a hidden node is strongly negative, then the sigmoid output of that node is close to zero, which makes it hard to backpropagate a change value through any weight that involves that hidden node's output. (That output is the H_h term in the equation above.)

The most likely useful fix is to use two different transfer functions:

  • Output nodes: Sigmoid (logistic) transfer function, and
  • Hidden nodes: Tanh (hyperbolic tangent) transfer function.

It's time for me to add just a bit more to that code. It may very well be that experimenting with two versions, one that uses the sigmoid (logistic) function throughout and another that uses the sigmoid for the output layer and the hyperbolic tangent (tanh) for the hidden layer, will give us a very nice side-by-side comparison.
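As a starting point for that experiment, here is a minimal sketch of the two transfer functions and their derivatives; the function names and the alpha default are illustrative, not the exact names in the class code.

import numpy as np

def sigmoid(x, alpha=1.0):
    # Logistic transfer function; output lies in (0, 1).
    return 1.0 / (1.0 + np.exp(-alpha * x))

def sigmoid_deriv(f_out, alpha=1.0):
    # Derivative of the sigmoid, written in terms of its own output.
    return alpha * f_out * (1.0 - f_out)

def tanh_transfer(x, alpha=1.0):
    # Hyperbolic tangent transfer function; output lies in (-1, 1).
    return np.tanh(alpha * x)

def tanh_deriv(f_out, alpha=1.0):
    # Derivative of tanh, written in terms of its own output.
    return alpha * (1.0 - f_out ** 2)

# Version 1 of the experiment: sigmoid at both the hidden and output layers.
# Version 2: tanh_transfer at the hidden layer, sigmoid at the output layer.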

Here’s a brief recap of transfer functions, including the best links that I could find while prepping this week.

 
 

Classic Transfer Functions: Sigmoid and Tanh

 

When functional, problem-solving neural networks emerged in the late 1980s, two kinds of transfer functions were most often used: the logistic (sigmoid) function and the hyperbolic tangent (tanh) function. Both of these functions are continuous (smooth), monotonically increasing, and bounded. The sigmoid function is bounded between 0 and 1, and the hyperbolic tangent (tanh) function is bounded between -1 and 1.
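Written out, with alpha as the scaling parameter mentioned earlier, the two classic functions are

\[
\sigma(x) \;=\; \frac{1}{1 + e^{-\alpha x}},
\qquad
\tanh(\alpha x) \;=\; \frac{e^{\alpha x} - e^{-\alpha x}}{e^{\alpha x} + e^{-\alpha x}}.
\]

They are closely related, since tanh(x) = 2σ(2x) - 1; the tanh is just a sigmoid rescaled to the (-1, 1) range and recentered on zero.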

Two classic neural network transfer functions. (a) Sigmoid (logistic) function, (b) Hyperbolic tangent (tanh) function.

When I first started working with neural networks, I used the sigmoid function. Even now, it’s a very good choice for an output unit transfer function, because it gives an output value between 0 and 1. This is useful if we’re doing a classification neural network.

The hyperbolic tangent (tanh) function has often been much more effective within the neural network itself; i.e., within the hidden nodes. The reason is that when we use the sigmoid function, a set of negative inputs into the node causes the sigmoid transfer function to produce an output close to 0. That means that it is then hard to change the weights associated with that hidden node; it essentially becomes a dead node.
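A quick numeric check makes the point concrete; the value -4 is just an illustrative net input.

import numpy as np

net_input = -4.0                                  # a strongly negative net input to a hidden node
sigmoid_out = 1.0 / (1.0 + np.exp(-net_input))    # about 0.018: essentially zero
tanh_out = np.tanh(net_input)                     # about -0.999: far from zero

# The hidden-to-output weight update is proportional to the hidden node's
# output (the H_h term), so with the sigmoid that update all but vanishes,
# while with tanh the node still passes a usable signal downstream.
print(sigmoid_out, tanh_out)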

Sidenote: This last week, working with the XOR neural network, my students were getting frustrated with how often the network failed to converge. They had bias nodes into both the hidden and output nodes, but the network that I’d given them used the sigmoid transfer function throughout. I bet that if I rewrote that code so that there was a hyperbolic tangent transfer function at the hidden node, and still kept the sigmoid transfer function at the output node, we’d get more convergence. That’s a little project for this week.

 
 

Current Transfer Functions: ReLU

 

More recently, the ReLU (Rectified Linear Unit) function has become popular. It is neither smooth nor bounded, but works well in applications that have very large numbers of units (e.g., convolutional neural networks, or CNNs) as well as for the Restricted Boltzmann Machine (RBM).

The basic ReLU function is continuous, but it is not smoothly differentiable; it has a kink at zero. However, we can code around that small issue. Also, there is a lovely approximation to the ReLU that is smooth and differentiable everywhere.
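For reference, the basic ReLU and its smooth, logarithmic approximation (usually called the softplus) are

\[
\mathrm{ReLU}(x) \;=\; \max(0,\, x),
\qquad
\mathrm{softplus}(x) \;=\; \ln\!\left(1 + e^{x}\right).
\]

The softplus has the tidy property that its derivative is exactly the logistic sigmoid, 1/(1 + e^{-x}).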

Both functions are shown in the following figure.

The ReLU approximation and the actual ReLU function; both are useful as transfer functions in convolutional neural networks (CNNs) and in restricted Boltzmann machines (RBMs).

 
 

Summary and What to Watch/Read

 

Quick Recap:

  1. Hyperbolic tangent (tanh) function: For hidden nodes in most neural architectures that are trained using backpropagation, and for output nodes if you don’t mind the lower value being -1 instead of 0.
  2. Sigmoid (logistic) function: For output nodes in a network where you want the outputs to be between 0 and 1.
  3. ReLU: For hidden nodes in a convolutional neural network (CNN) where you are generating outputs of the convolutional layers, and for the hidden (“latent”) nodes in an RBM.
  4. Logarithmic ReLU approximation: For cases where you want ReLU-like properties together with smooth differentiability.

The readings below are selected from some good web sources. I highly recommend the online videos. The discussions in various technical forums are very good.

I’ve identified two high-quality papers that have been frequently cited: Michael Jordan for the logistic function (a 1995 classic), and a more recent paper by Nair and Hinton on the ReLU. Both are good if you need something a bit heavyweight.

Enjoy!

 


 

Live free or die, my friend –

AJ Maren

Live free or die: Death is not the worst of evils.
Attr. to Gen. John Stark, American Revolutionary War

 
 

Good Online Video Tutorials

 

 
 

Useful Online Technical Discussions

 

 
 

Backpropagation-specific Online Technical Discussions

 

 
 

Good Citable References

 

  • Jordan, M. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Technical Report, Massachusetts Institute of Technology.
  • Nair, V., & Hinton, G.E. (2010). Rectified linear units improve restricted Boltzmann machines. Proc. 27th International Conference on Machine Learning (ICML'10), Haifa, Israel, June 21-24, 2010, 807-814.

 
 

Previous Related Posts

  • Backpropagation: Not Dead, Not Yet – previous week’s post; why backpropagation is important – not just for basic and deep networks, but also for understanding new learning algorithms.
  • Deep Learning: The First Layer – an in-depth walk-through / talk-through the sigmoid transfer function; how it works, how it influences the weight changes in the backpropagation algorithm; similar arguments would apply to the tanh function.
  • Getting Started in Deep Learning – a very nice introductory discussion of (semi-)deep architectures and the credit assignment problem; backpropagation-oriented.

 