Deep Learning: The First Layer

It’s been something of a challenging week. We’ve kicked off our new PREDICT 490 Special Topics Course in Deep Learning at Northwestern University. I’ve got a full house; there’s been a waiting list since Thanksgiving, and everyone is acutely aware of the business climate surrounding deep learning.

However (and I’m not terribly surprised here), most people who want to learn Deep Learning (DL) really don’t have a solid foundation in neural networks just yet. Thus, what we’re really doing is cramming two courses into one; the foundations of neural networks, and then extending those foundations into the deep learning philosophy and practice.

Another challenge that most students are facing this week is that their knowledge of Python is a bit wobbly.

Even though there are a lot of new Deep Learning resources out there, they are often a bit too abstract, or written with too much of an expectation that the reader understands both mathematics and neural networks, to be useful as introductory deep learning materials. Further, the available resources are not well-sorted. There’s a lot out there, but for someone who’s looking for guidance on what to read first, and where to give the most attention, it’s a tough call.

Brain-Based Computing; book in progress by A.J. Maren, with a focus on deep learning and its applications.

As a result, I’m writing my own deep learning book, provisionally called Brain-Based Computing. I’m also changing the book-writing plans that I had previously announced on this website; within a day or so, you should see all references to prior books-in-progress disappear, and the focus limited to two: one for deep learning, and another for text analytics / computational linguistics. Both correspond to courses that I’m teaching in Northwestern University’s Master of Science in Predictive Analytics program.

I’ve also made a commitment to myself to write a blogpost about this subject every Thursday, whether ready or not.

It’s Thursday, Jan. 5th, 2017, and of course I’m not ready. That means I’m not ready despite having put the entire weekend into class preps (including skipping church on Sunday, which means that I’m in serious overdrive mode). Not ready despite having chunked out ten pages of text for the new book, most of which is partial derivatives (the ever-famous derivation of the backpropagation equation), and being in the midst of writing Python code for the same. (More on that code later in this post.)

So, ready or not, it’s blogpost time, and I’m picking a topic that I haven’t really seen addressed. (This could, of course, be simply due to lack of sufficient reading on my part.)

What I’ll do in these next few paragraphs is take the core equation for the backpropagation of error, just for the first part (that is, applied to the hidden-to-output connection weights), and discuss what the equation really means.

First, here’s the basic equation.

∂SSE/∂v(h,o) = -alpha * E(o) * F(o) * (1 - F(o)) * H(h)

This is the standard equation for the first part of backpropagation, and you can find the derivation in many places, including Brain-Based Computing. As a quick online reference, though, you might look at Matt Mazur, A Step-by-Step Backpropagation Example. I will shortly upload the chapter draft and my code to this site, and probably to ResearchGate as well. (Please check back for the links within a couple of days.)

Here’s a quick list of what the variables in this equation mean:

  • SSE: Summed Squared Error, which is taken across the entire set of output nodes (that’s the sum); when we take the derivative with respect to v(h,o), though, only the squared-error term at the output node o survives,
  • v (at h,o): Connection weight between the hidden node h and the output node o; connection weights between hidden and output nodes are termed v, and between input and hidden nodes w, at least until we start adding more layers of neurons and their weights,
  • E (at o): Error term at the output node o; E(o) = Desired(o) – Actual(o),
  • F (at o): Transfer function applied to the summed weighted inputs (NdInput) at the output node o; F(o) = F(NdInput(o)) = F(Sum(v(h,o)*H(h))), and in this derivation the standard sigmoid function is used (see below),
  • NdInput (at o): The total input to the node o, that is, the summed weighted inputs; Sum(v(h,o)*H(h)),
  • H (at h): Output of the hidden node h,
  • alpha: Parameter in the transfer function F.

Note that the derivative of the transfer function, F, with respect to NdInput (the sum of weighted inputs into the node, to which the transfer function is applied) is alpha*F*(1-F); this, together with the minus sign that comes from differentiating the error term E(o) = Desired(o) – Actual(o), gives us most of the terms in the equation above.

We will decrement the weight v(h,o) by a small amount proportional to the value in the equation above: the partial derivative of SSE (the summed squared error) with respect to the hidden-to-output weight v(h,o).
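
If it helps to see this in code, here is a minimal Python/NumPy sketch of that update for the hidden-to-output weights. To be clear, this is my illustrative sketch rather than the code from the book: the variable names (sigmoid, d_sigmoid_dx, eta) and the sample values for H and v are all made up for the example, and eta is just the small learning-rate factor that scales the decrement.

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    """Classic sigmoid transfer function: F = 1 / (1 + exp(-alpha*x))."""
    return 1.0 / (1.0 + np.exp(-alpha * x))

def d_sigmoid_dx(x, alpha=1.0):
    """Derivative of the sigmoid with respect to its input: alpha * F * (1 - F)."""
    f = sigmoid(x, alpha)
    return alpha * f * (1.0 - f)

# Made-up values for illustration: three hidden-node outputs H(h) and their
# connection weights v(h,o) into a single output node o.
H = np.array([0.6, 0.4, 0.7])      # hidden-node outputs, H(h)
v = np.array([0.3, -0.2, 0.5])     # hidden-to-output weights, v(h,o)
desired = 1.0                      # Desired(o)
alpha = 1.0                        # transfer-function parameter
eta = 0.5                          # small learning-rate factor (illustrative)

nd_input = np.dot(v, H)            # NdInput(o) = Sum over h of v(h,o)*H(h)
actual = sigmoid(nd_input, alpha)  # F(o) = F(NdInput(o))
error = desired - actual           # E(o) = Desired(o) - Actual(o)

# Partial derivative of SSE with respect to each v(h,o):
#   dSSE/dv(h,o) = -E(o) * alpha*F(o)*(1 - F(o)) * H(h)
grad_v = -error * d_sigmoid_dx(nd_input, alpha) * H

# Decrement each weight by a small amount proportional to that gradient.
v = v - eta * grad_v
```

Because the gradient is negative whenever the error, F(1-F), and H(h) are all positive, the subtraction nudges v(h,o) upward, which is exactly the behavior we walk through next.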

Let’s interpret this equation.

Suppose that we’re training a simple neural network with backpropagation, and that we desire an output of 1, and get an actual output less than 1; suppose it is about 0.5. Then we have (Desired(o) – Actual(o)) = 1 – 0.5 = 0.5, which is a positive number.

The classic sigmoid transfer function: F = 1/(1 + exp(-alpha*x)).

Let’s take alpha = 1. Let’s assume, for the moment, that NdInput(o), the sum of the weighted inputs to the output node o, is positive. Since this value is positive, the transfer function F applied to it will be greater than 0.5 (F > 0.5). The derivative factor, F*(1-F), will also be positive. (It is always positive, in fact, because F is a smooth, monotonically increasing function whose value lies strictly between 0 and 1.)

The derivative of the sigmoid transfer function.

So, returning to our discussion of what we will do in terms of “backpropagating error” to the connection weight v(h,o): we will decrement this weight by a small amount proportional to -(error)*F(1-F)*H. We have already identified, for the purposes of this example, that both the error term and H are positive, and that F(1-F) is always positive. (See the figure for the derivative of the transfer function, just above.)

Thus, we will decrement (subtract) from the initial weight v(h,o) a small amount proportional to a negative number, because the backpropagation result is a minus sign applied to a set of positive terms, and is therefore negative. Subtracting a negative means that we will actually increase v(h,o) just a little bit.

Let’s look at what happens when we increase v(h,o).

The activation (or output) of the output node is already (in our example) positive and greater than 0.5, but not yet at its target of being near 1.0. It is the result of the transfer function being applied to the node’s summed input, that is, it is F(NdInput). In our example, NdInput is already positive, and so is H. Thus, when we increase v(h,o) (moving it towards a higher positive value), we increase the weighted contribution v(h,o)*H(h), which increases NdInput, which in turn increases F(NdInput), and that moves the output of the output node closer to our goal of 1.
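
To make that walkthrough concrete, here is a small numeric check. The values for H and v are made up, and the sigmoid and learning-rate factor are the same illustrative choices as in the sketch above; the point is only that repeated updates do push the output toward 1.

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * x))

H = np.array([0.8, 0.6])       # positive hidden-node outputs (made up)
v = np.array([0.4, 0.3])       # positive hidden-to-output weights (made up)
desired, alpha, eta = 1.0, 1.0, 0.5

for step in range(3):
    nd_input = np.dot(v, H)
    actual = sigmoid(nd_input, alpha)
    error = desired - actual                               # positive, as in the example
    grad_v = -alpha * error * actual * (1.0 - actual) * H  # negative in every component
    v = v - eta * grad_v                                   # subtracting a negative: v grows
    print(f"step {step}: output = {sigmoid(np.dot(v, H), alpha):.4f}")

# The printed output climbs toward 1.0 with each step, confirming the reasoning above.
```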

What we’ve just done is walk through the influence of the backpropagation step on a single hidden-to-output weight, and verify – at least for this particular imaginary example – that the process works.

Let’s take just one more moment to consider the impact of the various terms here on the weight adjustment. What we’re really looking at is the details of the credit assignment problem, which was my kick-off topic last week for this series on deep learning.

What we’re after is a “does this make sense?” reality check, as we look at the specific terms in the equation posted above:

  • E (at o): The greater the error, the bigger the influence on the connection weight, v, which makes sense,
  • The derivative of the transfer function F (at o), which is alpha*F*(1-F): thinking only about the terms involving the transfer function itself (F*(1-F)), we can see from the figure above that when the node input (NdInput) is a large positive or negative value, the derivative of F is a small positive value, so that there is minimal change to the connection weight when the node input is already large. Essentially, if the node inputs are already putting a massive value into the output node, there is little use in changing the weight just a little; they are already maxing out the node (see the short sketch just after this list),
  • H (at h): Output of the hidden node h, and this is more interesting than the transfer-function factor (which is applied to the sum of all the node inputs): here, there is a direct, linear influence from the magnitude of the output of the specific hidden node for which v(h,o) is the connection weight; this is something that we’d want and expect, and finally
  • alpha: Parameter in the transfer function F; the greater the value for alpha, the greater the change to the connection weight.
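
Here is the quick sketch promised in the list above, illustrating the saturation effect in the second bullet. The NdInput values are arbitrary; the point is how the alpha*F*(1-F) factor, and hence the size of the weight change, shrinks once the node input is large in either direction.

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * x))

alpha = 1.0
for nd_input in (-6.0, -2.0, 0.0, 2.0, 6.0):   # illustrative NdInput values
    f = sigmoid(nd_input, alpha)
    deriv_factor = alpha * f * (1.0 - f)       # the alpha*F*(1-F) factor in the update
    print(f"NdInput = {nd_input:+.1f}   F = {f:.3f}   alpha*F*(1-F) = {deriv_factor:.4f}")

# The factor peaks at NdInput = 0 (0.25 when alpha = 1) and falls toward zero as the
# node saturates, so nodes that are already maxed out receive only tiny weight changes.
```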

This leads me to a couple of concluding remarks.

First, we want to use weight updates to push the output of our nodes – hidden as well as (final layer) output nodes – towards either 0 or 1. A node that outputs a middle-value, say in the neighborhood of 0.5, is not of much use.

Let’s take, as an example, a case where we have three hidden nodes, all generating an output of 0.5, and each feeding into an output node with a positive connection weight of about 1. The output node then receives 3*(1*0.5), or 1.5, and interprets this as a strong positive input.

However, three “eh, I don’t much care” responses do NOT make for a strong, resounding, positive “Yes!” In other words, three “eh” outputs from hidden nodes should not be generating a strong positive input into the final output node. We handle this by adjusting connection weights, of course, but it brings home the fact that we really want, at each level of the network, to drive the nodes towards (approximately) either a 0 or a 1 output. This is true for the hidden nodes as much as for the final output nodes.
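
Here is that arithmetic in a few lines of illustrative code, contrasting three lukewarm 0.5 outputs with three genuinely confident outputs near 1 (the specific numbers are, again, made up):

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * x))

weights = np.array([1.0, 1.0, 1.0])        # positive hidden-to-output weights, about 1 each

lukewarm = np.array([0.5, 0.5, 0.5])       # three "eh, I don't much care" hidden outputs
confident = np.array([0.95, 0.95, 0.95])   # three near-1 "Yes!" hidden outputs

for label, h in (("lukewarm", lukewarm), ("confident", confident)):
    nd_input = np.dot(weights, h)          # summed weighted input to the output node
    print(f"{label}: NdInput = {nd_input:.2f}, output = {sigmoid(nd_input):.3f}")

# The lukewarm case already gives NdInput = 1.5 and an output of about 0.82; the
# output node reads three shrugs as a fairly strong "yes" unless the connection
# weights are adjusted to compensate.
```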

This means that the job of the transfer function is to drive the node towards an output at either extreme: either towards the right-hand side (where the outputs are close to 1), or towards the left-hand side (where the outputs are close to 0). The middle area, where the function acts as an approximately linear scaling, is really not of much use to us. (See the previous figure in this post, showing the sigmoid transfer function.)

This helps us to think through the role of alpha, which controls the steepness of the transfer function curve. When alpha = 1, there is a fairly broad range – say, node inputs (NdInput) between -0.9 and 0.9 – over which the transfer function is approximately linear and the outputs stay well away from 0 and 1. That is, when alpha = 1, we’re not getting much of a push towards either of the transfer function extremes, even though that push is exactly what we want.

However, if we increase alpha (say, alpha = 2, as is shown in that figure), while we will get a transfer function “push” towards either 0 or 1 as an output, we’ll also impact the adjustments to the connection weights. In fact, we would be making twice the size of the change. If we’re in a tricky gradient landscape, this might move us too fast. In particular, it might move us too fast if we’re constantly adjusting a particular weight in response to different inputs and their desired outputs.

Alternatively, if we make our alpha too small (e.g., alpha = 0.5), we reduce the size of the changes to the connection weights, but at the cost of having a “mushier” transfer function output; we’re not getting as clean a separation into high (approximately 1) and low (approximately 0) outputs; at least, not until we’ve adjusted the weights enough that the summed weighted inputs into the next layer’s transfer function become very large, whether positive or negative.
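
To see that trade-off numerically, here is a short sketch comparing the sigmoid output and the alpha*F*(1-F) update factor for a few values of alpha; the specific NdInput values are arbitrary.

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * x))

for alpha in (0.5, 1.0, 2.0):
    for nd_input in (0.9, 2.0):
        f = sigmoid(nd_input, alpha)
        factor = alpha * f * (1.0 - f)     # scales the weight adjustment
        print(f"alpha = {alpha:.1f}   NdInput = {nd_input:.1f}   "
              f"F = {f:.3f}   alpha*F*(1-F) = {factor:.3f}")

# Larger alpha pushes F closer to 0 or 1 for the same NdInput (a crisper output).
# In the middle range (NdInput = 0.9) it also enlarges the update factor, so the
# weight changes are bigger; for larger inputs (NdInput = 2.0) a big alpha saturates
# the node sooner and the factor shrinks again. Smaller alpha gives gentler updates
# but "mushier" outputs.
```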

Most of the time, people use an alpha of about 1.0. To influence the rate at which the connection weights are adapted, they may instead change the learning rate (that is, the scaling factor applied to the weight update; not described in this tutorial) over time. That is, as the overall errors decrease (with training), they may decrease the extent to which the weights are adapted, leading to smaller and more subtle weight changes.
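
The simplest version of that idea is a decaying learning-rate factor. The schedule below is just one common illustrative choice (the names eta0 and decay are mine), not a method prescribed in this tutorial or the book:

```python
eta0 = 0.5       # initial learning-rate factor (illustrative)
decay = 0.1      # decay constant (illustrative)

for epoch in range(0, 50, 10):
    eta = eta0 / (1.0 + decay * epoch)   # weight changes shrink as training proceeds
    print(f"epoch {epoch:2d}: eta = {eta:.3f}")
```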

The literature is rife with strategies for tweaking the backpropagation method. In this tutorial, though, we’ve put our attention on the simple update rule itself, and on the first part of backpropagation only, assessing the impact and relative importance of each term in that rule.

The next post will continue with backpropagation just one more time, and then we’ll shift our attention to the Boltzmann machine, which has become the workhorse of deep learning algorithms. We’ll look, in particular, at how the two compare; the relative strengths and weaknesses of each, along with other factors for creating deep learning architectures.

If you haven’t already, please sign up to receive a short email letting you know about new posts in this series. (Use the Opt-In form on the right-hand sidebar.) That way, you’ll stay current with this ongoing series on neural networks and deep learning, without having to remind yourself to come back to this site.

Comments or questions? Use the Comments form below; I’ll get back to you.

Looking forward to our next conversation!

All my best – AJM

Neural Networks and Deep Learning Resources

Nils Goerke posts a very good response to a question raised by Peshawa Jammal Muhammad Ali on ResearchGate (an exchange similar to StackOverflow, but directed a bit more towards theory and less towards programming); see How to select the best transfer function for a neural network model?. He advocates a hyperbolic tangent function, while I’m using a sigmoid; the two have similar shapes, but the tanh output ranges from -1 to 1 while the sigmoid output ranges from 0 to 1. I’m selecting the sigmoid mostly because it is a classic, and it easily illustrates the points that I’m making here. Both are good.
