
Neural Networks: Backpropagation

Introduction

In the Coursera Machine Learning course by Andrew Ng, week 5 is the hardest week so far, I think; I had to do a ton of research on backpropagation to fully understand it. It felt really great when I eventually answered all the questions that came to mind.

This note is my attempt to explain the formulas that appear in the week 5 videos; I hope it helps you understand them too.

Backpropagation Algorithm

As usual, to minimize $J(\theta)$ using gradient descent, we need to compute $J(\theta)$ itself and:

$$\frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}}$$

As it's hard to solve this formula directly, we can modify it using a trick: applying the chain rule:

$$\frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}} = \frac{\partial J(\theta)}{\partial z_{i}^{(l+1)}} \, \frac{\partial z_{i}^{(l+1)}}{\partial \theta_{ij}^{(l)}} \quad (1)$$

The second factor can be simplified as:

$$\frac{\partial z_{i}^{(l+1)}}{\partial \theta_{ij}^{(l)}} = \frac{\partial \sum_{j=0}^{s_{l}} \theta_{ij}^{(l)} a_{j}^{(l)}}{\partial \theta_{ij}^{(l)}} = a_{j}^{(l)}$$

where $s_{l}$ is the number of units in layer $l$ (not counting the bias unit); the sum starts from $j = 0$ so that it includes the bias unit.
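To make this concrete, here is a minimal pure-Python sketch (the course uses Octave; the function name and values below are purely illustrative) of computing $z_{i}^{(l+1)}$ as a weighted sum that includes the bias unit:

```python
# Pre-activation of unit i in layer l+1: z_i = sum_{j=0}^{s_l} theta[i][j] * a[j],
# where a_0 = 1 is the bias unit (which is why the sum starts at j = 0).
def pre_activation(theta_row, a_prev):
    a_with_bias = [1.0] + list(a_prev)  # prepend the bias unit a_0 = 1
    return sum(t * a for t, a in zip(theta_row, a_with_bias))

# One unit with weights [bias, w_1, w_2] and previous-layer activations [0.5, -1.0]
z = pre_activation([0.1, 0.2, 0.3], [0.5, -1.0])  # 0.1*1 + 0.2*0.5 + 0.3*(-1.0)
```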

And then we define a new value, the "error" of node $i$ in layer $l$:

$$\delta_{i}^{(l)} = \frac{\partial J(\theta)}{\partial z_{i}^{(l)}}$$

So formula $(1)$ can be rewritten as:

$$\frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}} = a_{j}^{(l)} \delta_{i}^{(l+1)}$$

So now we need to compute $\delta_{i}^{(l)}$ for all $i, l$, and there is a recursive formula that helps us do this:

$$\delta_{i}^{(l)} = \sum_{k=1}^{s_{l+1}} \delta_{k}^{(l+1)} \theta_{ki}^{(l)} g'(z_{i}^{(l)}) \quad (2)$$

$$= \sum_{k=1}^{s_{l+1}} \delta_{k}^{(l+1)} \theta_{ki}^{(l)} g(z_{i}^{(l)}) (1 - g(z_{i}^{(l)}))$$

$$= \sum_{k=1}^{s_{l+1}} \delta_{k}^{(l+1)} \theta_{ki}^{(l)} a_{i}^{(l)} (1 - a_{i}^{(l)})$$

where $g(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function, hence $g'(z) = g(z)(1-g(z))$.
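We can sanity-check the identity $g'(z) = g(z)(1-g(z))$ numerically; a small Python sketch (the point $z = 0.7$ is arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)  # the identity g'(z) = g(z) * (1 - g(z))

# Compare against a centered finite difference at an arbitrary point
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```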

And its vectorized form:

$$\delta^{(l)} = (\theta^{(l)})^{T} \delta^{(l+1)} .* g'(z^{(l)})$$

$$= (\theta^{(l)})^{T} \delta^{(l+1)} .* a^{(l)} .* (1 - a^{(l)})$$

where .* denotes element-wise (element-by-element) multiplication.

As we can observe, to calculate $\delta$ in the current layer $l$ we need $\delta$ of its next layer $l+1$ (hence the name backpropagation); the starting point of this recursion is the $\delta$ of the output layer, $\delta^{(L)}$. Now let's compute $\delta$ for the output layer:
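Formula $(2)$ can be sketched in plain Python (a per-element illustration with made-up values; the bias column is skipped, since bias units get no $\delta$):

```python
def backprop_delta(delta_next, theta, a_curr):
    # delta_i^{(l)} = sum_k delta_k^{(l+1)} * theta[k][i] * a_i^{(l)} * (1 - a_i^{(l)})
    # theta[k] = [bias weight, w_k1, w_k2, ...]; column 0 (bias) is skipped.
    delta = []
    for i, a in enumerate(a_curr):
        s = sum(d_next * theta[k][i + 1] for k, d_next in enumerate(delta_next))
        delta.append(s * a * (1.0 - a))  # multiply by g'(z_i) = a_i * (1 - a_i)
    return delta

# One next-layer unit with delta 0.3, weights [bias, 0.5, -0.5],
# and current-layer activations [0.6, 0.4] (illustrative values)
delta = backprop_delta([0.3], [[0.0, 0.5, -0.5]], [0.6, 0.4])
```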

At the output layer, we have $l = L$:

$$\delta_{i}^{(L)} = \frac{\partial J(\theta)}{\partial z_{i}^{(L)}}$$

As we're using the Stochastic Gradient Descent (SGD) method, which updates the $\theta$ matrices after each training example, $J(\theta)$ here is the cost function with respect to a single training example $(x, y)$ (as distinct from the overall cost function):

$$J(\theta) = J(\theta, x, y) = -y \log(h_{\theta}(x)) - (1-y) \log(1-h_{\theta}(x)) \quad (3)$$

Again, at the output layer, we have:

$$h_{\theta}(x) = a^{(L)} = g(z^{(L)}) = \frac{1}{1+e^{-z^{(L)}}}$$

Substituting this into $(3)$ and taking the derivative:

$$\frac{\partial J(\theta)}{\partial z^{(L)}} = h_{\theta}(x) - y$$

As a result:

$$\delta_{i}^{(L)} = \frac{\partial J(\theta)}{\partial z_{i}^{(L)}} = (h_{\theta}(x))_{i} - y_{i}$$
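This result is easy to verify numerically; a small Python check (with arbitrary values $z = 0.8$, $y = 1$) that the derivative of the single-example cost with respect to $z^{(L)}$ really is $h_{\theta}(x) - y$:

```python
import math

def cost(z, y):
    # Single-example cross-entropy cost with a sigmoid output h = g(z)
    h = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(h) - (1.0 - y) * math.log(1.0 - h)

z, y, eps = 0.8, 1.0, 1e-6
numeric = (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)
analytic = 1.0 / (1.0 + math.exp(-z)) - y  # h - y
```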

Now we will prove $(2)$:

$$\delta_{i}^{(l)} = \frac{\partial J(\theta)}{\partial z_{i}^{(l)}} = \sum_{k=1}^{s_{l+1}} \frac{\partial J(\theta)}{\partial z_{k}^{(l+1)}} \frac{\partial z_{k}^{(l+1)}}{\partial z_{i}^{(l)}}$$

$$= \sum_{k=1}^{s_{l+1}} \delta_{k}^{(l+1)} \frac{\partial z_{k}^{(l+1)}}{\partial z_{i}^{(l)}} \quad (4)$$

Because we have:

$$z_{k}^{(l+1)} = \sum_{j=0}^{s_{l}} \theta_{kj}^{(l)} a_{j}^{(l)}$$

so:

$$\frac{\partial z_{k}^{(l+1)}}{\partial z_{i}^{(l)}} = \frac{\partial \sum_{j=0}^{s_{l}} \theta_{kj}^{(l)} a_{j}^{(l)}}{\partial z_{i}^{(l)}} = \sum_{j=0}^{s_{l}} \theta_{kj}^{(l)} \frac{\partial a_{j}^{(l)}}{\partial z_{i}^{(l)}}$$

$$= \sum_{j=0}^{s_{l}} \theta_{kj}^{(l)} \frac{\partial g(z_{j}^{(l)})}{\partial z_{i}^{(l)}} = \theta_{ki}^{(l)} \frac{\partial g(z_{i}^{(l)})}{\partial z_{i}^{(l)}} = \theta_{ki}^{(l)} g'(z_{i}^{(l)})$$

Substituting this into $(4)$:

$$\delta_{i}^{(l)} = \sum_{k=1}^{s_{l+1}} \delta_{k}^{(l+1)} \theta_{ki}^{(l)} g'(z_{i}^{(l)})$$

Backpropagation algorithm:

For a training set of $m$ examples, set $\Delta_{ij}^{(l)} = 0$ for all $i, j, l$; then, for each training example, run forward propagation to get the activations, compute the $\delta$ terms as above, and accumulate $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_{j}^{(l)} \delta_{i}^{(l+1)}$. Finally, average (and regularize) the accumulators to get $D_{ij}^{(l)}$, and we have:

$$\frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}} = D_{ij}^{(l)}$$
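Putting the pieces together, here is a minimal pure-Python sketch of one forward/backward pass for a single example on a tiny 2-2-1 network (the weights, input, and label are made-up illustrative values; this is not the course's code, which is in Octave):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(theta, a_prev):
    # theta[i] = [bias weight, w_i1, w_i2, ...]; a_prev carries no bias entry
    a_in = [1.0] + a_prev
    return [sigmoid(sum(t * a for t, a in zip(row, a_in))) for row in theta]

# Made-up weights for a 2-2-1 network
theta1 = [[0.1, 0.3, -0.2], [0.0, 0.5, 0.4]]  # layer 1 -> layer 2
theta2 = [[-0.3, 0.7, 0.6]]                   # layer 2 -> output layer
x, y = [1.0, 0.5], 1.0

# Forward propagation
a2 = forward_layer(theta1, x)
a3 = forward_layer(theta2, a2)

# Backpropagation
delta3 = [a3[0] - y]  # delta^{(L)} = h_theta(x) - y
delta2 = [sum(d * theta2[k][i + 1] for k, d in enumerate(delta3)) * a2[i] * (1 - a2[i])
          for i in range(len(a2))]

# Single-example gradient for theta2: D_ij = a_j^{(2)} * delta_i^{(3)}
# (a_0 = 1 for the bias; no regularization term here)
D2 = [[d * a for a in [1.0] + a2] for d in delta3]
```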

Backpropagation Intuition

There is no $\delta$ value for bias units (we could still compute one, but it is never needed).

Backpropagation runs in the direction opposite to feedforward propagation: for each training example $(x, y)$, activations flow forward while $\delta$ terms flow backward, and the weight matrices (the $\theta$ matrices) are used similarly in the two algorithms.

Backpropagation in Practice

Implementation Note: Unrolling Parameters

Previously, to optimize the cost function of linear regression and logistic regression, in Octave we did:

function [jVal, gradient] = costFunction(theta)
...
optTheta = fminunc(@costFunction, initialTheta, options)

where theta and gradient are vectors.

Now, in neural networks, they are not vectors anymore but matrices, so we "unroll" them into vectors.

For example, if we have:

$\theta^{(1)} \in \mathbb{R}^{10 \times 11}$

$\theta^{(2)} \in \mathbb{R}^{8 \times 1}$

and we want to put them in a single vector, in Octave we can do this:

thetaVec = [ theta1(:) ; theta2(:)];

theta1(:) stacks all of theta1's columns into one single column vector. And to get the matrices back:

theta1 = reshape(thetaVec(1:110), 10, 11);
theta2 = reshape(thetaVec(111:118), 8, 1);
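For comparison, a pure-Python sketch (hypothetical helper names) of the same unroll/reshape round trip; note that Octave's (:) stacks columns, i.e. it is column-major:

```python
def unroll(mat):
    # Stack the matrix's columns into one flat list, like Octave's theta(:)
    rows, cols = len(mat), len(mat[0])
    return [mat[i][j] for j in range(cols) for i in range(rows)]

def reshape(vec, rows, cols):
    # Inverse of unroll: refill column by column, like Octave's reshape
    return [[vec[j * rows + i] for j in range(cols)] for i in range(rows)]

m = [[1, 2, 3], [4, 5, 6]]  # a 2 x 3 matrix
v = unroll(m)               # columns stacked: [1, 4, 2, 5, 3, 6]
```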

Gradient checking

Two-sided difference estimate:

$$gradApprox = \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon} \approx \frac{dJ(\theta)}{d\theta}$$

this will give us a numerical estimate of the gradient at that point.

When $\theta$ is a vector:

$$\frac{\partial J(\theta)}{\partial \theta_{i}} \approx \frac{J(\ldots, \theta_{i}+\epsilon, \ldots) - J(\ldots, \theta_{i}-\epsilon, \ldots)}{2\epsilon}$$

In the implementation (again, $\theta$ is a vector unrolled from the weight matrices):

for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*EPSILON);
end;

and then check that $gradApprox \approx DVec$, where $DVec$ is the gradient we got from backpropagation.
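The same loop can be sketched in Python, checked on a toy cost $J(\theta) = \sum_i \theta_i^2$ whose exact gradient $2\theta$ we know (illustrative only; the course implementation is in Octave):

```python
def grad_approx(J, theta, eps=1e-4):
    # Two-sided numerical gradient estimate, one component at a time
    approx = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps
        minus[i] -= eps
        approx.append((J(plus) - J(minus)) / (2 * eps))
    return approx

# Toy cost with a known exact gradient of 2 * theta
J = lambda t: sum(x * x for x in t)
theta = [1.0, -2.0, 0.5]
approx = grad_approx(J, theta)  # close to [2.0, -4.0, 1.0]
```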

Implementation note: once you have verified that backpropagation produces gradients close to the numerical estimate, turn gradient checking off before training. The main reason we use the backprop algorithm rather than the numerical gradient computation during learning is that the numerical method is very slow, while backprop is much more efficient.

Random Initialization

In neural networks, initializing all $\theta$ values to 0 (or to any single value) doesn't work: after each update, the parameters corresponding to the inputs going into each hidden unit stay identical, giving a highly redundant representation that prevents the network from learning anything interesting.

Solution: Random Initialization (symmetry breaking).

Initialize each $\theta$ to a random value in $[-\epsilon, \epsilon]$:

theta1 = rand(10, 11) * (2*INIT_EPSILON) - INIT_EPSILON;
theta2 = rand(8, 1) * (2*INIT_EPSILON) - INIT_EPSILON;
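A pure-Python version of the same idea (a sketch; 0.12 is just an illustrative $\epsilon_{init}$ value):

```python
import random

def rand_init(rows, cols, eps_init):
    # Draw each weight uniformly from [-eps_init, eps_init] to break symmetry
    return [[random.uniform(-eps_init, eps_init) for _ in range(cols)]
            for _ in range(rows)]

theta1 = rand_init(10, 11, 0.12)
```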

The $\epsilon$ here has nothing to do with the $\epsilon$ used in gradient checking.

One effective strategy for choosing $\epsilon_{init}$ is to base it on the number of units in the network. A good choice is:

$$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$$

where $L_{in} = s_{l}$ and $L_{out} = s_{l+1}$ are the numbers of units in the layers adjacent to $\theta^{(l)}$.
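For example, with illustrative adjacent layer sizes of 10 and 11 units, the formula gives roughly 0.53:

```python
import math

def eps_init(l_in, l_out):
    # epsilon_init = sqrt(6) / sqrt(L_in + L_out)
    return math.sqrt(6) / math.sqrt(l_in + l_out)

e = eps_init(10, 11)  # roughly 0.53 for layers of 10 and 11 units
```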

Putting it together

The first thing we need to do when training a network is to pick a network architecture: how many hidden layers, how many units in each layer, and so on.

Number of input units: the dimension of the features $x^{(i)}$.

Number of output units: Number of classes.

Reasonable default: use a single hidden layer; if you use more than one hidden layer, give every hidden layer the same number of units (usually, the more units the better).

Steps to train a neural network:

1. Randomly initialize the weights.
2. Implement forward propagation to get $h_{\theta}(x^{(i)})$ for any $x^{(i)}$.
3. Implement code to compute the cost function $J(\theta)$.
4. Implement backpropagation to compute the partial derivatives $\frac{\partial J(\theta)}{\partial \theta_{ij}^{(l)}}$.
5. Use gradient checking to confirm that backpropagation is correct, then disable it.
6. Use gradient descent (or an advanced optimizer) together with backpropagation to minimize $J(\theta)$.

In the implementation, we use a for loop over the training examples:

for i = 1:m
  % perform forward propagation and backpropagation using example (x(i), y(i))
  % (get activations and delta terms for each layer)
end;

In neural networks, $J(\theta)$ is a non-convex function, so the gradient descent algorithm can get stuck in local minima. In practice this is not a huge problem: gradient descent can usually find a good local optimum even if it doesn't reach the global one.


