
Let's assume that we are using a batch size of 100 samples for learning.

So in every batch, the weight of every neuron (and bias, etc.) is updated by adding minus the learning rate * the average error value that we found using the 100 samples * the derivative of the error function with respect to the current neuron weight that is being updated.
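
In symbols, I think the update for each weight $w$ is something like this ($\eta$ being the learning rate and $E_i$ the error on sample $i$):

$$w \leftarrow w - \eta \cdot \frac{1}{100} \sum_{i=1}^{100} \frac{\partial E_i}{\partial w}$$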

Now, when we use a Max Pool layer, how can we compute the derivative over this layer? In every sample that we feed forward, a different pixel (let's say) is chosen as the max, so when we backpropagate over 100 samples in which a different path was chosen each time, how can we do it? A solution I have in mind is to remember every pixel that was chosen as the maximum, and then maybe split the derivative over all the max-pixels. Is this what's being done?

Nathan G

2 Answers


When a neural network processes a batch, all activation values for each layer are calculated for each example (possibly in parallel per example, if the library and hardware support it). Those values are stored for possible later use - i.e. one value per activation per example in the batch; they are not aggregated in any way.

During back propagation, those activation values are used as one of the numerical sources for calculating gradients, along with the gradients calculated so far working backwards and the connecting weights. Like forward propagation, back propagation is applied per example; it does not work with averaged or summed values. Only when all examples have been processed do you work with the summed or averaged gradients for the batch.

This applies equally to max pool layers. Not only do you know what the output from the pooling layer for each example in the batch was, but you can look at the preceding layer and determine which input to the pool was the maximum.
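
To make that concrete, here is a minimal NumPy sketch (not how any particular library actually implements pooling) of the forward bookkeeping for a single pooling window across a batch; the batch size, window size and values are all made up:

```python
import numpy as np

# Toy setup: a batch of 4 examples, each with a single 2x2 pooling window
# flattened to 4 values. A real network has many windows per channel per
# example, but the bookkeeping is the same for each of them.
np.random.seed(0)
batch = np.random.rand(4, 4)           # shape: (batch_size, window_size)

# Forward pass: take the max per example and remember *which* input won.
max_idx = np.argmax(batch, axis=1)     # one stored index per example
pooled = batch[np.arange(4), max_idx]  # pooled output, one value per example

print(max_idx)  # with this seed: [1 3 0 1] -- different examples pick different inputs
print(pooled)
```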

Mathematically, and avoiding the need to define indices for NN layers and neurons, the rule can be expressed like this:

  • The forward function is $m = \max(a, b)$

  • We know $\frac{\partial J}{\partial m}$ for some target function $J$ (in the neural network that will be the loss function we want to minimise, and we are assuming we have backpropagated to this point already)

  • We want to know $\frac{\partial J}{\partial a}$ and $\frac{\partial J}{\partial b}$

  • If $a > b$

    • Locally,* $m = a$. So $\frac{\partial J}{\partial a} = \frac{\partial J}{\partial m}$

    • Locally,* $m$ does not depend on $b$. So $\frac{\partial J}{\partial b} = 0$

  • Therefore $\frac{\partial J}{\partial a} = \frac{\partial J}{\partial m}$ if $a > b$, else $\frac{\partial J}{\partial a} = 0$**

  • and $\frac{\partial J}{\partial b} = \frac{\partial J}{\partial m}$ if $b > a$, else $\frac{\partial J}{\partial b} = 0$

When back propagation goes across a max pooling layer, the gradient is processed per example and assigned only to the input from the previous layer that was the maximum. Other inputs get zero gradient. When this is batched, it is no different; it is just processed per example, maybe in parallel. Across a whole batch this can mean that more than one, maybe all, of the input activations to the max pool get some share of the gradient - each from a different subset of examples in the batch.
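
Continuing the toy NumPy sketch from above (the upstream gradients $\frac{\partial J}{\partial m}$ here are made-up values, purely for illustration), the backward pass routes each example's gradient to whichever input won that example's forward pass, and only then is anything aggregated over the batch:

```python
import numpy as np

np.random.seed(0)
batch = np.random.rand(4, 4)        # same toy batch as in the forward sketch
max_idx = np.argmax(batch, axis=1)  # the indices recorded on the forward pass

# Pretend upstream backprop already gave us dJ/dm for each example's pooled
# output (made-up numbers).
dJ_dm = np.array([0.5, -1.0, 2.0, 0.25])

# Per-example backward pass: the winning input receives the whole gradient,
# every other input in the window receives exactly zero.
dJ_dinput = np.zeros_like(batch)
dJ_dinput[np.arange(4), max_idx] = dJ_dm
print(dJ_dinput)

# Only now do we aggregate over the batch; several input positions can end up
# with non-zero gradient, each fed by a different subset of examples.
print(dJ_dinput.mean(axis=0))
```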


* Locally -> when making only infinitesimal changes to $m$.

** Technically, if $a=b$ exactly then we have a discontinuity, but in practice we can ignore that without issues when training a neural network.

Neil Slater

I had the same question, but I think I figured it out by reviewing the source code of Caffe.

Please see the Caffe source code: lines 620 & 631 of this code.

It calculates the derivative of each parameter by adding up the derivative (of this parameter) for each input, then dividing by the batch size.

Also, see line 137 of this code: it simply scales the derivative by 1/iter_size, which is the same as taking the average.

We can see there is NO special treatment for the Max Pooling layer when doing back propagation.
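
As a sketch of what those lines amount to (a NumPy paraphrase with hypothetical numbers, not Caffe's actual C++):

```python
import numpy as np

# Hypothetical per-example gradients dE_i/dw for a single weight, one entry
# per sample in the batch (made-up values, just to show the arithmetic).
per_example_grads = np.array([0.2, -0.1, 0.4, 0.3])
batch_size = len(per_example_grads)

# Accumulate the per-example derivatives, then scale by 1 / batch_size,
# which is the same as taking their average.
grad = per_example_grads.sum() / batch_size
print(grad)  # the averaged gradient (0.2, up to floating-point rounding)
```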

As for the derivative of Max Pooling, let's see the source code of Caffe again:

line 272 of this code. Obviously, only the largest element's derivative is 1 * top_diff; every other element's derivative is 0 * top_diff.

Shaotao Li