
In the Fast R-CNN paper (https://arxiv.org/abs/1504.08083) by Ross Girshick, the bounding-box parameters are continuous variables. These values are predicted using a regression method. Unlike typical neural-network outputs, they do not represent probabilities of output classes; rather, they are physical values representing the position and size of a bounding box.

Exactly how this regression learning happens is not clear to me. Linear regression and image classification with deep learning are each well explained on their own, but how a regression algorithm works in a CNN setting is not explained so clearly.

Can you explain the basic concept for easy understanding?

David Masip
Saptarshi Roy

2 Answers


A very clear and in-depth explanation is provided in the original ("slow") R-CNN paper by Girshick et al., on page 12, Appendix C: Bounding-box regression. I simply paste it here for quick reading:

[Screenshots of Appendix C (Bounding-box regression) from the R-CNN paper]
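For reference, that appendix regresses from a proposal box $P$ to a ground-truth box $G$ using scale-invariant translation targets for the center and log-space targets for the size. A minimal NumPy sketch (the function name is mine, boxes given as center-x, center-y, width, height):

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets from proposal P to ground truth G,
    as in Appendix C of the R-CNN paper."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    tx = (Gx - Px) / Pw          # center shift, normalized by proposal width
    ty = (Gy - Py) / Ph          # center shift, normalized by proposal height
    tw = np.log(Gw / Pw)         # log-space width scaling
    th = np.log(Gh / Ph)         # log-space height scaling
    return tx, ty, tw, th

# A proposal that already matches the ground truth has all-zero targets:
print(bbox_targets((10, 10, 4, 4), (10, 10, 4, 4)))  # (0.0, 0.0, 0.0, 0.0)
```

Because the targets are normalized by the proposal's size, the same network can correct boxes at any scale.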

Moreover, the author took inspiration from an earlier paper, and the difference between the two techniques is discussed below:

[Screenshot of the passage comparing the two techniques]

Later, in the Fast R-CNN paper you referenced, the author changed the loss function for the bounding-box regression task from regularized least squares (ridge regression) to smooth L1, which is less sensitive to outliers. He also embedded this smooth L1 loss in a multi-task loss function so that classification and bounding-box regression are trained jointly, which was not done in R-CNN or SPP-net!
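The smooth L1 loss is quadratic for small residuals and linear for large ones, which is exactly what makes it robust to outliers. A quick sketch of the definition from the Fast R-CNN paper:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 from the Fast R-CNN paper:
       0.5 * x**2   if |x| < 1
       |x| - 0.5    otherwise
    """
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax**2, ax - 0.5)

# For a large residual the penalty grows linearly, not quadratically:
print(smooth_l1(10.0))   # 9.5, versus 50.0 for the squared loss 0.5 * x**2
```

An L2 loss would let a single badly-matched box dominate the gradient; smooth L1 caps that influence at a constant slope.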

[Screenshot of the multi-task loss from the Fast R-CNN paper]

However, the same author changed the loss function again in the subsequent Faster R-CNN paper, and later in FCN. Many a time, in order to learn about a topic, you need to backtrack through research papers! :) Hope it helps!

Anu

The paper you cite does not mention linear regression at all. What it does is use a neural network to predict continuous variables, and it refers to that as regression.

The regression it defines (which is not linear at all) is just a CNN with convolutional layers and fully connected layers, but the last fully connected layer does not apply a sigmoid or softmax, which is what is typically used in classification, where the values correspond to probabilities. Instead, this CNN outputs four values $(r, c, h, w)$, where $(r, c)$ specify the position of the top-left corner and $(h, w)$ the height and width of the window. To train this NN, the loss function penalizes outputs that are very different from the labelled $(r, c, h, w)$ in the training set.
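As a sketch of that last point (all weights, sizes, and names here are made up for illustration): the classification head squashes its scores into probabilities with a softmax, while the regression head emits raw, unbounded values that the loss compares directly to the labelled box:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vector from the last shared fully connected layer
features = rng.normal(size=8)

# Classification head: softmax turns scores into class probabilities
W_cls = rng.normal(size=(3, 8))
scores = W_cls @ features
probs = np.exp(scores - scores.max())
probs /= probs.sum()                 # sums to 1, like any probability vector

# Regression head: a plain linear layer, no squashing -- outputs (r, c, h, w)
W_box = rng.normal(size=(4, 8))
box = W_box @ features               # continuous, unbounded values

# Training penalizes the distance to the labelled box, e.g. squared error
target = np.array([12.0, 30.0, 64.0, 48.0])
loss = np.mean((box - target) ** 2)
```

The only structural difference between the two heads is the final nonlinearity (or lack of one) and the loss applied to the output.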

David Masip