
There are a million and one examples and tutorials on how to train a neural network on sample sets like the MNIST and CIFAR-10 data. But how does one go from the toy examples of recognising 200x200 clips, each containing a single centred object, to a real problem like finding CIFAR-10 category objects (the dog and the cat below) within a larger picture, as I presume Google does for its photo annotation?

[Image: example photos of a dog and a cat]

Can someone describe how one might approach this leap from the classroom to the real world?

Ken Y-N

1 Answer


This is a well-defined problem, known as text spotting in the text-recognition setting. There are numerous avenues to tackle it, but most of the good ones are based on deep learning. The naive approach is to take a network like the one you trained on MNIST and slide it over your input image, noting where it fires strongly. This works reasonably well, but convolving the classifier over the whole input image is computationally very expensive. What is actually used in practice is a two-step process: first, a network trained to localize regions of interest proposes bounding boxes around the parts of the image that might contain the target, and then a more advanced network classifies the contents of each box. To my understanding this is nowadays done in a single network pass, as opposed to querying individual patches. If you search for text spotting you will find a lot of nice papers and a thesis, mostly from the same author.

http://www.mathstat.dal.ca/~hgu/Neural%20Comput%20&%20Applic.pdf
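To make the naive sliding-window idea above concrete, here is a minimal sketch. The classifier is a toy stand-in (a real system would call a trained CNN on each patch); the window size, stride, and threshold are illustrative assumptions, not values from any particular paper.

```python
import numpy as np

def sliding_window_detect(image, classifier, win=28, stride=7, threshold=0.8):
    """Slide a fixed-size window over the image, score every patch with
    the classifier, and keep the positions that score above threshold."""
    hits = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            patch = image[y:y + win, x:x + win]
            label, score = classifier(patch)
            if score >= threshold:
                hits.append((x, y, win, win, label, score))
    return hits

def toy_classifier(patch):
    """Stand-in for a trained network: 'detects' bright patches by
    using the mean intensity as a confidence score."""
    score = patch.mean()
    return ("object" if score > 0.5 else "background", score)

# Synthetic test image: a single bright 28x28 "object" on a dark background.
image = np.zeros((100, 100))
image[30:58, 40:68] = 1.0

detections = sliding_window_detect(image, toy_classifier)
```

Note that even this tiny example evaluates the classifier at every stride position, which is exactly the cost the answer warns about; the two-step (region proposal + classification) approach exists to avoid scoring thousands of mostly-empty patches.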

Jan van der Vegt