8

The objective is to build a model that is capable of identifying information on receipts and invoices that can look completely different.

I've had a discussion with my brother about the right approach. I have attached an example, here the original and below is the important information in boxes:

Original example receipt

Important information

The green boxes are the must-have information. The one in purple and green indicates that we need either or. The orange information would be a nice-to-have, but not necessarily required. Some of the boxes have context and inter linkage.

From a data set point of view, we have a sample size of 1,000 receipts and all of them have the necessary information extracted. We could increase the sample size further if that was required.

The approach that I would have chosen:
Treat everyone of the images of the receipts like a game and let the model figure out itself how to arrive at the right conclusion. This will most likely be very computing intensive but I feel like it will be more robust when dealing with new image types.

The approach my brother has suggested:
Basically using the boxes that I've provided and let me model learn from that. The model would then learn to identify the important areas on a receipt or invoice and would go from there. He compared the model to one that would identify license plates.

Edit: Just to reiterate why I think OCR is only part of the solution but not the solution itself. Here is the Adobe Acrobat OCR result: OCR result

Perfect if you ask me. It just doesn't help me figure out what values to use and which to ignore. I don't want to do this manually. I want the model(s) to return for me:

  • Total amount
  • Sales Tax & Amount
  • Creditor (i.e. the company and ideally the tax identifier CHE-xxxxx MWST
  • Date and ideally time
  • Payment method

Does this make more sense now? I just don't see how OCR gets me there. It will only be the method to extract the values.

Spurious
  • 181
  • 1
  • 3

1 Answers1

1

You should start by trying OCR. I am not sure why you are rejecting it without trying it. Start with Tesseract (free but not very good) and then try a commercial OCR as well (Abbyy is well regarded, but you could also try Adobe Acrobat). Also, research document structure extraction.

That idea that you are going to re-invent a machine learning model that does OCR better than existing OCR solutions seems ... not very realistic. Current OCR tools have years or decades of engineering put into them. There is no way you are going to be able to afford to put that much effort into your own custom tool.

Image quality on those example images is not wonderful, and you might have to accept that accuracy will be less than 100%.

D.W.
  • 3,651
  • 18
  • 43