I'm trying to build a model that is capable of identifying information on receipts and invoices.

I have used the Google Cloud Vision API for text extraction from the receipt, but the problem is that it just returns all the text from a receipt. I am looking to build a model that returns only certain fields from a receipt, such as the total price, any alcohol bought, etc.

I could parse the text and extract the fields with hard-coded rules, but I don't think that's optimal. Is there any way to build something for this use case? I just need something to go on.

Comment --- Output:

'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg@bluewin.ch\n'
user_12

1 Answer

The simplest pipeline would be to do the following:

  1. OCR
  2. Named Entity Extraction
  3. Entity Disambiguation

OCR

This is basically transforming your receipts into plain text.

If you have scans (pictures) of the receipts, then you need a method that can deal with images. For example, you could use Tesseract, an open-source OCR engine.
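As a rough sketch of this step, assuming the `pytesseract` wrapper and Pillow are installed along with the Tesseract binary and its German language data (`lang="deu"` is my assumption, chosen because the sample receipt is in German), it could look something like this. The helper names `ocr_receipt` and `clean_lines` are made up for illustration.

```python
def ocr_receipt(image_path, lang="deu"):
    """Run Tesseract on a receipt image and return the raw text."""
    # Imported lazily so the rest of the module works without OCR installed.
    # Assumes: pip install pytesseract pillow, plus the Tesseract binary.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def clean_lines(raw_text):
    """Split OCR output into stripped, non-empty lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


# Hypothetical usage (the file name is a placeholder):
# lines = clean_lines(ocr_receipt("receipt.jpg"))
```

Working with a list of cleaned lines instead of one long string tends to make the later steps easier, since receipts are mostly line-oriented.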

Named Entity Extraction

This is detecting the parts of interest in the text. This includes detecting dates, prices, currencies, locations, names, and so on. Once you detect a named entity, for example price=2.59, you still don't know whether it is the unit cost, the total cost after tax, or the total cost before tax, but that will come in the next step. All you know is that this is an area of interest.

There are tools that are able to detect most named entities that you might want to use. For example: spaCy. The cool thing about spaCy is that you can train it to recognize more named entities that you might want.
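Before training anything, a plain regex baseline can already spot the most regular entities. To be clear, this is a stand-in for the NER step and not spaCy itself. The patterns below are assumptions based on the sample receipt in the question (dates like `30.07.2007`, amounts with two decimals), and `find_entities` is a made-up helper name.

```python
import re

# Dates in DD.MM.YYYY form, as in the sample receipt.
DATE = re.compile(r"(?<!\d)\d{2}\.\d{2}\.\d{4}(?!\d)")
# Amounts with exactly two decimals; the lookarounds keep
# fragments of a date from matching as an amount.
AMOUNT = re.compile(r"(?<![\d.])\d+\.\d{2}(?![\d.])")


def find_entities(text):
    """Return candidate entities grouped by label, in order of appearance."""
    return {
        "DATE": DATE.findall(text),
        "AMOUNT": AMOUNT.findall(text),
    }


sample = (
    "30.07.2007/13:29:17\n"
    "2xLatte Macchiato à 4.50 CHF\n"
    "1xSchweinschnitzel à 22.00 CHF\n"
    "Total:\nCHF\n54.50\n"
    "Incl. 7.6% MwSt\n"
)
entities = find_entities(sample)
```

At this point every amount is just an `AMOUNT`. Nothing here says which one is the total, which is exactly the gap the next step fills.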

Entity Disambiguation

This is identifying the purpose of the named entity that you extracted. For example, given price=2.59, is it the unit cost, the total cost after tax, or the total cost before tax?

This is where rules might come in handy, because it will be difficult to teach a machine learning model to identify them. You can have rules like:

  1. highest value = total cost after tax
  2. quantity × unit price (summed over all items) = total cost before tax
  3. total cost before tax × (1 + tax rate) = total cost after tax, e.g. × 1.21 for a 21% tax rate
  4. and so on...
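
A minimal sketch of how such rules might be coded, assuming the amounts and line items have already been extracted in the previous step. The function name `disambiguate`, the 21% default rate, and the example numbers are all made up for illustration.

```python
from decimal import Decimal


def disambiguate(amounts, line_items, tax_rate=Decimal("0.21")):
    """Label amounts using the three rules listed above.

    amounts:    list of Decimal values found on the receipt
    line_items: list of (quantity, unit_price) pairs
    tax_rate:   assumed tax rate; 0.21 is only an example
    """
    # Rule 1: the highest value is the total cost after tax.
    total_after = max(amounts)
    # Rule 2: the sum of quantity × unit price is the total before tax.
    total_before = sum((q * p for q, p in line_items), Decimal("0"))
    # Rule 3: the two totals should agree once tax is applied.
    expected = (total_before * (1 + tax_rate)).quantize(Decimal("0.01"))
    return {
        "total_after_tax": total_after,
        "total_before_tax": total_before,
        "consistent": expected == total_after,
    }


# Invented example: 2 items at 1.00 and 1 item at 3.00, taxed at 21%.
items = [(2, Decimal("1.00")), (1, Decimal("3.00"))]
found = [Decimal("1.00"), Decimal("3.00"), Decimal("5.00"), Decimal("6.05")]
result = disambiguate(found, items)
```

`Decimal` is used instead of `float` so that comparisons between money values are exact. The `consistent` flag is useful as a sanity check: if it is false, either the OCR misread a number or the receipt doesn't follow the assumed rule (for instance, prices that already include tax, as on the receipt in the question).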
Bruno Lubascher