Using Python and machine learning to extract information from an invoice? Initial dataset?

Question

DISCLAIMER: I have absolutely no background with machine learning/data science, and am unfamiliar with the general lingo of data science, so please bear with me.

I'm trying to make a machine learning application with Python to extract invoice information (invoice number, vendor information, total amount, date, tax, etc.). Right now I'm using the Microsoft Vision API to extract the text from a given invoice image and organizing the response into a top-down, line-by-line text document, in the hope that this will increase the accuracy of my eventual machine learning model. My current approach relies strictly on string parsing, and it works pretty well for invoice number, date, and total amount.
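For readers following along, the kind of string parsing described above might look like the minimal sketch below. The sample text, the label wording ("Invoice No", "Date", "Total") and the date format are all assumptions made up for illustration, not the asker's actual script or real Vision API output.

```python
import re

# Hypothetical OCR output, one text line per row (made up for illustration).
SAMPLE_TEXT = """ACME Supplies Ltd.
Invoice No: INV-2018-0042
Date: 2018-06-15
Subtotal 100.00
Tax 13.00
Total: $113.00"""

# Assumed label formats; real invoices vary widely.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:Number|No|#)\s*[:.]?\s*(\S+)", re.I),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    # Anchored to the start of a line so that "Subtotal" is not picked up.
    "total": re.compile(r"^Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I | re.M),
}


def parse_invoice(text):
    """Return the first match per field, or None when a pattern does not match."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


print(parse_invoice(SAMPLE_TEXT))
```

Fields with a fixed label in front of them are easy to catch this way, which is consistent with the observation that vendor details and tax are the hard part.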

However, when it comes to vendor information (name, address, city, province, etc.), string parsing hardly ever works, because my script relies on detecting an address (addresses have a pretty distinct format compared to the rest of the information on an invoice). Tax information is difficult to parse because of the number of numeric values that appear on an invoice.

So (and here's where I get lost) I envision a machine learning model that takes a single invoice image from a user as input, and its output will be the extracted invoice information. I am currently looking into Azure Machine Learning Studio because the plug 'n play aspect of it appeals to me and it seems easy enough to use and experiment with. BUT I have no clue what the requirements are for an initial dataset! Should I just fill my dataset (in a CSV format btw) with the necessary information (invoice number, total amount, date, ...) from a bunch of sample invoices? If not, what other information should I include in my dataset? I was thinking x-y coordinate pairs of where the important information occurs on the image.

One last question related to this problem scope: which kind of algorithm (regression, classification, clustering) could even "extract" (or help extract) information from the input text? As far as I know, regression predicts numeric values (e.g. 2 + x = 10, so 8 + x could be 16) and classification "labels" things (e.g. this image contains a tree, this one does not). I'm not too familiar with clustering, although I think it could be useful to identify the structure of the input text.

To summarize: what might be some features of an invoice that I can fill an initial dataset with to initialize a model? How could a clustering algorithm be used to identify the structure of an invoice? How could I utilize a regression/classification algorithm in this problem, and what inputs would these algorithms need to deliver meaningful results?

Any help/guidance/feedback would be appreciated to the moon and back. Sorry for my lack of knowledge, but this field is very interesting and I need some help wrapping my head around all of this. Thanks :)

score 2 · Answer 1 · answered Jun 15 '18 at 19:30

I'm more of a beginner as well, but wanted to possibly help guide you towards next steps based on some of my experiences.

First, what you've described sounds a little like what a traditional EDI system does in supply chain and procurement processing, where it receives a document and enters the information into transaction-based operational systems to create records. I'm not entirely sure how those work, or whether they can only extract data from standardized document formats, but that's something you could look into with a few searches.

Now, as far as creating a CSV data set goes, that is a great idea, both for training your model on a set of invoices and for testing its accuracy. Essentially you'd be using a supervised machine learning strategy, where the system learns from a training data set for which the correct answers are known. By comparing the model's results to the data set, you can tell whether the algorithm is retrieving the information correctly. I'd recommend that you include the information the application should be getting from each invoice, and compare that to what the machine actually retrieved. That way you effectively know the error of the model. (x, y) coordinates would work well for preliminary attempts to get a better feel for location-based assessment of the invoices, so that's a pretty good idea in my opinion as well.
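As a rough illustration of comparing what the machine retrieved against the known answers, here is a minimal sketch using only the standard library. The CSV columns, file names and predicted values are hypothetical; the idea is just a per-field exact-match accuracy.

```python
import csv
import io

# Hypothetical ground-truth labels, typed in by hand for two sample invoices.
GROUND_TRUTH_CSV = """invoice_id,invoice_number,date,total
a.png,INV-001,2018-06-01,113.00
b.png,INV-002,2018-06-02,50.00
"""

# Hypothetical extractor output for the same invoices (one total is wrong).
PREDICTIONS = {
    "a.png": {"invoice_number": "INV-001", "date": "2018-06-01", "total": "113.00"},
    "b.png": {"invoice_number": "INV-002", "date": "2018-06-02", "total": "5.00"},
}


def field_accuracy(truth_csv, predictions):
    """Return the fraction of invoices for which each field was extracted exactly."""
    rows = list(csv.DictReader(io.StringIO(truth_csv)))
    fields = [name for name in rows[0] if name != "invoice_id"]
    correct = {field: 0 for field in fields}
    for row in rows:
        predicted = predictions.get(row["invoice_id"], {})
        for field in fields:
            if predicted.get(field) == row[field]:
                correct[field] += 1
    return {field: correct[field] / len(rows) for field in fields}


print(field_accuracy(GROUND_TRUTH_CSV, PREDICTIONS))
```

Keeping the score per field rather than as one overall number makes it easier to see which fields (for example tax or vendor address) need the most work.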

Regarding algorithm types, the answer will almost always be "it depends". Many times, I'll try a couple and compare the error of each model to find the one that works best for what I'm predicting. That said, clustering can be useful for continuous, numerical data and for "grouping" items based on the location of their coordinates. I think clustering could be useful if you're breaking down the coordinates of the information on an image, but it wouldn't be of much use for gathering and extracting the text from the invoice. Without going too deep, K-Means would probably not be a great fit, because the initial locations of the cluster centroids are randomized and you must also declare the number of clusters up front, which would be impractical for the purpose at hand.
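To make the "grouping by coordinates" idea concrete without having to pick a cluster count in advance, here is a minimal sketch of a simple gap-based grouping. Note that this is a swapped-in heuristic, not K-Means; the `(text, x, y)` word format and the `max_gap` threshold are assumptions for illustration.

```python
def group_by_gap(words, max_gap=10):
    """Group (text, x, y) tuples into rows.

    Words are sorted by y; a new row starts whenever y jumps by more than
    max_gap pixels from the previous word. No cluster count is needed.
    """
    rows = []
    previous_y = None
    for text, x, y in sorted(words, key=lambda word: word[2]):
        if previous_y is None or y - previous_y > max_gap:
            rows.append([])
        rows[-1].append((text, x, y))
        previous_y = y
    # Read each row left to right.
    return [[text for text, _, _ in sorted(row, key=lambda word: word[1])] for row in rows]


# Hypothetical OCR words with the position of each word on the page.
WORDS = [
    ("Total", 300, 502), ("ACME", 20, 30), ("$113.00", 420, 500),
    ("Supplies", 80, 32), ("Invoice", 20, 100), ("INV-001", 110, 101),
]

print(group_by_gap(WORDS))
```

This only recovers the rough line structure of the page; deciding which line holds which field is still a separate problem.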

I'd recommend looking into the different types of neural networks (NNs). These are used extensively to take images of some kind and break them down to best estimate the contents of an unknown input. It's the type of algorithm Google uses to do image searches and things along those lines. It's also the type of software that tells an autonomous vehicle that the sign ahead is a stop sign. There are different types of NNs out there, and it can get overwhelming pretty quickly, so I recommend looking into basic videos that explain them. If you do decide to go this route, Google Cloud has a pretty solid platform with their google.vision and other machine-learning-in-a-box libraries that are really easy to import and play with.

As far as inputs go, define the information that you want extracted, and record that data for testing and training as I mentioned above. Then play around with a few methods in scikit-learn and TensorFlow!

Unfortunately, I can't give you a straight-forward answer on the many trade-offs with different algorithms or inputs, but I hope that this gave you a few ideas about next steps or provided you with a better understanding of how you can begin the project. Good luck!

score 1 · Answer 2 · edited Jun 30 '18 at 19:19

Invoice processing has evolved over time and differs from place to place. In big corporations, the finance department may require vendors to put the purchase order number on the invoice. The purchase order number might have a fixed format, e.g. PO-0001. So when extracting from the invoice, all the software has to do is find that pattern. Once you have the PO number, a database lookup will return all the information you need.
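A minimal sketch of that pattern-then-lookup idea, assuming the PO-0001 style format mentioned above (the four-digit width is an assumption). The in-memory dictionary is a hypothetical stand-in for the real purchase order database.

```python
import re

# Assumed format: "PO-" followed by exactly four digits, as in PO-0001.
PO_PATTERN = re.compile(r"\bPO-\d{4}\b")

# Hypothetical stand-in for the purchase order database.
PURCHASE_ORDERS = {
    "PO-0001": {"vendor": "ACME Supplies Ltd.", "total": "113.00"},
}


def lookup_purchase_order(text, orders=PURCHASE_ORDERS):
    """Find the first PO number in the OCR text and return its record, or None."""
    match = PO_PATTERN.search(text)
    if match is None:
        return None
    return orders.get(match.group(0))


print(lookup_purchase_order("Ref: PO-0001, payable within 30 days"))
```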

As the next phase, if you have a limited set of vendors, you can define a template for each vendor's invoice. For example, the upper-left corner holds the vendor name. The template helps you pinpoint the information you need to extract. When an unknown invoice arrives, a human operator has to create a template for it. Some commercial applications fall into this category. If you have human operators available, that can be an option.
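One way such a per-vendor template could be represented is as a set of named rectangles on the page, with each field built from the OCR words that fall inside its rectangle. This is only a sketch: the vendor key, the pixel coordinates and the `(text, x, y)` word format are all hypothetical.

```python
# Hypothetical template: field name -> (left, top, right, bottom) in pixels.
TEMPLATES = {
    "acme": {
        "vendor_name": (0, 0, 300, 60),     # upper-left corner of the page
        "total": (350, 480, 600, 520),
    },
}


def extract_with_template(words, template):
    """Join, left to right, the words whose position lies inside each field's box."""
    fields = {}
    for field, (left, top, right, bottom) in template.items():
        inside = [
            (x, text)
            for text, x, y in words
            if left <= x <= right and top <= y <= bottom
        ]
        fields[field] = " ".join(text for _, text in sorted(inside))
    return fields


# Hypothetical OCR words as (text, x, y).
WORDS = [("ACME", 20, 30), ("Supplies", 80, 32), ("Total", 300, 502), ("$113.00", 420, 500)]

print(extract_with_template(WORDS, TEMPLATES["acme"]))
```

The catch is the one noted above: every new vendor layout needs someone to draw its boxes first.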

Similarly, using machine learning, you can generate clusters based on the layout of the training invoices. When a new invoice arrives, you can try to match it to the corresponding cluster / template. This approach is limited when it comes to handling unknown invoices.

There are some machine learning based software as a service (SaaS) providers out there, such as mlreader. These are more sophisticated than the approaches above. If you don't have to build it in-house, you can give one a try. It could be more cost effective.

score 0 · Answer 3 · answered Jun 30 '18 at 09:54

You're taking on a really hard problem. Maybe try to break it down into easier sub-problems like:

1. Use a solution like MS Vision, or an equivalent from Google or AWS, to extract text/numeric data from the invoice images.
2. Figure out some categories you can split your invoices into (maybe by country and/or company type?).
3. Train a classifier (whatever trainable image classifier you find easiest to configure and use; there's probably an easy-to-use one in the Azure/AWS/Google machine learning APIs) to classify the invoices using their images, not the previously extracted text data.
4. Then have a "final model" that can be tuned to account for what differs between "invoice classes". It could even contain hand-coded features and heuristics specific to e.g. "invoices from country X", and it takes as input the text/numeric data extracted in the first step plus the invoice class assigned by the image classifier.

The intuition is that there are probably at most N invoicing software providers for each geographic area, and each of those has at most M output configurations, so there should be a bunch of heuristic shortcuts you can exploit here and incorporate into your step 4 model.
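Purely as an illustration of how the classification step and the class-specific final step could be wired together, here is a minimal sketch. Everything in it is a placeholder: the keyword rule stands in for the trained image classifier, the class names are invented, and the regexes are example heuristics, not a recommendation for what the step 4 model should be.

```python
import re


def classify_invoice(text):
    """Placeholder for the classifier; a real one would work on the image."""
    return "canada" if re.search(r"\b(?:GST|HST)\b", text) else "generic"


def extract_canada(text):
    """Hand-coded heuristic for the hypothetical 'canada' class: tax follows GST/HST."""
    match = re.search(r"\b(?:GST|HST)\b[^\d\n]*([\d,]+\.\d{2})", text)
    return {"tax": match.group(1) if match else None}


def extract_generic(text):
    """Fallback heuristic: tax follows the word 'Tax' on the same line."""
    match = re.search(r"\bTax\b[^\d\n]*([\d,]+\.\d{2})", text, re.I)
    return {"tax": match.group(1) if match else None}


# One extractor per invoice class; more classes can be added over time.
EXTRACTORS = {"canada": extract_canada, "generic": extract_generic}


def process(text):
    """Classify the invoice, then run the extractor registered for its class."""
    invoice_class = classify_invoice(text)
    return {"class": invoice_class, **EXTRACTORS[invoice_class](text)}


print(process("Subtotal 100.00\nHST 13.00\nTotal 113.00"))
```

The dispatch table keeps the per-class heuristics separate, so a badly handled class can be improved without touching the others.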

But what that step 4 model should be I cannot tell, so this is just a bunch of hints, not a real answer, I know...

And about:

I was thinking x-y coordinate pairs of where the important information occurs on the image.

I can just tell you this approach is probably bad; I've never heard of anyone getting anywhere with an "extracted data + (x, y) of where it was extracted from the image" approach. You would be taking on yourself the full task of going from [image data] to [extracted numeric features], which would be a gargantuan, practically impossible task (you'd basically need to reinvent a version of what Azure ML does, only specifically optimized for your task of invoice data extraction).
