DISCLAIMER: I have absolutely no background in machine learning/data science and am unfamiliar with the general lingo, so please bear with me.
I'm trying to make a machine learning application with Python to extract invoice information (invoice number, vendor information, total amount, date, tax, etc.). Right now, I'm using the Microsoft Vision API to extract the text from a given invoice image and organizing the response into a top-down, line-by-line text document, in hopes that it might increase the accuracy of my eventual machine learning model. My current approach relies strictly on string parsing, and this works pretty well for the invoice number, date, and total amount.
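To show the kind of string parsing I mean, here is a minimal sketch. It is not my actual script: the sample text, the field names, and the regular expressions are all made up for illustration, and real invoices vary far more than this.

```python
import re

# Hypothetical OCR output, already arranged top-down, line by line.
SAMPLE_TEXT = """ACME Supplies Ltd.
123 Main Street
Toronto, ON M5V 2T6
Invoice Number: INV-10423
Date: 2019-03-14
Subtotal: 100.00
Tax: 13.00
Total: 113.00
"""

# Illustrative patterns only (assumptions, not a general solution).
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:number|no\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bdate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # Anchored to the start of a line so "Subtotal" is not picked up.
    "total": re.compile(
        r"^total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE | re.MULTILINE
    ),
}


def parse_invoice_text(text):
    """Return the first match for each field, or None if nothing matched."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    print(parse_invoice_text(SAMPLE_TEXT))
```

This works as long as a field sits next to a recognizable keyword, which is exactly why it breaks down for fields that have no such keyword.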
However, when it comes to vendor information (name, address, city, province, etc.), string parsing hardly ever works. My script relies on detecting an address first (because addresses have a pretty distinct format compared to the rest of the information on an invoice), and that detection rarely succeeds. Tax information is also difficult to parse because of the number of numeric values that appear on an invoice.
So (and here's where I get lost) I envision a machine learning model that takes a single invoice image from a user as input and outputs the extracted invoice information. I am currently looking into Azure Machine Learning Studio because the plug-and-play aspect appeals to me and it seems easy enough to use and experiment with. BUT I have no clue what the requirements are for an initial dataset! Should I just fill my dataset (in CSV format, btw) with the necessary information (invoice number, total amount, date, ...) from a bunch of sample invoices? If not, what other information should I include? I was thinking of x-y coordinate pairs of where the important information occurs on the image.
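To make the dataset question concrete, this is roughly the CSV layout I have in mind: one row per piece of OCR text, with its position on the image and a hand-assigned label. The column names, the label values, and the sample rows are all hypothetical; I don't know whether this is the right shape for a training set.

```python
import csv
import io

# Hypothetical columns: which invoice the text came from, the text itself,
# its bounding box on the image (in pixels), and a manually assigned label.
FIELDNAMES = ["invoice_id", "text", "x", "y", "width", "height", "label"]

# Made-up example rows for a single invoice.
ROWS = [
    {"invoice_id": 1, "text": "ACME Supplies Ltd.", "x": 40, "y": 30,
     "width": 220, "height": 24, "label": "vendor_name"},
    {"invoice_id": 1, "text": "123 Main Street", "x": 40, "y": 58,
     "width": 160, "height": 18, "label": "vendor_address"},
    {"invoice_id": 1, "text": "INV-10423", "x": 480, "y": 30,
     "width": 90, "height": 18, "label": "invoice_number"},
    {"invoice_id": 1, "text": "13.00", "x": 500, "y": 640,
     "width": 50, "height": 18, "label": "tax"},
    {"invoice_id": 1, "text": "113.00", "x": 500, "y": 664,
     "width": 60, "height": 18, "label": "total"},
    {"invoice_id": 1, "text": "Thank you for your business", "x": 40, "y": 720,
     "width": 260, "height": 18, "label": "other"},
]


def rows_to_csv(rows):
    """Serialize the labeled rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(ROWS))
```

The idea is that the same position might mean different things on different invoice layouts, so I would need rows from many invoices, not just one.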
One last question related to this problem scope: which type of algorithm (regression, classification, clustering) could even "extract" information from the input text, or at least help with it? As far as I know, regression predicts numeric values (e.g. 2 + x = 10, so 8 + x could be 16) and classification "labels" things (e.g. this image contains a tree, this one does not). I'm not too familiar with clustering, although I think it could be useful for identifying the structure of the input text.
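My rough guess at how classification might apply is to label each line of OCR text (vendor name, address, tax, total, other), which would mean turning every line into numbers first. Here is a sketch of that step only. The function name and the specific features are my own assumptions, and I have no idea yet whether they are the ones a real model would need.

```python
import re


def line_features(text, line_index, total_lines):
    """Turn one OCR line into a small dict of numeric/boolean features.

    All features here are guesses at what might help distinguish
    invoice fields; none of them come from an established recipe.
    """
    stripped = text.strip()
    length = len(stripped)
    digits = sum(ch.isdigit() for ch in stripped)
    return {
        # 0.0 for the first line of the document, 1.0 for the last.
        "relative_position": line_index / max(total_lines - 1, 1),
        # Share of characters that are digits.
        "digit_ratio": digits / length if length else 0.0,
        # Something shaped like 113.00 appears in the line.
        "has_money_amount": bool(re.search(r"\d+\.\d{2}\b", stripped)),
        # A tax-related keyword appears in the line.
        "has_tax_keyword": bool(
            re.search(r"\b(tax|gst|hst|vat)\b", stripped, re.IGNORECASE)
        ),
        # Canadian-style postal code such as "M5V 2T6".
        "looks_like_postal_code": bool(
            re.search(r"\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b", stripped)
        ),
    }


if __name__ == "__main__":
    lines = ["ACME Supplies Ltd.", "Toronto, ON M5V 2T6", "Tax: 13.00"]
    for index, line in enumerate(lines):
        print(line, line_features(line, index, len(lines)))
```

Each feature dict, paired with a hand-assigned label, would then be one training example for whatever classifier is appropriate.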
To summarize: what might be some features of an invoice that I can fill an initial dataset with to initialize a model? How could a clustering algorithm be used to identify the structure of an invoice? How could I utilize a regression/classification algorithm in this problem, and what inputs would these algorithms need to deliver meaningful results?
Any help/guidance/feedback would be appreciated to the moon and back. Sorry for my lack of knowledge, but this field is very interesting and I need some help wrapping my head around all of this. Thanks :)