3

My question is primarily: is there any ML research paper on splitting a PDF that contains a batch of scanned documents (e.g., bank statements) into individual documents?

I have searched for this, but I have not found any relevant research paper, or any application in general, mentioned on the Internet.

I would be primarily interested in the feature engineering used in such papers/applications, but also in the overall approach.

Valentin Calomme
Outcast

2 Answers

2

"Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction" by Esposito, Ferilli, Basile, and Mauro goes into detail about how to create a custom system for parsing digital documents, including PDFs. It proposes a generalized process for learning any structure within documents.

Brian Spiering
0

Having worked a lot with PDF-like documents such as bank statements, I would say the three major conferences/workshops to look at are ICDAR, DocEng, and the Document Intelligence workshop at NeurIPS.

If the paper you are looking for exists, it is very likely to be in one of these three, as they are probably the biggest venues for document research. I have been going through them for a few months now and I can't find any mention of what you are looking for.

A simple model that classifies each page as an ending page (or not), based on the text on that page, might work, but there is no guarantee.
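As a rough illustration of that idea (not taken from any paper, and not a tested solution), here is a minimal sketch in pure Python. It assumes the text of each page has already been extracted by OCR, and that a few pages have been hand-labelled as "last page of a document" or not. The names `EndPageClassifier` and `split_batch`, and all of the toy page texts, are made up for the example; the classifier is a small multinomial Naive Bayes over word counts, chosen only because it is simple.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the page text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class EndPageClassifier:
    """Tiny multinomial Naive Bayes; label 1 means 'last page of a document'."""

    def fit(self, pages, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.n_pages = Counter(labels)
        for text, label in zip(pages, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def _log_score(self, tokens, label):
        total = sum(self.counts[label].values())
        score = math.log(self.n_pages[label] / sum(self.n_pages.values()))
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log(
                    (self.counts[label][tok] + 1) / (total + len(self.vocab))
                )
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return int(self._log_score(tokens, 1) > self._log_score(tokens, 0))


def split_batch(pages, clf):
    """Group page indices into documents, closing one after each predicted end page."""
    docs, current = [], []
    for i, text in enumerate(pages):
        current.append(i)
        if clf.predict(text):
            docs.append(current)
            current = []
    if current:  # trailing pages with no detected end page
        docs.append(current)
    return docs


# Hypothetical hand-labelled training pages (1 = last page of a statement).
train_pages = [
    "account summary opening balance transactions continued on next page",
    "transactions continued on next page",
    "closing balance total end of statement",
    "end of statement closing balance thank you",
]
train_labels = [0, 0, 1, 1]
clf = EndPageClassifier().fit(train_pages, train_labels)

# Hypothetical OCR text for a scanned batch of five pages.
batch = [
    "opening balance transactions continued on next page",
    "closing balance end of statement",
    "account summary transactions continued on next page",
    "transactions continued on next page",
    "total closing balance end of statement thank you",
]
print(split_batch(batch, clf))  # [[0, 1], [2, 3, 4]]
```

On real scans the page text will be noisy, so word counts alone may not be enough; layout cues such as page numbers ("Page 1 of 3") or a logo on the first page would be natural extra features to try.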

Such topics are an important part of machine learning research on documents.