I have extracted text from PDF files of companies' annual reports using pdftotext. The extracted content looks like this (a sample PDF file is here):

FORWARD-LOOKING STATEMENTS

In this Annual Report, we have disclosed forward-looking information to enable investors to

comprehend our prospects and take investment decisions. This report and other statements written and oral – that we periodically share contain

forward-looking statements that set out ...//Content Continued like in this format until the paragraph ends//

HERO O FOREVER

Now, if you check the actual report at the link above, the text under Notes to Account and Independent Auditor's Report follows almost the same structure in the reports of all companies in India. Only the Chairman's message and the Board's report vary, but they usually talk about growth, performance, future, investments, etc.

So is there any way to extract only the paragraphs that contain useful information, with multiple paragraphs combined into one when they continue the same information? I have searched, but most of the work I find is on paragraph or document summarization; I did not find anything on extracting the actual continuous blocks of text from documents. Note: there is a lot of noisy data between paragraphs, generated while converting from PDF to text. It is real content in the PDF, such as page numbers or text the company added to the header or footer apart from the page title. For example:

Multi-line paragraph R

N

O

I

WE W DS REMA INDS

BRANVANT IN M

RELE MARKETSDES

AND SS DECA

ACRO

6

Next Multiline paragraph

Carlos Mougan
Sanjeev

1 Answer

It's not always possible to extract paragraphs from a PDF, since a paragraph is sometimes split across multiple PDF frames, so pdftotext splits it into different paragraphs even though they are actually linked. Similarly, some frames end up collocated even though they represent different information, like the menu in the example PDF.

Here is a simple approach to split a text file into multiple paragraphs using empty lines:

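def txt2paragraph(filepath):
    """Yield each paragraph of a text file, using empty lines as separators."""
    with open(filepath) as f:
        lines = f.readlines()

    paragraph = ''
    for line in lines:
        if line.isspace():  # an empty line ends the current paragraph
            if paragraph:
                yield paragraph
                paragraph = ''
        else:
            # join wrapped lines with a single space, without a leading space
            paragraph += (' ' if paragraph else '') + line.strip()
    if paragraph:  # avoid yielding an empty string at the end of the file
        yield paragraph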
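
Splitting on empty lines still leaves the noisy fragments from the question (single letters, page numbers, header text) as their own "paragraphs". As a rough sketch only, and not something pdftotext provides, a word-count heuristic can drop them and re-join blocks that were cut mid-sentence. The names `looks_like_prose` and `merge_prose` and the thresholds `min_words=8` and `min_avg_word_len=3.0` are assumptions for illustration and would need tuning on real reports.

```python
import re


def looks_like_prose(paragraph, min_words=8, min_avg_word_len=3.0):
    """Heuristic: prose has enough words, and the words are not tiny fragments."""
    words = re.findall(r"[A-Za-z]+", paragraph)
    if len(words) < min_words:
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    return avg_len >= min_avg_word_len


def merge_prose(paragraphs, min_words=8):
    """Drop noisy paragraphs; join a kept block onto the previous kept block
    when that one does not end with sentence-ending punctuation."""
    merged = []
    for p in paragraphs:
        if not looks_like_prose(p, min_words):
            continue  # skip headers, page numbers, stray letters
        if merged and not merged[-1].endswith(('.', '!', '?', ':')):
            merged[-1] += ' ' + p  # assumed continuation of the same sentence
        else:
            merged.append(p)
    return merged
```

For example, `merge_prose(list(txt2paragraph(path)))` would apply it to the paragraphs of one extracted file. Short but meaningful headings are discarded by this heuristic too, so it only suits cases where the running text is what matters.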
amirouche