How can I detect blocks of text from scanned document images

Question

ORIGINAL IMAGE:

GOAL:

I want to separate the text into individual paragraphs by placing a bounding box over each one (as shown above).

I tried to do this with a traditional computer vision approach using OpenCV.

First, I plotted character-level bounding boxes.
Next, I converted the image to grayscale and binarized it.
Then I applied dilation.
Finally, I placed bounding boxes over the dilated image.

This is what I get:

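import cv2
import numpy as np

# im_bw (binarized image) and im_new (image to draw on) come from the earlier steps

# Morphological transformation: dilate so neighbouring characters merge into blobs
kernel = np.ones((3, 4), np.uint8)  # OpenCV kernels are normally uint8
dilation = cv2.dilate(im_bw, kernel)
cv2.imwrite('dilated.png', dilation)

# Plotting rectangular boxes
# Threshold the dilated image (not im_bw) so the contours follow the merged blobs
ret, thresh = cv2.threshold(dilation, 127, 255, 0)
# [-2:] keeps (contours, hierarchy) on both OpenCV 3.x and 4.x
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]

for c in contours:
    rect = cv2.boundingRect(c)
    # Skip regions narrower or shorter than 50 px
    if rect[2] < 50 or rect[3] < 50:
        continue
    print(cv2.contourArea(c))
    x, y, w, h = rect
    cv2.rectangle(im_new, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite('sample_res_inner.jpg', im_new)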

Because the image is a scan and the spacing between lines is small, I wasn't able to segment the text into paragraphs.

How can I get my desired result?

score 3 · Answer 1 · answered Mar 15 '19 at 08:46

There are two options:

Scan the images at a higher DPI. This should accentuate the vertical separation between paragraphs.
Train a deep learning model for scene text detection. Examples: https://github.com/qjadud1994/CRNN-Keras and https://github.com/mvoelk/ssd_detectors

AlexK · Answer 2 · 2019-03-25T23:34:41.867

Have you looked into Tesseract (and its Python wrapper/interface: pytesseract)? I don't guarantee that it will solve your problems entirely, but it offers bounding box and OCR features.
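As a rough illustration of the bounding box feature, here is a minimal sketch that merges the word-level boxes returned by `pytesseract.image_to_data` into one box per paragraph, using the `block_num` and `par_num` fields of its output. The helper names `paragraph_boxes` and `detect_paragraphs` are made up for this example, and how well Tesseract separates paragraphs on a given scan is not guaranteed.

```python
from collections import OrderedDict


def paragraph_boxes(data):
    """Merge word boxes into one (x, y, w, h) box per paragraph.

    `data` is a dict shaped like the output of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    boxes = OrderedDict()
    for i, text in enumerate(data["text"]):
        if not str(text).strip():
            continue  # skip structural rows that carry no recognised word
        key = (data["block_num"][i], data["par_num"][i])
        x0, y0 = data["left"][i], data["top"][i]
        x1, y1 = x0 + data["width"][i], y0 + data["height"][i]
        if key in boxes:
            bx0, by0, bx1, by1 = boxes[key]
            boxes[key] = (min(bx0, x0), min(by0, y0), max(bx1, x1), max(by1, y1))
        else:
            boxes[key] = (x0, y0, x1, y1)
    return [(x0, y0, x1 - x0, y1 - y0) for x0, y0, x1, y1 in boxes.values()]


def detect_paragraphs(image_path):
    """Hypothetical helper: draw one green box per paragraph found by Tesseract."""
    # Imported here so paragraph_boxes() stays usable without these packages.
    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR
    data = pytesseract.image_to_data(rgb, output_type=Output.DICT)
    for x, y, w, h in paragraph_boxes(data):
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    return image

# Example call (requires Tesseract to be installed; the file name is a placeholder):
# cv2.imwrite('paragraphs.jpg', detect_paragraphs('scan.png'))
```

Grouping by `(block_num, par_num)` relies on Tesseract's own layout analysis, so the result depends on the scan quality and on the page segmentation mode in use.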

This Tesseract site lists the possible page segmentation modes that you could play around with.
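One way to play around with the modes is to run the same image through a few of them and compare how many paragraphs each one reports. The sketch below passes the mode through the `--psm` option in pytesseract's `config` argument. The function names and the particular set of modes tried are choices made for this example, not a recommendation from the Tesseract documentation.

```python
def count_paragraphs(data):
    """Count distinct (block_num, par_num) pairs that contain at least one word.

    `data` is a dict shaped like the output of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    seen = set()
    for i, text in enumerate(data["text"]):
        if str(text).strip():
            seen.add((data["block_num"][i], data["par_num"][i]))
    return len(seen)


def compare_psm_modes(image, modes=(1, 3, 4, 6)):
    """Hypothetical helper: map each page segmentation mode to a paragraph count."""
    # Imported here so count_paragraphs() stays usable without pytesseract.
    import pytesseract
    from pytesseract import Output

    results = {}
    for psm in modes:
        data = pytesseract.image_to_data(
            image, config="--psm {}".format(psm), output_type=Output.DICT
        )
        results[psm] = count_paragraphs(data)
    return results

# Example call (requires Tesseract; `image` is a PIL image or NumPy array):
# print(compare_psm_modes(image))
```

A paragraph count is only a crude signal; inspecting the resulting boxes on the image is still the more reliable check.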

There is also this page that provides some quality improvement suggestions.

There are many questions and answers on Stack Overflow about specific use cases. In this answer, for example, there is a recommendation to use OSM mode to detect multiple columns.

And there is this SO answer that offers a way to break text into paragraphs.
