ORIGINAL IMAGE:

enter image description here

GOAL:

enter image description here

I want to separate texts into individual paragraphs by placing bounding boxes over them (as shown above).

I tried to do this with a traditional computer vision approach using OpenCV:

  1. I plotted character-level bounding boxes.
  2. Next, I converted the image to grayscale and binarized it.
  3. Applied dilation.
  4. Finally, placed bounding boxes over the dilated image.

This is what I get:

enter image description here

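import cv2
import numpy as np

# im_bw is the binarized image and im_new is the image to draw on (from the steps above)

# Morphological transformation
kernel = np.ones((3, 4), np.uint8)
dilation = cv2.dilate(im_bw, kernel)
cv2.imwrite('dilated.png', dilation)

# Plotting rectangular boxes on the dilated image
ret, thresh = cv2.threshold(dilation, 127, 255, 0)
# findContours returns 3 values in OpenCV 3.x and 2 in 4.x; [-2] is the contour list in both
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

for c in contours:
    rect = cv2.boundingRect(c)
    # skip boxes narrower or shorter than 50 px
    if rect[2] < 50 or rect[3] < 50:
        continue

    print(cv2.contourArea(c))
    x, y, w, h = rect
    cv2.rectangle(im_new, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite('sample_res_inner.jpg', im_new)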

Since the image is a scan and the line spacing between paragraphs is small, I wasn't able to segment the text into paragraphs.

How can I get my desired result?

DGS

2 Answers


There are two options:

  1. Scan the images with a higher DPI. This should accentuate vertical separation between paragraphs.
  2. Train a deep learning model for scene text detection. Examples: https://github.com/qjadud1994/CRNN-Keras and https://github.com/mvoelk/ssd_detectors
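
To illustrate why vertical separation matters for option 1, here is a minimal sketch that splits a binarized page into bands wherever a run of blank rows is at least `min_gap` pixels tall. The function name `split_rows_by_gap` and the `min_gap` parameter are hypothetical, and the sketch assumes ink pixels are nonzero and the page is not skewed.

```python
import numpy as np


def split_rows_by_gap(binary, min_gap):
    """Split a binarized page into horizontal bands separated by blank gaps.

    binary  : 2-D array, nonzero where there is ink (assumption).
    min_gap : minimum number of consecutive blank rows that ends a band.
    Returns a list of (top, bottom) row ranges, bottom exclusive.
    """
    has_ink = (np.asarray(binary) > 0).any(axis=1)  # True for rows containing ink
    bands = []
    start = None     # first ink row of the band being built
    last_ink = None  # most recent ink row seen
    for y, ink in enumerate(has_ink):
        if not ink:
            continue
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank run since the last ink row is tall enough: close the band.
            bands.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        bands.append((start, last_ink + 1))
    return bands
```

At a higher scan resolution the gap between paragraphs spans more pixel rows than the gap between lines, so a value of `min_gap` that separates the two should be easier to find; the right value still has to be chosen per document.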
Shamit Verma

Have you looked into Tesseract (and its Python wrapper/interface: pytesseract)? I don't guarantee that it will solve your problems entirely, but it offers bounding box and OCR features.
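
As a rough sketch of the bounding box feature: `pytesseract.image_to_data` returns one row per layout element, and rows whose `level` is 3 are paragraphs, so their boxes can be read off directly. The helper names `paragraph_boxes` and `ocr_paragraph_boxes` are hypothetical, and the default `--psm 3` is only an assumption to experiment with; how well Tesseract's paragraph detection works on a tightly spaced scan is not guaranteed.

```python
def paragraph_boxes(data):
    """Return (x, y, w, h) for each paragraph-level row of image_to_data output.

    data: the dict produced by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    boxes = []
    for i, level in enumerate(data["level"]):
        # Tesseract's hierarchy: 1 page, 2 block, 3 paragraph, 4 line, 5 word.
        if int(level) == 3:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def ocr_paragraph_boxes(image_path, psm=3):
    """Run Tesseract on an image file and return its paragraph boxes."""
    # Imported lazily so paragraph_boxes() is usable without Tesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image_path, config=f"--psm {psm}", output_type=Output.DICT
    )
    return paragraph_boxes(data)
```

The returned tuples can be passed to `cv2.rectangle` in the same way as the contour boxes in the question.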

On this Tesseract site it lists possible page segmentation modes that you could play around with.

There is also this page that provides some quality improvement suggestions.

There are many questions and answers on Stack Overflow about specific use cases. In this answer, for example, there is a recommendation to use a page segmentation mode with OSD (orientation and script detection) to detect multiple columns.

And there is this SO answer that offers a way to break text into paragraphs.

AlexK