ArabicOCR — amazing OCR library for Arabic pdf documents

Shekhar Khandelwal
Dec 12, 2021

There are many OCR libraries out there, such as Tesseract, EasyOCR and keras-ocr, to name a few. All of them work quite well on English. But not all of them work as accurately and smoothly on other languages such as Arabic.

In my recent work, I came across a problem where I first needed to identify whether the incoming PDF data was editable or non-editable. In either case, we needed to extract the whole PDF content for further data analytics.
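One way to make that distinction is to check whether any text can be extracted directly from the PDF. The sketch below assumes that a non-editable PDF yields little or no extractable text. The helper names and the `min_chars` threshold are hypothetical and would need tuning; `fitz` is PyMuPDF, the library used later in this post for image conversion.

```python
def looks_editable(page_texts, min_chars=20):
    """Return True if the extracted text suggests a real text layer.

    page_texts: list of strings, one per page.
    min_chars: hypothetical threshold; tune it for your documents.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def pdf_is_editable(pdf_path, min_chars=20):
    """Open a PDF with PyMuPDF and check for extractable text."""
    import fitz  # PyMuPDF, imported lazily so looks_editable has no dependency

    with fitz.open(pdf_path) as doc:
        page_texts = [page.get_text() for page in doc]
    return looks_editable(page_texts, min_chars)
```

A scanned document would be expected to fail this check and go down the OCR route, while an editable one could have its text read directly.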

For non-editable PDFs, I needed an OCR library that could extract Arabic content from the PDF accurately. That’s when I came across this amazing Python OCR library built specifically for the Arabic language, called ArabicOCR.

Official repository: ArabicOcr · PyPI

Sample tutorial: Google Colab

Now, if a PDF is non-editable, it usually means that an image has been converted into PDF format. And in an industrial setup, you will usually get the document as a PDF, not as a JPG or PNG.

Here is a sample pdf document.

The first step is to convert the document to PNG/JPG format.

For this, refer to this article, where I explain another amazing Python library for working with PDF documents. We will do a lot of amazing things with this library on PDF data.

PyMuPDF — amazing python library for pdf data — Shekhar Khandelwal — Medium

Official PyMuPDF documentation: PyMuPDF 1.19.2 documentation

First, let’s import the PyMuPDF library and convert the PDF to an image.

import fitz  # PyMuPDF

pdf = "arabic_image.pdf"
doc = fitz.open(pdf)  # open document
for page in doc:  # iterate through the pages
    pix = page.get_pixmap()  # render page to an image
    pix.save("page-%i.png" % page.number)  # store image as a PNG

Since the PDF in this example has only one page, only one image is generated. In general, the above code generates one image per page of the PDF.

Here is the converted image of the PDF:
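For multi-page PDFs it can help to keep track of which image files were written. The following is a minimal sketch of the same conversion wrapped in a function that returns the generated file names; `page_image_name` and `pdf_to_images` are hypothetical helper names, not part of PyMuPDF.

```python
def page_image_name(page_number, prefix="page"):
    """Build the PNG file name for a zero-based page number."""
    return "%s-%i.png" % (prefix, page_number)


def pdf_to_images(pdf_path, prefix="page"):
    """Render every page of a PDF to a PNG and return the file names."""
    import fitz  # PyMuPDF, imported lazily

    paths = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap()  # render page to an image
            path = page_image_name(page.number, prefix)
            pix.save(path)  # store image as a PNG
            paths.append(path)
    return paths
```

Calling `pdf_to_images("arabic_image.pdf")` on the one-page sample would be expected to return a single file name, `page-0.png`.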

Now, let’s start by installing the ArabicOCR package:

!pip install ArabicOcr

Import the package in your program

from ArabicOcr import arabicocr

Using the image file, run the code below to extract the Arabic text.

image_path = 'page-0.png'  # input image produced from the PDF
out_image = 'out.jpg'      # annotated output image written by the library
results = arabicocr.arabic_ocr(image_path, out_image)
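If the PDF has several pages, the same call can be repeated for each page image. Here is a rough sketch, assuming the OCR function takes an input path and an output path as in the call above. The helper `ocr_pages` and the `-out.jpg` naming scheme are hypothetical, and the OCR function is passed in as a parameter so the helper itself does not depend on any particular library.

```python
def ocr_pages(image_paths, ocr_fn):
    """Run ocr_fn on each page image and return {image_path: results}."""
    all_results = {}
    for image_path in image_paths:
        # Hypothetical naming scheme for the annotated output image
        out_image = image_path.rsplit(".", 1)[0] + "-out.jpg"
        all_results[image_path] = ocr_fn(image_path, out_image)
    return all_results


# Possible usage with the library from this post:
# from ArabicOcr import arabicocr
# all_results = ocr_pages(["page-0.png", "page-1.png"], arabicocr.arabic_ocr)
```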

In the console, you will see output something like this:

The result is a list of lists, each containing the extracted Arabic text along with its location.

print(results) 

Let’s get the extracted text into a file for further processing of the data.

# Each result entry holds the location at index 0 and the text at index 1
words = []
for result in results:
    words.append(result[1])
with open('file.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(words))
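The code above stores the Python list representation of the words. If plain text is easier for downstream processing, the words could instead be joined one per line. This is only a sketch: `words_to_text` is a hypothetical helper, and `sample_results` is made-up data that mimics the location-then-text layout described above.

```python
def words_to_text(results):
    """Join the text of each result entry (index 1), one word per line."""
    return "\n".join(entry[1] for entry in results)


# Made-up sample in the [location, text] layout, for illustration only
sample_results = [
    [[[0, 0], [10, 0], [10, 5], [0, 5]], "مرحبا"],
    [[[12, 0], [20, 0], [20, 5], [12, 5]], "بكم"],
]
text = words_to_text(sample_results)
```

The resulting string can then be written with `myfile.write(text)` in place of `str(words)`.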

Similarly, we can get the locations of the text from results.

# The location of each piece of text sits at index 0 of its result entry
annotations = []
for result in results:
    annotations.append(result[0])
with open('annotations.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(annotations))
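For cropping or drawing, a simple rectangle is often easier to work with than raw corner points. The sketch below assumes each location is a list of four [x, y] corner points; this post does not spell out the exact format, so check it against your own `results` first. `to_rect` is a hypothetical helper and `sample_box` is made-up data.

```python
def to_rect(box):
    """Convert corner points into (x_min, y_min, x_max, y_max)."""
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up corner points, for illustration only
sample_box = [[12, 30], [96, 30], [96, 58], [12, 58]]
rect = to_rect(sample_box)
```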

Finally, the code also produces an output image with annotations for every word in the document.

You can use OpenCV to read the annotated image and matplotlib to display it.

import cv2
import matplotlib.pyplot as plt

img = cv2.imread('out.jpg')  # OpenCV loads images in BGR channel order
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # matplotlib expects RGB
plt.figure(figsize=(10, 10))
plt.imshow(img)
plt.show()

GitHub link: shekharkhandelwal1983/ArabicOCR (github.com)

Thanks & Happy Learning!

Shekhar Khandelwal

Data Scientist with a major in Computer Vision. I love to blog and share knowledge with the data community.