मनुर्भव: { मनुष्य बनो } Be Humane: Free Hindi OCR software from Google

Tuesday, June 11, 2013

Free Hindi OCR software from Google

Extract Hindi Text from Scanned Hindi Documents

Google maintains an open-source OCR program, Tesseract (originally developed at HP), which also supports Hindi. This is good news for all Hindi users: with this program you can extract Hindi text from scanned Hindi documents.

However, the program does not give 100% correct output, so you will need to correct the text afterwards. Even so, it saves a lot of time compared to typing the full text by hand.

This program doesn't have a graphical user interface and runs from the command line, but it is very simple to use.

You can visit this link to download the program:

http://code.google.com/p/tesseract-ocr

During installation, check the box for downloading the Hindi language files.
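
If you are not sure whether the Hindi files were installed, a small script can check for you. This is only a sketch, assuming your Tesseract version supports the `--list-langs` option and that the `tesseract` command is on your PATH; the function names `parse_languages` and `hindi_installed` are made up for this example.

```python
import subprocess


def parse_languages(listing):
    """Pull language codes out of the text printed by `tesseract --list-langs`."""
    codes = set()
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        codes.add(line)
    return codes


def hindi_installed(tesseract="tesseract"):
    """Return True if 'hin' appears among the installed languages."""
    result = subprocess.run(
        [tesseract, "--list-langs"], capture_output=True, text=True
    )
    # Some versions print the list to stderr, so look at both streams.
    return "hin" in parse_languages(result.stdout + "\n" + result.stderr)
```

Calling `hindi_installed()` from a Python prompt should return True when the Hindi data is present. If it returns False, run the installer again and tick the Hindi box.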

After installation, do the following for extracting Hindi text.

0. Put your scanned Hindi image file on the C drive, e.g. C:\scanned_hindi_text.jpg
1. Click the Start button (Windows) and click Run.
2. Type CMD and press Enter.
3. Type "CD\" (without quotes) at the command prompt and press Enter.
4. Type "cd C:\Program Files\Tesseract-OCR" (without quotes) at the command prompt and press Enter.
5. Use the following syntax:

tesseract c:\scanned_hindi_text.jpg c:\out0001 -l hin

6. This will create a file named "out0001.txt" on the C drive, containing the Hindi text from the scanned image.
7. Open out0001.txt in Notepad and make the necessary corrections.
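
If you have many scanned pages, typing the command for each one gets tedious. The steps above can be sketched as a short Python script that runs the same `tesseract <image> <output base> -l hin` command for every JPG in a folder. This is a rough sketch, not part of Tesseract itself: the names `build_command` and `ocr_folder` are made up, it assumes `tesseract` is on your PATH, and it writes each text file next to its image instead of to C:\out0001.

```python
import subprocess
from pathlib import Path


def build_command(image, lang="hin", tesseract="tesseract"):
    """Build the same command line as in step 5 for one image file."""
    image = Path(image)
    # Tesseract adds ".txt" to the output name itself, so drop the extension.
    out_base = image.with_suffix("")
    return [tesseract, str(image), str(out_base), "-l", lang]


def ocr_folder(folder, lang="hin"):
    """Run Tesseract on every .jpg file in the folder, one after another."""
    for image in sorted(Path(folder).glob("*.jpg")):
        # check=True stops with an error if Tesseract fails on a file.
        subprocess.run(build_command(image, lang), check=True)
```

For example, `ocr_folder(r"C:\scans")` would turn C:\scans\page1.jpg into C:\scans\page1.txt, where C:\scans is just a placeholder folder name. You still need to open each text file and correct it by hand.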

Here is the screenshot of extracting Hindi text from a scanned Hindi document:

Happy extracting :)

2 comments:

Hanuman Prasad JI poddar-Bhaiji, June 16, 2013 at 10:49 PM
I saw your app on Google Play, the Gita (the Gorakhpur edition); it is very good. Could you also upload Sadhak Sanjivani? It is Gita Press's commentary on the Gita. Ram Ram. My email id: radha.krishan.bhaiji@gmail.com. If you like, I can give you the PDF file.
Varun, July 28, 2013 at 1:53 AM
Thanks a ton... it's really helpful!