Extract Hindi Text from Scanned Hindi Documents
Tesseract, an open-source OCR program whose development Google has sponsored, also supports Hindi. This is good news for Hindi users, because the program can extract Hindi text from scanned Hindi documents.
However, the output is not 100% correct and you will need to fix the text by hand, but it still saves a lot of time compared with typing the whole text.
The program has no graphical user interface and runs from the command line, but it is very simple to use.
You can visit this link to download the program:
During installation, check the box for downloading the Hindi language files.
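To confirm that the Hindi language files really were installed, one option is to ask Tesseract which languages it knows about. The Python sketch below is only an illustration: it assumes a Tesseract version that supports the `--list-langs` option and that `tesseract` can be found on the PATH. The helper names `has_language` and `installed_languages_text` are made up for this example.

```python
import subprocess


def has_language(listing, code="hin"):
    """Return True if `code` appears as its own line in the language listing."""
    lines = [line.strip() for line in listing.splitlines()]
    return code in lines


def installed_languages_text(exe="tesseract"):
    """Run `tesseract --list-langs` and return whatever it printed.

    Some versions print the list to stderr rather than stdout, so both
    streams are combined here.
    """
    result = subprocess.run(
        [exe, "--list-langs"],
        capture_output=True,
        text=True,
    )
    return result.stdout + "\n" + result.stderr


if __name__ == "__main__":
    if has_language(installed_languages_text()):
        print("Hindi (hin) language data is installed.")
    else:
        print("Hindi (hin) language data was not found; re-run the installer.")
```

If "hin" is missing from the list, running the installer again and ticking the Hindi box should fix it.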
After installation, follow these steps to extract Hindi text:
0. Put your scanned Hindi image file on the C drive, e.g. C:\scanned_hindi_text.jpg
1. Click the Start button (Windows) and click Run.
2. Type CMD and press Enter.
3. Type "CD\" (without quotes) at the command prompt and press Enter.
4. Type "cd C:\Program Files\Tesseract-OCR" (without quotes) at the command prompt and press Enter.
5. Use the following syntax:
tesseract c:\scanned_hindi_text.jpg c:\out0001 -l hin
6. This creates a file named "out0001.txt" on the C drive, containing the Hindi text from the scanned image.
7. Open out0001.txt in Notepad and make the necessary corrections.
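If you have many pages to process, the command from step 5 can also be run from a small script instead of typing it each time. The following Python sketch is one possible way to do that, not part of Tesseract itself: it assumes Python 3 is installed and that `tesseract` is on the PATH (otherwise pass the full path to the executable as `exe`). The function names `build_command` and `extract_text` are hypothetical helpers for this example.

```python
import subprocess
from pathlib import Path


def build_command(image, out_base, lang="hin", exe="tesseract"):
    """Build the same command line as step 5: tesseract IMAGE OUTBASE -l LANG."""
    return [exe, str(image), str(out_base), "-l", lang]


def extract_text(image, out_base, lang="hin", exe="tesseract"):
    """Run Tesseract and return the recognised text.

    Tesseract appends ".txt" to the output base name itself, so the
    result is read back from OUTBASE.txt.
    """
    subprocess.run(build_command(image, out_base, lang, exe), check=True)
    return Path(str(out_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Example paths taken from the steps above.
    text = extract_text(r"C:\scanned_hindi_text.jpg", r"C:\out0001")
    print(f"Extracted {len(text)} characters into C:\\out0001.txt")
```

The output file still needs the same manual corrections in Notepad as in step 7.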
Here is a screenshot of extracting Hindi text from a scanned Hindi document:
Happy extracting :)