Friday, July 19, 2019

EXTRACT TEXT FROM PDF FILES

*Accurate Extraction of Text even with improper Font Encoding*

SOFTWARE REQUIRED

Download & install

PDFlib TET Plugin from here:
http://www.pdflib.com/download/free-software/tet-plugin/

and

PDFlib FontReporter (optional) from here:
http://www.pdflib.com/download/free-software/fontreporter/

Other useful software:
SortProgram.exe from http://vedicastrology.wikidot.com/sort-program
&
HandyFile Find & Replace from http://www.silveragesoftware.com/archive.php

----------------------------------------------------------------------------

Open Adobe Acrobat, click the Plug-Ins tab, select "PDFlib TET Plugin" and open "TET Configuration". Set Output to TETML and "TETML Destination" to Directory.


A. Create the TETML file with the PDFlib TET Plugin: right-click on any page and select "Copy Contents of All Pages".
B. Open the TETML file in Notepad, delete the text up to the first <Glyph font... and save it as a text file.
C. Open this text file in MS Word.


Note: The fonts are listed at the bottom of the XML file. F0 should be the main font; characters set in any other font may need to be removed.
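To check which font really is the main one, the Glyph tags can be counted per font id. This is only a rough sketch: the sample lines are made up, and their attribute layout (font, size, x, y, width, fill) is assumed from the patterns used in this post, not taken from the TETML specification.

```python
import re
from collections import Counter

# Matches the font id inside a Glyph start tag, e.g. <Glyph font="F0" ...>
FONT_ID = re.compile(r'<Glyph font="(F\d+)"')


def count_glyph_fonts(tetml_text):
    """Return a Counter mapping each font id (F0, F1, ...) to its Glyph tag count."""
    return Counter(FONT_ID.findall(tetml_text))


# Hypothetical sample lines; a real TETML file may carry more attributes.
sample = (
    '<Glyph font="F0" size="12" x="72.00" y="700.50" width="40.10" fill="C0">Hello</Glyph>\n'
    '<Glyph font="F1" size="8" x="72.00" y="694.00" width="5.00" fill="C0">*</Glyph>\n'
    '<Glyph font="F0" size="12" x="72.00" y="688.25" width="38.70" fill="C0">world</Glyph>\n'
)

counts = count_glyph_fonts(sample)
# The font id with the highest count is the likely main font.
print(counts.most_common())
```

If a font other than F0 comes out on top, the patterns in this post would have to be adjusted to that id.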

Open the Find and Replace dialog box with CTRL+H.

[Tick the "Use wildcards" option for the patterns below that contain round brackets (). "with NULL" means leave the "Replace with" box empty.]
0. Replace (\<Options\>)(*)(\</Options\>) with NULL
1. Replace (\<Glyph font="F1")(*)(\</Glyph\>) with NULL
2. Replace (\<Glyph font="F2")(*)(\</Glyph\>) with NULL
3. Replace (\<Glyph font="F3")(*)(\</Glyph\>) with NULL
4. Replace (\<Glyph font="F4")(*)(\</Glyph\>) with NULL
[SortProgram.exe can be used to extract the lines containing "F0"; steps 0 to 4 can then be skipped and you can start directly from step 5. Remember to convert Carriage Return "\r" & New Line "\n" to "\r\n" using HandyFile Find & Replace.]
5. Replace (\<Glyph font="F0" size=")(*)(" y=") with NULL
6. Replace (\<Page)*(\>) with NULL
7. Replace (" width=")(*)(" fill="C0"\>) with $$$$$
8. Replace ([0-9]{1,3})(.)([0-9]{1,2}) with \1
9. Replace $$$$$ with ^t
10. Replace <>^p with NULL
11. Replace (\<)(*)(\>) with NULL
12. Replace ^p^p with ^p and repeat until no ^p^p is left
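The replace steps above boil down to keeping the F0 glyph lines and reducing each one to "y position, tab, text". A minimal Python sketch of the same idea follows. It is not the Word procedure itself, and the sample lines are hypothetical: their attribute order (font, size, x, y, width, fill) is assumed from the patterns above, with x being a guess.

```python
import re

# Assumed shape of one F0 glyph line, based on the attribute names used in
# the replace patterns above (font, size, y, width, fill); x is a guess.
GLYPH_F0 = re.compile(
    r'<Glyph font="F0" size="[^"]*".*? y="(\d+)(?:\.\d+)?"'
    r' width=".*?" fill="C0">(.*?)</Glyph>'
)


def glyph_rows(tetml_text):
    """Return (y, text) pairs for every F0 glyph line, y cut to its integer part."""
    rows = []
    # splitlines() accepts \r, \n and \r\n, so no line-ending conversion is needed.
    for line in tetml_text.splitlines():
        match = GLYPH_F0.search(line)
        if match:
            rows.append((int(match.group(1)), match.group(2)))
    return rows


def rows_to_tsv(rows):
    """Format the rows as tab-separated lines, ready to paste into two columns."""
    return "\n".join(f"{y}\t{text}" for y, text in rows)


# Hypothetical sample; a real TETML file may look different.
sample = (
    '<Page number="1" width="595" height="842">\r\n'
    '<Glyph font="F0" size="12" x="72.00" y="700.50" width="40.10" fill="C0">Hello</Glyph>\r\n'
    '<Glyph font="F1" size="8" x="72.00" y="694.00" width="5.00" fill="C0">*</Glyph>\r\n'
    '<Glyph font="F0" size="12" x="72.00" y="688.25" width="38.70" fill="C0">world</Glyph>\r\n'
    '</Page>\r\n'
)

rows = glyph_rows(sample)
print(rows_to_tsv(rows))
```

Here only the y value loses its decimals, so numbers such as 3.14 inside the extracted text stay untouched.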

Copy all the text and paste it into Excel (columns A and B). Column A now holds the y position of each line and column B its text.

Then in column C use this formula:

=IF(A1<>A2,(IF( ((A1-A2)/3)<12,"<BR>","<P>")),"")

and drag it down to the last row.
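The formula compares the y position of each row with the next one: no tag when they are equal, <BR> when the difference divided by 3 is below 12 (a drop of less than 36), <P> otherwise. The same logic can be sketched in Python. The function names are made up for illustration, and padding the column with 0 assumes Excel reads the blank cell below the last row as 0.

```python
def break_tag(y_current, y_next):
    """Mirror =IF(A1<>A2,(IF(((A1-A2)/3)<12,"<BR>","<P>")),"") for one row."""
    if y_current == y_next:
        return ""
    return "<BR>" if (y_current - y_next) / 3 < 12 else "<P>"


def tags_for_column(ys):
    """Apply break_tag down a whole column of y values."""
    # Assumption: the empty cell after the last row counts as 0, as in Excel.
    padded = list(ys) + [0]
    return [break_tag(a, b) for a, b in zip(padded, padded[1:])]


# Hypothetical y values: two glyphs on one line, a line break, then a new paragraph.
print(tags_for_column([700, 700, 688, 640]))
```

The divisor 3 and the limit 12 come straight from the formula; they may need tuning for documents with a different line spacing.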

Now copy columns B and C and paste them into Notepad. Save the text file and open it in Word.

Replace ^t with NULL
Replace ^p with NULL

Save the text file, change its extension to .html and open it in Firefox.
In Firefox, select all the text, copy it and paste it into Notepad. Save the file.
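The Firefox round trip only turns the <BR> and <P> tags back into line breaks. If no browser is at hand, roughly the same result can be had with Python's standard html.parser module. This is a sketch: the class and function names are made up, and a browser may handle whitespace a little differently.

```python
from html.parser import HTMLParser


class BreakTagTextExtractor(HTMLParser):
    """Collect text, turning <br> into one newline and <p> into a blank line."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <P> arrive as br and p.
        if tag == "br":
            self.parts.append("\n")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html_text):
    """Return the plain text of an HTML fragment with line breaks restored."""
    parser = BreakTagTextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hypothetical fragment in the shape produced by the steps above.
print(html_to_text("Hello<BR>world<P>Next paragraph"))
```

Entities such as &amp; are decoded along the way, much as they are when copying from the browser.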



