
September 27th, 2012, 11:19 PM
Extracting and translating PDFs to text
I want to use Python to extract text from PDF files. In particular, I am following the Fukushima disaster and I want to extract the data in TEPCO's press handouts and other releases, which are often only available in Japanese. The idea is to then translate the Japanese text into English.
Two examples are here:
http://www.tepco.co.jp/en/nu/fukushima-np/images/handouts_120925_01-e.pdf (English)
http://www.tepco.co.jp/nu/fukushima-np/images/handouts_120924_02-j.pdf (Japanese)
The route I think is best (please correct me if I am wrong) is to use the pdf2svg program to create an SVG copy of each page.
I have managed to do this, but the SVG appears to have
a) inline images (which display OK when I 2x click & display the svg file)
b) none of the text I was expecting to find.
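To make point b) checkable, here is a minimal sketch that counts which element types an SVG page actually contains. It assumes the page has already been converted by pdf2svg and read into a string or bytes; the function name `count_svg_tags` is just illustrative. If the count for `text` is zero while `path`, `use` or `image` are numerous, the converter has most likely drawn the text as shapes or images rather than keeping it as text.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_svg_tags(svg_source):
    """Return a Counter of element names (namespace stripped) found in an SVG.

    svg_source is the SVG document as a str or bytes object.
    """
    root = ET.fromstring(svg_source)
    counts = Counter()
    for elem in root.iter():
        # ElementTree stores tags as '{namespace}name'; keep only the name.
        counts[elem.tag.rsplit('}', 1)[-1]] += 1
    return counts
```

Typical use would be to read one page in binary mode (the file name here is hypothetical), e.g. `count_svg_tags(open('page1.svg', 'rb').read())`, and look at the counts for `text`, `path`, `use` and `image`.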
Ideally I would like to be able to extract the table (see the second, Japanese pdf example) or at least read the figures as text.
I have figured out how to translate strings (French, Spanish, Italian) with the Microsoft translator, so I assume that IF I can extract the Japanese text, I will be able to translate those sections too.
1) Am I being naive in thinking this can be done easily?
2) Is everything in the SVG files (as XML data) rendered as images, points, etc., rather than as text?
3) Can the inline images be base64 decoded and displayed as images (or saved as PNG files) by Python using PIL?
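For question 3), a rough sketch of what I have in mind, using only the standard library. It assumes the inline images are stored as `data:image/...;base64,...` URIs in the `xlink:href` (or plain `href`) attribute of `<image>` elements, which I have not confirmed for every file; the function name `extract_inline_images` is just illustrative.

```python
import base64
import re
import xml.etree.ElementTree as ET

SVG_IMAGE = '{http://www.w3.org/2000/svg}image'
XLINK_HREF = '{http://www.w3.org/1999/xlink}href'
# Matches e.g. "data:image/png;base64,iVBORw0..." and captures type and payload.
DATA_URI = re.compile(
    r'^data:image/(?P<kind>[\w.+-]+);base64,(?P<payload>.*)$', re.DOTALL)


def extract_inline_images(svg_source):
    """Return a list of (image_type, raw_bytes) for base64 images in an SVG.

    svg_source is the SVG document as a str or bytes object.
    Assumes images are embedded as data URIs on <image> elements.
    """
    root = ET.fromstring(svg_source)
    images = []
    for elem in root.iter(SVG_IMAGE):
        href = elem.get(XLINK_HREF) or elem.get('href') or ''
        match = DATA_URI.match(href.strip())
        if match:
            raw = base64.b64decode(match.group('payload'))
            images.append((match.group('kind'), raw))
    return images
```

Each `raw` value could then be written to disk in binary mode, or handed to PIL with `Image.open(io.BytesIO(raw))` to display it or re-save it as PNG.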
The OS is Debian Squeeze; I am not really interested in Windows at the moment. :-)
Thanks for any help you can give me,
Paul