September 28th, 2012, 12:19 AM
Extracting and translating PDFs to text
I want to use Python to extract text from PDF files. In particular, I am following the Fukushima disaster and I want to extract the data from the press handouts and other releases, which are often available only in Japanese. The idea is to translate the Japanese text into English.
an example is here
The route I think is best (please correct me if I am wrong) is to use the pdf2svg program to create an SVG copy of each page.
I have managed to do this, but the SVG appears to have:
a) inline images (which display OK when I double-click the SVG file to view it)
b) none of the text I was expecting to find.
Ideally I would like to be able to extract the table (see the second, Japanese PDF example) or at least read the figures as text.
I have figured out how to translate strings (French, Spanish, Italian) with the Microsoft translator, so I assume that if I can extract the Japanese text, I will be able to do the same for those sections of text too.
1) Am I being naive in thinking this can be done easily?
2) Is everything in the SVG files (as XML data) rendered as images, points, etc., rather than as text?
3) Can the inline images be base64 decoded and displayed as images (or saved as PNG files) by Python using PIL?
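For questions 2 and 3, a minimal sketch of how the SVG could be inspected with the Python standard library. It assumes the inline images are stored as base64 `data:` URIs in the `href` (or `xlink:href`) attribute of `<image>` elements; the function names are made up for illustration. As far as I understand, pdf2svg usually writes text as glyph outlines rather than `<text>` elements, so a count of zero for `text` would be consistent with what is described above.

```python
import base64
import re
import xml.etree.ElementTree as ET
from collections import Counter

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def count_elements(svg_text):
    """Count elements by tag name (namespace stripped), e.g. text, path, image."""
    root = ET.fromstring(svg_text)
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())


def extract_inline_images(svg_text):
    """Return (mime_type, raw_bytes) for every base64 data-URI <image> element."""
    root = ET.fromstring(svg_text)
    images = []
    for el in root.iter("{%s}image" % SVG_NS):
        # Assumption: the data URI sits in xlink:href or in a plain href.
        href = el.get("{%s}href" % XLINK_NS) or el.get("href") or ""
        match = re.match(r"data:(image/[\w.+-]+);base64,(.*)", href, re.DOTALL)
        if match:
            images.append((match.group(1), base64.b64decode(match.group(2))))
    return images
```

The raw bytes can be written straight to disk with `open(name, "wb")`. If PIL is installed, `Image.open(io.BytesIO(raw_bytes))` should load them for display or conversion.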
OS is Debian Squeeze; I am not really interested in Windows at the moment. :-)
Thanks for any help you can give me,
September 28th, 2012, 12:18 PM
I followed the incident at
Hmm. The TEPCO release (document handouts_120924_02-j.pdf, not the radioactive isotopes one) won't even open with evince, and whatever program runs when I "click it" shows graphs only, no text. xpdf dumped core.
pdftotext produced (what to me looks like a mix of Chinese and Japanese):
Good luck. I think your project is possible.
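A rough sketch of driving pdftotext from Python, assuming the `pdftotext` binary (from poppler-utils or xpdf) is on the PATH. The `-enc UTF-8` and `-layout` options and the `-` output argument (write to stdout) are standard pdftotext options; the function names are only illustrative. `-layout` tries to preserve the physical layout of the page, which may help with tables.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to keep columns and tables aligned
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a Python string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Called as `pdf_to_text("handouts_120924_02-j.pdf")`, this raises an exception rather than returning partial output when pdftotext exits with an error, which seems the safer choice for files that other viewers refuse to open.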
September 28th, 2012, 11:50 PM
A number of these TEPCO handout PDFs give "bad object #" messages when they are run through some of the PDF-to-text programs I have tried, so I guess that TEPCO generates them with something that may not produce standard objects or standard PDF output.
Coupled with the fact that some of the text will probably be in Japanese multibyte characters, I guess this would tax most PDF-to-text programs. This is why I thought it might be better to go through the SVG format, but now I am likely to run into problems extracting the text from there too.
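One way to check whether extracted text really is Japanese (rather than garbage or wrongly decoded bytes) might be to count characters by Unicode block. This is only a heuristic sketch with illustrative names: kanji are shared with Chinese, so it treats the presence of hiragana or katakana as the hint that the text is Japanese.

```python
from collections import Counter

# Unicode blocks: Hiragana, Katakana, CJK Unified Ideographs (kanji).
RANGES = [
    ("hiragana", 0x3040, 0x309F),
    ("katakana", 0x30A0, 0x30FF),
    ("kanji", 0x4E00, 0x9FFF),
]


def script_counts(text):
    """Count non-space characters per script; anything else is 'other'."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        code = ord(ch)
        for name, low, high in RANGES:
            if low <= code <= high:
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return counts


def looks_japanese(text):
    """Heuristic: kana occur in Japanese text but not in Chinese text."""
    counts = script_counts(text)
    return counts["hiragana"] + counts["katakana"] > 0
```

Text that comes back as mostly "other" would suggest the extraction or the decoding went wrong before any translation is attempted.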
If it is possible, I wonder whether I will find out how before I give up on the project altogether!
On edit: I have found a paid service which may work better than messing about too much myself. The cost is not high and I will not be using it much either, so it may be what I am looking for. For 100 A4 pages, the cost as of 29 Sep 12 is 10 dollars (to be used within 90 days) via their API.