#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2008
    Posts
    29
    Rep Power
    0

    Extracting and translating PDFs to text


    I want to use Python to extract text from PDF files. In particular, I am following the Fukushima disaster and I want to extract the data in TEPCO's press handouts and other releases, which are often available only in Japanese. The idea is to translate the Japanese text into English.
    An example is here:
    http://www.tepco.co.jp/en/nu/fukushima-np/images/handouts_120925_01-e.pdf (English)
    http://www.tepco.co.jp/nu/fukushima-np/images/handouts_120924_02-j.pdf (Japanese)

    The route I think is best (please correct me if I am wrong) is to use the pdf2svg program to create an SVG copy of each page.
    I have managed to do this, but the SVG appears to have
    a) inline images (which display OK when I double-click the SVG file to view it)
    b) none of the text I was expecting to find.

    Ideally I would like to be able to extract the table (see the second, Japanese PDF example) or at least read the figures as text.
    I have figured out how to translate strings (French, Spanish, Italian) with the Microsoft Translator, so I assume that IF I can extract the Japanese text, I will be able to do the same for the sections of text too.

    1) Am I being naive in thinking this can be done easily?
    2) Is everything in the SVG files (as XML data) rendered as images, points, etc.?
    3) Can the inline images be base64-decoded and displayed as images (or saved as PNG files) by Python using PIL?

    OS is Debian Squeeze; I am not really interested in Windows at the moment. :-)

    Thanks for any help you can give me,
    Paul
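
Regarding question 3, a minimal sketch of the base64 route, using only the standard library. It assumes the SVG stores its images as `data:image/...;base64,...` URIs in the `href` (or `xlink:href`) attribute of `<image>` elements, which is the usual convention but has not been checked against the actual pdf2svg output here. The function name `extract_inline_images` is made up for illustration.

```python
import base64
import re
import xml.etree.ElementTree as ET

# Attribute name used by SVG 1.1 files for image references.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Matches data URIs such as "data:image/png;base64,iVBORw0...".
DATA_URI = re.compile(
    r"^data:image/(?P<kind>[\w.+-]+);base64,(?P<payload>.*)$", re.DOTALL
)


def extract_inline_images(svg_data):
    """Return a list of (image_type, raw_bytes) for each embedded image.

    svg_data may be a str or the raw bytes read from the .svg file.
    """
    root = ET.fromstring(svg_data)
    images = []
    for element in root.iter():
        # Compare the local tag name so the SVG namespace prefix is ignored.
        if element.tag.rsplit("}", 1)[-1] != "image":
            continue
        href = element.get(XLINK_HREF) or element.get("href") or ""
        match = DATA_URI.match(href.strip())
        if match:
            raw = base64.b64decode(match.group("payload"))
            images.append((match.group("kind"), raw))
    return images
```

Each `raw` value can be written straight to disk with `open(name, "wb")`, so PIL is only needed if the image has to be displayed or converted; in that case `Image.open(io.BytesIO(raw))` should accept the decoded bytes. Note that this recovers pictures only. If the text was drawn as glyph outlines in the SVG, it will not come back as characters this way.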
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,839
    Rep Power
    480
    I followed the incident at
    http://www.houseoffoust.com/fukushima/

    Hmm. The TEPCO release (document handouts_120924_02-j.pdf, not the radioactive isotopes) won't even open with evince, and whatever program runs when I "click it" shows graphs only, no text. xpdf dumped core.


    pdftotext produced (what to me looks like a mix of Chinese and Japanese):

    ...
    福島第一原子力発電所3号機使用済燃料プール内への
    鉄骨滑落事象に関する周辺環境等への影響確認結果

    3号機原子炉建屋上部での瓦礫撤去工事において、使用済燃料プール内に鉄骨1本が滑落する
    事象が発生したことから、周辺環境等への影響の有無について、関連するデータを確認した。
    1.使用済燃料プール周辺での雰囲気線量
    クレーンで吊り下げた線量計を使用済燃料プール上空(9月22日は西側から約3.5m,
    南側から約4.5m)で水面から高さ約2mの位置に設置し、原子炉建屋上部の雰囲気線量を
    測定したところ、事象発生前後で有意な変化は確認されなかった。
    (単位:mSv/h)
    月日
    ...


    Good luck. I think your project is possible.
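
Since pdftotext did return text for this file, here is a rough sketch of calling it from Python. It assumes the `pdftotext` binary from poppler-utils is on the PATH; `-enc UTF-8` and `-layout` are its options for output encoding and for keeping the physical page layout (which may help with the table), and `-` as the output name sends the text to stdout. The function names are invented for this example.

```python
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list for one PDF file."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # Try to preserve column positions, useful for tabular data.
        command.append("-layout")
    # "-" as the output file makes pdftotext write to stdout.
    command += [pdf_path, "-"]
    return command


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext and return the extracted text as a unicode string."""
    raw = subprocess.check_output(pdftotext_command(pdf_path, keep_layout))
    # Replace undecodable bytes instead of failing on a damaged file.
    return raw.decode("utf-8", errors="replace")
```

Usage would be something like `text = pdf_to_text("handouts_120924_02-j.pdf")`. `check_output` raises `CalledProcessError` if pdftotext exits with an error, which is worth catching given how some of these files behave.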
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2008
    Posts
    29
    Rep Power
    0
    A number of these TEPCO handout PDFs give "bad object #" messages when they are run through some of the PDF-to-text programs I have tried, so I guess that TEPCO uses something to generate them which may not produce standard objects or standard PDF output.

    Coupled with the fact that some text will probably be in Japanese multibyte characters, I guess it would tax most PDF-to-text programs. This is why I thought it might be better to go through the SVG format, but now I am likely to run into problems extracting the text from there too.
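
On the multibyte point, a small sketch of how extracted text could be checked once it is a Python 3 `str`. It counts characters by Unicode block (hiragana, katakana, and the main CJK ideograph block) and treats the presence of any kana as a sign that a line is Japanese rather than Chinese. This is only a heuristic, and the function names are made up for the example.

```python
def script_counts(text):
    """Count characters per script using their Unicode code point ranges."""
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0, "other": 0}
    for ch in text:
        code = ord(ch)
        if 0x3040 <= code <= 0x309F:
            counts["hiragana"] += 1
        elif 0x30A0 <= code <= 0x30FF:
            counts["katakana"] += 1
        elif 0x4E00 <= code <= 0x9FFF:
            # CJK Unified Ideographs, shared by Japanese and Chinese.
            counts["kanji"] += 1
        elif not ch.isspace():
            counts["other"] += 1
    return counts


def looks_japanese(text):
    """Heuristic: kana only occur in Japanese, so any kana counts as a hit."""
    counts = script_counts(text)
    return counts["hiragana"] + counts["katakana"] > 0
```

A filter like this could pick out the lines worth sending to the translator and leave the numeric rows alone. The multibyte encoding itself is not a problem inside Python: each of these characters is a single element of the `str`, and only becomes three bytes when encoded as UTF-8.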

    If it is possible, I wonder whether I will find out how before I give up on the project altogether!

    On edit: I have found a paid service which may work better than messing about too much myself. The cost is not high and I will not be using it much, so it may be what I am looking for. As of 29 Sep 12, the cost via their API is 10 dollars for 100 A4 pages (to be used within 90 days).
