Python Programming
 
  #1  
September 27th, 2012, 11:19 PM
ocpaul20
Extracting and translating PDFs to text

I want to use Python to extract text from PDF files. In particular, I am following the Fukushima disaster and want to extract the data from TEPCO's press handouts and other releases, which are often available only in Japanese. The idea is to translate the Japanese text into English.
Examples are here:
http://www.tepco.co.jp/en/nu/fukushima-np/images/handouts_120925_01-e.pdf (English)
http://www.tepco.co.jp/nu/fukushima-np/images/handouts_120924_02-j.pdf (Japanese)

The route I think is best (correct me if I am wrong, please) is to use the pdf2svg program to create an SVG copy of each page.
I have managed to do this, but the SVG appears to have
a) inline images (which display OK when I double-click the SVG file to display it)
b) none of the text I was expecting to find.

Ideally I would like to extract the table (see the second, Japanese PDF example), or at least read the figures as text.
I have figured out how to translate strings (French, Spanish, Italian) with the Microsoft translator, so I assume that if I can extract the Japanese text, I will be able to do the same for the sections of text too.

1) Am I being naive in thinking this can be done easily?
2) Are all the SVG files (as XML data) rendered as images, points, etc.?
3) Can the inline images be base64-decoded and displayed as images (or saved as PNG files) by Python using PIL?
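
Questions 2 and 3 can be checked directly against an SVG file. The following is a minimal sketch, not a tested recipe for the TEPCO files: it assumes the inline images are stored as `data:image/...;base64,...` URIs on `<image>` elements, which is the usual SVG convention. The function names `count_tags` and `extract_inline_images` are made up for this illustration. Counting tags shows whether a page contains any `<text>` elements at all or only paths and images.

```python
import base64
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Attribute name ElementTree reports for xlink:href.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Assumed form of an embedded image: data:image/<kind>;base64,<payload>
DATA_URI = re.compile(
    r"^data:image/(?P<kind>[\w.+-]+);base64,(?P<payload>.*)$", re.DOTALL
)


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_tags(svg_data):
    """Tally element names, e.g. to see whether any <text> elements exist."""
    root = ET.fromstring(svg_data)
    return Counter(local_name(el.tag) for el in root.iter())


def extract_inline_images(svg_data):
    """Return a list of (image_kind, raw_bytes) for base64 <image> elements."""
    root = ET.fromstring(svg_data)
    images = []
    for el in root.iter():
        if local_name(el.tag) != "image":
            continue
        href = el.get(XLINK_HREF) or el.get("href") or ""
        match = DATA_URI.match(href.strip())
        if match:
            raw = base64.b64decode(match.group("payload"))
            images.append((match.group("kind"), raw))
    return images
```

The decoded bytes can be written straight to a file opened in binary mode (for example `image_0.png`), or wrapped in `io.BytesIO` and handed to PIL's `Image.open` if the image needs to be displayed or converted. If `count_tags` reports no `text` entries, the text was most likely converted to glyph outlines, and there is nothing textual left to extract from the SVG.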

OS is Debian Squeeze; not really interested in Windows at the moment. :-)

Thanks for any help you can give me,
Paul

  #2  
September 28th, 2012, 11:18 AM
b49P23TIvg
I followed the incident at
http://www.houseoffoust.com/fukushima/

Hmm. The TEPCO release (document handouts_120924_02-j.pdf, not the radioactive isotopes) won't even open with evince, and whatever program runs when I "click it" shows graphs only, no text. xpdf dumped core.


pdftotext produced the following (which to me looks like a mix of Chinese and Japanese):

...
福島第一原子力発電所3号機使用済燃料プール内への
鉄骨滑落事象に関する周辺環境等への影響確認結果

3号機原子炉建屋上部での瓦礫撤去工事において、使用済燃料プール内に鉄骨1本が滑落する
事象が発生したことから、周辺環境等への影響の有無について、関連するデータを確認した。
1.使用済燃料プール周辺での雰囲気線量
クレーンで吊り下げた線量計を使用済燃料プール上空(9月22日は西側から約3.5m,
南側から約4.5m)で水面から高さ約2mの位置に設置し、原子炉建屋上部の雰囲気線量を
測定したところ、事象発生前後で有意な変化は確認されなかった。
(単位:mSv/h)
月日
...


Good luck. I think your project is possible.
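
For anyone wanting to drive pdftotext from Python, here is a minimal sketch. It assumes the `pdftotext` command-line tool (for example from poppler-utils on Debian) is installed and on the PATH; the post above does not say which build was used. The helper names `build_pdftotext_cmd` and `pdf_to_text` are made up for this example.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, layout=True):
    """Build the argument list for pdftotext, writing to standard output."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # Try to keep the physical layout, which helps with tables.
        cmd.append("-layout")
    # "-" as the output file name means standard output.
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a str.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the tool is not installed.
    """
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Decode leniently so one bad byte does not lose the whole page.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `pdf_to_text("handouts_120924_02-j.pdf")` on a locally saved copy should return text like the excerpt above, provided the installed pdftotext copes with the file.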
__________________
[code]Code tags[/code] are essential for python code!

  #3  
September 28th, 2012, 10:50 PM
ocpaul20
A number of these TEPCO handout PDFs give "bad object #" messages when they are run through some of the PDF-to-text programs I have tried, so I guess TEPCO generates them with something that may not produce standard objects or standard PDF output.

Coupled with the fact that some of the text will probably be in Japanese multibyte characters, that would tax most PDF-to-text programs, I guess. This is why I thought it might be better to go through the SVG format, but now I am likely to run into problems extracting the text from there too.
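
One way to sanity-check whether an extraction step preserved the Japanese text is to tally which scripts the output contains. The sketch below relies only on Python's standard `unicodedata` module; the category labels and the function names `script_of` and `count_scripts` are made up for this illustration. Kanji fall in Unicode's "CJK Unified Ideographs", which are shared with Chinese, so it is the presence of hiragana and katakana that marks the text as Japanese.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Classify one character by the start of its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if ord(ch) < 128 and ch.isalnum():
        return "ascii"
    return "other"


def count_scripts(text):
    """Count characters per script, ignoring whitespace."""
    return Counter(script_of(ch) for ch in text if not ch.isspace())
```

A result dominated by "other", or containing no kana at all for a page known to be Japanese, would suggest the extractor mangled the encoding.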

If it is possible, I wonder whether I will find out how before I give up on the project altogether!

On edit: I have found a paid service which may work better than messing about too much myself. The cost is not high and I will not be using it much either, so it may be what I am looking for. For 100 A4 pages the cost, as at 29 Sep 12, is 10 dollars (used within 90 days) via their API.
