Python Programming
 
  #1  
Old July 18th, 2012, 10:10 AM
Qanthelas
Parsing .docx and .pdf Academic Documents for Citations

I'm new to this forum so please move this if there's a better subforum for it.

I would like to use Python to look through a Word or PDF document and output all the citations it finds. My goal is to use this in academic research. Ideally I'd be able to then sift through this data from multiple documents to find, for example, what books are commonly cited.

As a starting point, I'd simply like to run a Python script on a document and have it write all citations to a file. I'm thinking of a .csv, but I'm open to suggestions. It would have to find citations in footnotes, endnotes, or parenthetical in-text citations (so as to accommodate the main citation standards such as Turabian, MLA, and APA). Once I have a rough script up and running, I'd then have to fine-tune it for things like looking at the previous citation to determine what an "Ibid." refers to, or checking against the bibliography in that document to find what a short title or author's name refers to.
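The "Ibid." step could be sketched roughly as follows. This is only an illustration, assuming the footnotes have already been extracted as a list of strings in document order and that a page locator appears as a trailing number or range; the function names and the sample notes are made up for the example.

```python
import re

# "Ibid." optionally followed by a page locator, e.g. "Ibid., 45."
IBID_RE = re.compile(
    r"^\s*ibid\.?(?:,\s*(?P<locator>\d+(?:[-–]\d+)?))?\.?\s*$",
    re.IGNORECASE,
)

# A full note, optionally ending in ", <page>" or ", <page>-<page>".
NOTE_RE = re.compile(
    r"^(?P<source>.*?)(?:,\s*(?P<locator>\d+(?:[-–]\d+)?))?\.?\s*$",
    re.DOTALL,
)


def split_locator(note):
    """Split a note into (source, locator); locator is None if absent."""
    match = NOTE_RE.match(note)
    return match.group("source").strip(), match.group("locator")


def resolve_ibid(notes):
    """Replace each "Ibid." note with the source of the note before it."""
    resolved = []
    prev_source, prev_locator = None, None
    for note in notes:
        ibid = IBID_RE.match(note)
        if ibid and prev_source is not None:
            # A bare "Ibid." points at the same source and the same page.
            source = prev_source
            locator = ibid.group("locator") or prev_locator
        else:
            source, locator = split_locator(note)
        resolved.append((source, locator))
        prev_source, prev_locator = source, locator
    return resolved


# Hypothetical sample notes, invented for illustration.
notes = [
    "John Smith, A History of Rome (London: Example Press, 1999), 12.",
    "Ibid.",
    "Ibid., 45.",
    "Jane Doe, Greek Art, 7.",
    "Ibid., 9-11.",
]
for source, locator in resolve_ibid(notes):
    print(source, "|", locator)
```

Short titles and author-only references would need a second pass against the bibliography, which this sketch does not attempt.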

For now, I'm simply looking for ways to search a Word or PDF file for certain patterns (such as numbers within a footnote and numbers within parentheses). Any suggestions on how to go about this?
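As a rough illustration of the "numbers within parentheses" idea, here is a minimal sketch that assumes the document has already been converted to plain text. The pattern simply catches any parenthesized span containing at least one digit, so it will produce false positives (dates, list numbers) and would need tightening for each citation style; the function names and sample text are made up for the example.

```python
import csv
import io
import re

# Any parenthesized span, without nested parentheses, holding at least one digit.
PAREN_WITH_NUMBER = re.compile(r"\(([^()]*\d[^()]*)\)")


def find_parenthetical_citations(text):
    """Return one row per candidate citation, with its 1-based line number."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in PAREN_WITH_NUMBER.finditer(line):
            rows.append({"line": line_no, "citation": match.group(1).strip()})
    return rows


def write_csv(rows, handle):
    """Write the candidate citations to an open text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["line", "citation"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample text, invented for illustration.
sample = (
    "Rome fell slowly (Smith, 1999, p. 12).\n"
    "Others disagree (see above) but cite dates (Doe 45)."
)
rows = find_parenthetical_citations(sample)
buffer = io.StringIO()  # stand-in for open("citations.csv", "w", newline="")
write_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping the line number in each row makes it easier to check a match against the source document later.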

  #2  
Old July 18th, 2012, 11:13 AM
b49P23TIvg
I'd use the pdftotext program to extract the text from a PDF file (I haven't looked for a Microsoft Word to text program), then, depending on what I found, use gawk (or awk or nawk) to extract citations.

And I'd leave Python out of the mix.
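If Python does stay in the mix, the Word side may not need an external converter: a .docx file is a ZIP archive, and Word stores footnote text in the word/footnotes.xml part (endnotes live in word/endnotes.xml in the same way). The following is a minimal standard-library sketch under that assumption; the function name docx_footnotes is made up for the example, and it ignores formatting, fields, and anything other than plain text runs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the w: prefix in .docx parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_footnotes(path_or_file):
    """Return (footnote id, text) pairs from a .docx path or file object."""
    with zipfile.ZipFile(path_or_file) as archive:
        if "word/footnotes.xml" not in archive.namelist():
            return []  # the document has no footnotes part
        root = ET.fromstring(archive.read("word/footnotes.xml"))

    notes = []
    for footnote in root.iter(f"{{{W_NS}}}footnote"):
        # Separator placeholders carry a w:type attribute; skip them.
        if footnote.get(f"{{{W_NS}}}type") is not None:
            continue
        # Footnote text is split across w:t elements, one per run.
        text = "".join(t.text or "" for t in footnote.iter(f"{{{W_NS}}}t"))
        notes.append((footnote.get(f"{{{W_NS}}}id"), text.strip()))
    return notes
```

Calling docx_footnotes("paper.docx") on a real file (the file name here is only a placeholder) would give a list of footnote strings that can then be searched for citations in the same way as pdftotext output.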
__________________
[code]Code tags[/code] are essential for python code!
