July 18th, 2012, 11:10 AM
Parsing .docx and .pdf Academic Documents for Citations
I'm new to this forum so please move this if there's a better subforum for it.
I would like to use Python to look through a Word or PDF document and output all the citations it finds. My goal is to use this in academic research. Ideally I'd be able to then sift through this data from multiple documents to find, for example, what books are commonly cited.
As a starting point, I'd simply like to run a Python script on a document and have it output all citations to a file. I'm thinking a .csv, but I'm open to suggestions. It would have to find citations in footnotes, endnotes, or parenthetical in-text citations (so as to accommodate the main citation standards such as Turabian, MLA, and APA). Once I have a rough script up and running, I'd then have to fine-tune it for things like looking at the previous citation to determine what an "Ibid." refers to, or checking against the bibliography in that document to find what a short title or author's name refers to.
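To make the idea concrete, here is a rough sketch of the "find parenthetical citations and dump them to a .csv" step, working on text that has already been extracted from the document. The pattern is only an assumption about what a citation looks like (a parenthetical that starts with a capital letter and ends in a digit), and the function names and column headers are made up for illustration. It will miss some styles and pick up some false positives.

```python
import csv
import re

# Assumption: an in-text citation is a parenthetical that starts with a
# capital letter and ends in a digit, e.g. "(Smith 2010, 45)" or "(Jones 12)".
PAREN_CITATION = re.compile(r"\(([A-Z][^()]*\d)\)")


def find_parenthetical_citations(text):
    """Return the inner text of every parenthetical that looks like a citation."""
    return [match.group(1) for match in PAREN_CITATION.finditer(text)]


def write_citations_csv(citations, source_name, out_file):
    """Write one row per citation to an already-open text file object."""
    writer = csv.writer(out_file)
    writer.writerow(["source", "citation"])  # hypothetical column names
    for citation in citations:
        writer.writerow([source_name, citation])
```

Something like `with open("citations.csv", "w", newline="") as f: write_citations_csv(found, "paper.txt", f)` would write the file. Keeping the source name in each row should make it easier to combine results from several documents later.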
For now, I'm simply looking for ways to search a Word or PDF file for certain patterns (such as numbers within a footnote and numbers within parentheses). Any suggestions on how to go about this?
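For the Word side, one possible approach uses only the standard library: a .docx file is a zip archive, and Word conventionally stores footnote text in `word/footnotes.xml` (endnotes in `word/endnotes.xml`). The sketch below assumes those conventional part names; `read_docx_notes` and its `kind` parameter are hypothetical names for illustration, and older binary .doc files are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:footnote and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _w(name):
    """Build a namespace-qualified tag or attribute name for ElementTree."""
    return "{%s}%s" % (W_NS, name)


def read_docx_notes(docx_file, kind="footnote"):
    """Return {note_id: text} for the footnotes (or endnotes) in a .docx.

    docx_file may be a path or a binary file-like object.
    kind is "footnote" or "endnote".
    """
    part = "word/%ss.xml" % kind  # assumed conventional part name
    with zipfile.ZipFile(docx_file) as archive:
        if part not in archive.namelist():
            return {}  # the document has no notes of this kind
        root = ET.fromstring(archive.read(part))

    notes = {}
    for note in root.iter(_w(kind)):
        # Skip separator entries, which carry no note text.
        if note.get(_w("type"), "normal") != "normal":
            continue
        text = "".join(t.text or "" for t in note.iter(_w("t"))).strip()
        if text:
            notes[int(note.get(_w("id")))] = text
    return notes
```

Because the result is keyed by note number, resolving an "Ibid." later could start from looking up the note with the preceding id.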
July 18th, 2012, 12:13 PM
I'd use the pdftotext program to extract the text from a PDF file (I haven't looked for a Microsoft Word-to-text program), then, depending on what I found, use gawk (or awk or nawk) to extract the citations.
And I'd leave Python out of the mix.