March 26th, 2013, 05:59 PM
Parsing PDF Document
I am looking to parse a PDF document. The PDF document also has a table from which the data needs to be parsed.
Has anyone done this?
I know there are some modules, but I would appreciate it if someone could provide sample code.
March 27th, 2013, 02:26 AM
There are many PDF modules on the CPAN.
I have used CAM::PDF a couple of times, but I have no idea whether it is the best, simplest, most powerful or best fit for what you want to do. Look at the documentation of each module to figure out which one fits your purpose best.
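Since sample code was requested, here is a minimal sketch of what text extraction with CAM::PDF might look like. It uses the module's new(), numPages() and getPageText() methods. The two helper subs (split_table_row and pdf_page_texts) are hypothetical names made up for this example, and the rule that table columns are separated by tabs or by two or more spaces is only an assumption; the text layout that comes out of a given PDF may need a different rule. It also only works when the PDF actually contains extractable text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one line of extracted text into table cells.
# Assumption: columns are separated by tabs or by runs of two or
# more spaces; real PDFs may need a different rule.
sub split_table_row {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;             # trim both ends
    return split /\t+|\s{2,}/, $line;
}

# Return the extracted text of every page of a PDF, one string per page.
sub pdf_page_texts {
    my ($file) = @_;
    require CAM::PDF;                     # CPAN module, loaded only when needed
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file as a PDF\n";
    return map { $pdf->getPageText($_) } 1 .. $pdf->numPages();
}

# Usage: perl parse_pdf.pl document.pdf
if (@ARGV) {
    for my $text (pdf_page_texts($ARGV[0])) {
        next unless defined $text;
        for my $line (split /\n/, $text) {
            my @cells = split_table_row($line);
            # Print only lines that look like table rows (more than one cell).
            print join(' | ', @cells), "\n" if @cells > 1;
        }
    }
}
```

If getPageText() returns nothing useful, the page is probably a scanned image or uses an encoding the module cannot decode, and this approach will not help.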
March 28th, 2013, 05:57 PM
Thanks - I have been looking into that, but it is not able to parse the PDF document I have. It returns all the content in what looks like a binary format.
If anyone has more info, I would appreciate it.
March 31st, 2013, 09:49 AM
PDFs are tough. There are many things that fall under the PDF::API2 collection of modules, plus a couple of standalone things, but with so many different ways to create PDFs from so many different applications it can become frustrating, to say the least.
this link has some info but not really answers
The best bet IMO is still to OCR it. There are a lot of Windows OCR programs that do a pretty good job; not so much in the open source world.
I think tesseract is still the best choice, but it often requires training (if you take the time to train it, it works really well). Then you can use something like PDF::OCR::Thorough to get the text (but it also just uses tesseract, so you might still have to train).
November 14th, 2013, 08:06 AM
We had a similar requirement in the past for getting the objects inside PDF files, and ended up using a commercial toolkit named LEADTOOLS. This toolkit contains methods to get the objects inside a PDF file. I think the method is PDFDocument.ParsePages(). You can check their website for more information about this method.
Originally Posted by kshahborr1