#1
    Parsing PDF Document


    Hi,

    I am looking to parse a PDF document. The PDF document also has a table from which the data needs to be parsed.

    Has anyone done this?

    I know there are some modules, but I would appreciate it if someone could provide sample code.

    Regards
#2
    There are many PDF modules on CPAN.

    I have used CAM::PDF a couple of times, but I have no idea whether it is the best, simplest, most powerful, or best fit for what you want to do. Look at the modules' documentation to figure out which one fits your purpose best.
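
    Since sample code was asked for, here is a minimal sketch of text extraction with CAM::PDF, assuming the PDF contains real text rather than scanned images. The calls `new`, `numPages` and `getPageText` are CAM::PDF's own; the helper names `pdf_page_texts` and `rows_from_text` are made up for this example, and splitting rows on wide gaps is only a heuristic that works if the column spacing survives extraction, which depends on how the PDF was produced.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the extracted text of every page in a PDF, one string per page.
# CAM::PDF is loaded lazily, so rows_from_text() below works without it.
sub pdf_page_texts {
    my ($file) = @_;
    require CAM::PDF;
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open or parse $file\n";
    return map { $pdf->getPageText($_) } 1 .. $pdf->numPages();
}

# Hypothetical helper (not part of CAM::PDF): split text into table rows,
# treating a tab or a run of two or more whitespace characters as a
# column separator. Lines with fewer than two cells are skipped.
sub rows_from_text {
    my ($text) = @_;
    my @rows;
    for my $line (split /\r?\n/, $text) {
        $line =~ s/^\s+|\s+$//g;          # trim both ends
        next if $line eq '';
        my @cells = split /\s*\t\s*|\s{2,}/, $line;
        push @rows, \@cells if @cells > 1;
    }
    return @rows;
}

# Usage: perl script.pl file.pdf  (prints each table-like row tab-separated)
if (@ARGV) {
    for my $text (pdf_page_texts($ARGV[0])) {
        next unless defined $text;
        print join("\t", @$_), "\n" for rows_from_text($text);
    }
}
```

    If the table cells are separated by single spaces only, this heuristic will not find the columns, and you would need to work from the positions of the text on the page instead.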
#3
    Thanks, I have been looking into that, but it is not able to parse the PDF document I have. All of the output comes back in a binary-looking format.

    If anyone has more info, I would appreciate it.
#4
    PDFs are tough. There are many things that fall under the PDF::API2 collection of modules, plus a couple of standalone things, but with so many different ways to create PDFs from so many different applications, it can become frustrating, to say the least.

    This link has some info, but not really answers.

    The best bet IMO is still to OCR it. There are a lot of Windows OCR programs that do a pretty good job; not so much in the open source world.

    I think tesseract is still the best choice, but it often requires training (if you take the time to train it, it works really well). Then you can use something like PDF::OCR::Thorough to get the text (but it also just uses tesseract, so you might still have to train it).
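
    Before going the OCR route, it may be worth checking whether the text you already extracted is usable. A rough sketch, in plain Perl with no PDF module involved: count how much of the extracted string is printable ASCII. The sub names and the 0.8 threshold are my own assumptions, and legitimate non-ASCII text will also score low, so treat the result as a hint only.

```perl
use strict;
use warnings;

# Fraction of characters that are printable ASCII or common whitespace.
# Returns 0 for undefined or empty input.
sub printable_ratio {
    my ($text) = @_;
    return 0 unless defined $text && length $text;
    my $printable = () = $text =~ /[\x20-\x7E\t\r\n]/g;   # count matches
    return $printable / length $text;
}

# Hypothetical decision helper: true (1) when the extracted text looks
# like binary garbage and OCR is probably the better option.
sub needs_ocr {
    my ($text, $threshold) = @_;
    $threshold = 0.8 unless defined $threshold;   # assumed cutoff
    return printable_ratio($text) < $threshold ? 1 : 0;
}
```

    You would call `needs_ocr($page_text)` on each page's extracted text and only send the pages that return 1 through tesseract.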
#5
    Originally Posted by kshahborr1
    Hi,

    I am looking to parse a PDF document. The PDF document also has a table from which the data needs to be parsed.

    Has anyone done this?

    I know there are some modules, but I would appreciate it if someone could provide sample code.

    Regards
    We had a similar requirement in the past for getting at the objects inside PDF files, and ended up using a commercial toolkit named LEADTOOLS. This toolkit contains methods to get the objects inside a PDF file. I think the method is PDFDocument.ParsePages(). You can check their website for more information about this method.
