#1
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Mar 2008
    Posts
    1,927
    Rep Power
    378

    Extracting info from PDFs


    I'm thinking about automating the extraction of information from CAD drawings saved as PDFs. The drawings have title blocks with information about project, drawing title, scale, revisions, etc., and the drawings can come in various sizes - A4, A3, etc.
    I'm not interested in the content of the drawing per se, just the title block.
    The design of the title block can be amended if necessary, and I can generate the PDFs in a variety of ways, so including custom metadata may be an option too.

    Before I go any further, does anyone have any initial thoughts on this kind of thing?
  2. #2
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,300
    Rep Power
    7170
    One approach would be to see if you can find a Linux command line utility that's capable of dumping a PDF file into a text document, then parse the text.

    I think there are PHP libraries that can read PDFs natively too, but it's been a while since I've used them and I can't remember their names. TCPDF might be capable of doing it. I know that for one project I used a PDF library that was able to read and merge PDFs, although I'm not sure if it provided an interface for extracting text.
  4. #3
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Mar 2008
    Posts
    1,927
    Rep Power
    378
    Hi - sorry for taking so long to respond - I missed your reply somehow.

    I'm using PHP on Windows. The utility will only be hosted locally, so there should be no issue with that.

    Here's where I've got so far:

    I carefully restructured the title block so that the data that is extracted in PHP (using pdftotext) is fairly 'well-formed'. This works on a 'blank' drawing, i.e. one that only has title block content and no other text content.
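
    As an illustration of what 'well-formed' could mean here, the sketch below parses extracted text into fields. It is written in Python rather than PHP, and it assumes a hypothetical layout in which each title block entry comes out as one `Label: value` line; the actual layout of the title block is not described in this thread, and `parse_title_block` is an illustrative name.

```python
def parse_title_block(text):
    """Turn 'Label: value' lines into a dict (hypothetical title block layout)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as '1:50' stay intact.
        label, separator, value = line.partition(":")
        if separator and label.strip():
            fields[label.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    sample = "Project: Town Hall\nScale: 1:50\n\nRev: B\n"
    print(parse_title_block(sample))
```

    Lines without a colon are skipped, so stray text only pollutes the result if it happens to look like a label.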

    The problem arises when other text is introduced to the drawing. pdftotext cannot distinguish between drawing content and title block content, so it just mixes it all together.

    One idea would be to use, for drawing content, only fonts that end up in the PDF as vector outlines rather than as extractable text. However, this would be incredibly difficult to control, as text data comes from all over the place.

    So, another idea is to confine the 'scan area' to the bottom right hand corner of the page. The scan area would always be a rectangle measuring 1/8th of the width of the page by 2/5ths of its height.
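
    The scan-area arithmetic can be sketched as follows (in Python rather than PHP). The assumptions are mine: the origin is taken to be the top-left corner of the page with y increasing downwards, sizes are in whatever unit the extraction tool expects, and `CropBox` and `title_block_box` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropBox:
    x: int       # left edge, measured from the left of the page
    y: int       # top edge, measured from the top of the page
    width: int
    height: int


def title_block_box(page_width, page_height,
                    width_fraction=1 / 8, height_fraction=2 / 5):
    """Rectangle in the bottom right-hand corner of the page.

    Defaults follow the proportions above: 1/8 of the page width
    by 2/5 of the page height.
    """
    width = round(page_width * width_fraction)
    height = round(page_height * height_fraction)
    return CropBox(x=round(page_width) - width,
                   y=round(page_height) - height,
                   width=width,
                   height=height)


if __name__ == "__main__":
    # A3 landscape is roughly 1191 x 842 points (1 point = 1/72 inch).
    print(title_block_box(1191, 842))
```

    Because the box is derived from the page size, the same function covers A4, A3 and any other sheet size, provided the title block keeps the same proportions.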

    I suspect that this can be done with PDFLib but I'm not sure how - but perhaps this question is better directed to a dedicated PDFLib forum (if such a thing exists)?
  6. #4
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Mar 2008
    Posts
    1,927
    Rep Power
    378
    Just following up on this for the benefit of others...

    I discovered that Poppler's version of pdftotext (derived from xpdf's tool) lets the user specify the coordinates of the area to extract, via its -x, -y, -W and -H options.

    With some careful reorganisation of the title block, it works like a charm!
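
    For anyone wanting a starting point, here is a rough sketch of calling Poppler's pdftotext with a crop area. It is in Python rather than the PHP used for the actual utility, and it is not the exact command from this thread. It assumes pdftotext is on the PATH; the crop values are in pixels at the resolution given by -r (72 DPI by default, so 1 pixel = 1 point), measured from the top-left corner of the page. `pdftotext_command` and `extract_region` are illustrative names.

```python
import subprocess


def pdftotext_command(pdf_path, x, y, width, height, page=1, resolution=72):
    """Build the argument list for Poppler's pdftotext with a crop area."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),   # first and last page to convert
        "-r", str(resolution),              # DPI used for the crop coordinates
        "-x", str(x), "-y", str(y),         # top-left corner of the crop area
        "-W", str(width), "-H", str(height),
        "-layout",                          # keep the physical text layout
        pdf_path,
        "-",                                # write the text to stdout
    ]


def extract_region(pdf_path, x, y, width, height, **options):
    """Run pdftotext and return the text found inside the crop area."""
    result = subprocess.run(
        pdftotext_command(pdf_path, x, y, width, height, **options),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # 'drawing.pdf' is a placeholder file name; the box is for A3 landscape.
    print(extract_region("drawing.pdf", 1042, 505, 149, 337))
```

    Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, which matters for uploaded files.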

    Now to implement some kind of drag-drop-upload feature, but I guess that kind of thing's been discussed already...
