#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0

    Need to capture data between fixed words


    Any help would be appreciated:

    I have the text at the bottom of the post. I need to capture the text between the word 'Summary\r\n' and 'Attachments\r\n'. I do not want 'Summary' or 'Attachments'; I only want the text in between.

    I also need the text between 'Attachments\r\n' and 'Prepared By' and, again, I don't want to include the tags at the beginning and end.

    I have these two expressions, but they won't drop the trailing search patterns. I don't know how to do that, and I have not had any luck searching tutorials and examples.

    (?i)(?s)(?<=Summary\r\n)(.*?)(?:Attachments\r\n)
    (?i)(?s)(?<=Attachments\r\n)(.*?)(?:Prepared by)

    The other thing is I think this could be done with one expression and I could have 2 named groups with the contents, but again, I am not sure how to do that with 'Attachments' being the end of the first and the beginning of the second condition. That would be the best result for me if I could just have 2 groups, either named or 1 and 2.

    I will have many of these reports to scan and the tags should be the same, but there could be page breaks. I want carriage returns, but page breaks should be stripped.

    Text:
    Summary
    A status report on the Transportation Enhancement Program of Projects is
    provided for Board of Directors’ review. Staff recommends approval of the use of
    $434,000 in available FTA 5307 funding for the City of Fullerton’s
    Bastanchury/Valencia Mesa Bike Path Project in order to make other federal
    funds available for the City of Costa Mesa’s Downtown Costa Mesa Gateway
    Improvement Project.
    Attachments
    A. 2010 Transportation Enhancement Program Report
    B. Letter from David Schickling – Assistant City Engineer – City of Fullerton –
    dated January 14, 2013 – 2010 Transportation Enhancement Program
    Funding Exchange
    Prepared by:
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,961
    Rep Power
    9397
    For both, $1 will be just the middle text and without the trailing search string. You're probably looking at $0.

    But yes, it can be done with just one expression.
    Code:
    (?is)(?:Summary\r\n)(.*?)(?:Attachments\r\n)(.*?)(?:Prepared by:\r\n)
    Look at $1 and $2.

    As for page breaks, there is no "page break" character so those must be managed by some other mechanism. Are you sure they're even a problem?
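A minimal sketch of the single-expression approach with two named groups, written in Python's `re` module because the thread does not say which regex engine is in use. The group names `summary` and `attachments` and the abbreviated sample text are illustrative assumptions, not part of the original posts.

```python
import re

# Abbreviated sample with CRLF line endings, modeled on the report text above.
TEXT = (
    "Summary\r\n"
    "A status report on the Transportation Enhancement Program.\r\n"
    "Attachments\r\n"
    "A. 2010 Transportation Enhancement Program Report\r\n"
    "Prepared by:\r\n"
)

# One expression: "Attachments" closes the first group and opens the second.
# The group names are hypothetical; numbered groups 1 and 2 work the same way.
PATTERN = re.compile(
    r"(?is)Summary\r\n(?P<summary>.*?)Attachments\r\n(?P<attachments>.*?)Prepared by:\r\n"
)

match = PATTERN.search(TEXT)
summary = match.group("summary")
attachments = match.group("attachments")
print(repr(summary))
print(repr(attachments))
```

The whole match (`match.group(0)`) still contains the tags; only the groups hold the text in between.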
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0
    Thanks. This worked in my Regex tester. Much appreciated.

    As for the page breaks, these are PDF files. I have no idea how Adobe codes a page break. Maybe I need to worry about it, maybe I don't. Guess I'll find out.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    22
    Rep Power
    0
    Requinix's suggestion looks good to me--although I didn't test.

    You may want to consider adding ^ and $ to keep it from matching if summary or attachments happen to show up in the text you want to capture.

    Code:
    (?is)(?:^Summary\r\n$)(.*?)(?:^Attachments\r\n$)(.*?)(?:^Prepared by:\r\n)
  8. #5
  9. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,961
    Rep Power
    9397
    Good idea. Be sure to include the /m flag too.
    Code:
    (?ims)...
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0
    Thanks for the reply, but for reasons unknown, the ^ and $ did not make the matches. However, looking for the line feeds should keep me pretty unique. These documents are machine generated so I think it's almost impossible to get those line feeds here except at the headings.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0
    Originally Posted by requinix
    Good idea. Be sure to include the /m flag too.
    Code:
    (?ims)...
    Ok, I did that. What will that do for my expression?
  14. #8
  15. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,961
    Rep Power
    9397
    That flag is just for the ^s and $s that acray suggested. You'll need it for those to work.
    Might have to move the $s before the newlines too.
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0

    Quirky but works


    Just so we all learn something new today

    This works, with the '$' between the \r and the \n. Nothing else worked. It didn't work with the '$' before the \r or after the \n.

    (?ims)(?:^Summary\r$\n)(.*?)(?:^Attachments\r$\n)

    Interestingly, I played a little more and found I really don't seem to need the \n; the \r alone matches my expression. This surprised me. So \r$ works fine.

    So just out of curiosity, what do the ^ and $ really buy me in this scenario? Well, I mean, yeah, I know the '^' is beginning of line and the '$' is end of line. But in my case, looking for the end of line, I'm not sure I need the '$'. The '^' might help remove ambiguity, though.
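One possible explanation for the quirk, sketched in Python: in Python's `re` module, `$` in multiline mode matches only at the end of the string or just before a `\n`, and `\r` is not treated as a line terminator. The engine used in this thread is not stated, so treating it as behaving the same way is an assumption; the sample text is also made up for illustration.

```python
import re

# CRLF-terminated sample text (hypothetical, abbreviated).
TEXT = "Summary\r\nBody text.\r\nAttachments\r\nA. Report\r\n"


def matches(pattern: str) -> bool:
    """Return True if the pattern finds a match in the CRLF sample text."""
    return re.search(pattern, TEXT) is not None


# '$' between \r and \n: the next character is \n, so '$' can match there.
dollar_between = matches(r"(?ims)^Summary\r$\n(.*?)^Attachments\r$\n")

# '$' before \r: the next character is \r, which is not a line end for '$'.
dollar_before_cr = matches(r"(?ims)^Summary$\r\n(.*?)^Attachments$\r\n")

# '$' after \n: the next character is ordinary text on the following line.
dollar_after_lf = matches(r"(?ims)^Summary\r\n$(.*?)^Attachments\r\n$")

# '\r$' with no \n: still matches, but the \n then lands inside the capture.
cr_dollar_only = matches(r"(?ims)^Summary\r$(.*?)^Attachments\r$")

print(dollar_between, dollar_before_cr, dollar_after_lf, cr_dollar_only)
```

Under that assumption the text really does use `\r\n` line endings, and `\r$` works only because `$` is a zero-width check for the `\n` that follows.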
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    22
    Rep Power
    0
    That would seem to imply your source file ends lines in just \r, rather than \r\n. But that would also imply that \r\n wouldn't have matched in the original expression. So unless there is some quirk with the interpreter... best not to think about it too hard.

    The ^ and $ would just be there to guard against unexpected input.

    For example the following text may not work correctly without the ^ and $

    Code:
    Summary
    A status report on the Transportation Enhancement Program of Projects is
    provided for Board of Directors’ review. Supervisor asked for a new summary
    of the project. Staff recommends approval of the use of
    $434,000 in available FTA 5307 funding for the City of Fullerton’s
    Bastanchury/Valencia Mesa Bike Path Project in order to make other federal
    funds available for the City of Costa Mesa’s Downtown Costa Mesa Gateway
    Improvement Project.
    Attachments
    A. 2010 Transportation Enhancement Program Report
    B. Letter from David Schickling – Assistant City Engineer – City of Fullerton –
    dated January 14, 2013 – 2010 Transportation Enhancement Program
    Funding Exchange
    Prepared by:
    Edit: Hard to tell by looking, but I added the second sentence in the Summary section. There is a line break after "summary"

    It may be unlikely, but if the text is based on human input it may come up, e.g. someone using one of your keywords in their text and hitting a random <enter>, or being the victim of an unfortunate auto line wrap inserting a new line.
    (Unless the source text has been sanitized to remove all line breaks before generating this text.)

    Also worth noting: without sanitized text, the match can still fail if someone typed summary or attachments alone on a single line. But that is even more unlikely.
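A rough Python sketch of what the anchors guard against. In this made-up sample the word "attachments" happens to end a line inside the summary, which is the case where an unanchored lazy match stops too early. The sample text and variable names are assumptions for illustration, and the `\r$\n` form follows the workaround found earlier in the thread.

```python
import re

# Summary text in which the word "attachments" happens to end a line.
TEXT = (
    "Summary\r\n"
    "Staff reviewed the report and its attachments\r\n"
    "before recommending approval.\r\n"
    "Attachments\r\n"
    "A. 2010 Transportation Enhancement Program Report\r\n"
    "Prepared by:\r\n"
)

# Without anchors, the case-insensitive tag matches the stray word mid-line.
UNANCHORED = re.compile(r"(?is)Summary\r\n(.*?)Attachments\r\n(.*?)Prepared by:\r\n")

# With ^ (and the m flag), each tag has to start its own line.
ANCHORED = re.compile(
    r"(?ims)^Summary\r$\n(.*?)^Attachments\r$\n(.*?)^Prepared by:\r$\n"
)

loose_summary = UNANCHORED.search(TEXT).group(1)
strict_summary = ANCHORED.search(TEXT).group(1)
print(repr(loose_summary))   # cut short at the stray "attachments"
print(repr(strict_summary))  # runs up to the real heading
```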
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    22
    Rep Power
    0
    Originally Posted by TomKattt
    As for the page breaks, these are PDF files. I have no idea how Adobe codes a page break. Maybe I need to worry about it, maybe I don't. Guess I'll find out.
    I was thinking about this. PDFs are--to oversimplify--just beefed-up PostScript files.

    Depending on how your PDFs were created, there may not actually be any page breaks in the file at all. It could be just one block of text the viewer splits into different pages when rendering.

    Or it could be that the page breaks are there in the "visible" part of the document and you are searching a "plain text(ish)" version that can be embedded into PDFs.

    Or perhaps most likely the regex interpreter that can read PDFs knows to skip all binary/non-ASCII data in the PDF.

    I'm sure I could come up with a few more theories... but if it works on a few samples with the apparent page breaks you were talking about, I probably wouldn't worry about it too much.
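If page breaks do turn out to matter, one convention worth checking is the form feed character (`\f`, U+000C), which some text extractors such as `pdftotext` emit between pages. Whether the capture software in this thread does the same is an assumption. A minimal Python sketch that strips form feeds while leaving carriage returns and line feeds alone (the function name is hypothetical):

```python
def strip_page_breaks(text: str) -> str:
    """Remove form feed characters (U+000C) while leaving CR/LF line endings intact."""
    return text.replace("\f", "")


cleaned = strip_page_breaks("Summary\r\nfirst page\r\n\fsecond page\r\n")
print(repr(cleaned))
```

Running this before the regex keeps a page boundary from landing in the middle of a captured group.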
  22. #12
  23. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0
    Thx. These are OCR'd PDF's. I have something that works, so I'm happy with the help I have gotten here.
  24. #13
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    22
    Rep Power
    0
    If they are OCR'd then it wouldn't be the first theory I came up with; it could be either of the second two. (OCRing embeds the "plain text(ish)" copy of the document to the scanned image.)

    I may be stating the obvious, but the real reason for this post is a word of caution. Since you're running OCR, I'm assuming they are printed copies getting scanned in. If the scanned image is off in some way (smudge, random error in OCR, etc.), it could mess up your input.

    If you're trying to make a process that doesn't require human review, you'll want to add some error-catching code after you get your regexp results to flag any failures. The simplest would be to make sure $1 and $2 are non-empty.

    Glad you got it working!
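The non-empty check suggested above could look something like this in Python. The function name `extract_fields` and the choice to return `None` for anything that needs human review are assumptions for illustration, not something from the thread.

```python
import re
from typing import Optional, Tuple

PATTERN = re.compile(r"(?is)Summary\r\n(.*?)Attachments\r\n(.*?)Prepared by:\r\n")


def extract_fields(text: str) -> Optional[Tuple[str, str]]:
    """Return (summary, attachments), or None if the report should be flagged for review."""
    match = PATTERN.search(text)
    if match is None:
        # A tag was missing or misread, e.g. an OCR error in "Summary".
        return None
    summary = match.group(1).strip()
    attachments = match.group(2).strip()
    if not summary or not attachments:
        # Tags were found but one of the sections came back empty.
        return None
    return summary, attachments


good = extract_fields("Summary\r\nabc\r\nAttachments\r\nA. x\r\nPrepared by:\r\n")
misread = extract_fields("Sumrnary\r\nabc\r\nAttachments\r\nA. x\r\nPrepared by:\r\n")
print(good, misread)
```

A caller can route any `None` result to a manual review queue instead of writing empty metadata.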
  26. #14
  27. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    7
    Rep Power
    0

    Yes It's Working, Thanks for all the great help


    Thanks, everyone.

    Just FYI, the PDFs are machine generated from database tables. The software I am using is capture software designed to process scanned images, PDFs, and other documents and route them elsewhere. I am using the capture software to take these reports, extract the Summary and Attachments info for metadata, and put these into a SharePoint site.

    This morning I managed to get that capture software to successfully scan the document and capture the fields, so that's one big obstacle out of my way. Next I get to make the migration into SharePoint work, and my coworker is hitting his head against the wall trying to do that with his project. Hopefully he'll pave the path for me.

    I have some REGEX experience but obviously hit my level of incompetence with these ones so thanks, once again.
