#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    6
    Rep Power
    0

    Trouble extracting number portion after varying marker


    Hi,

    I'm trying to construct a regex that extracts only the letter/number portion (the passport number itself) from each of the following lines:

    Pass 1234567
    Pass, 11223344
    Pass: 1234567
    Passport # is HA12345678.
    Passport #: G7654321
    Passport: 1234567 (Nepal)
    Passport No 876543210
    Passport No.: 123456789
    Passport No: TG1234567
    Passport Number 1234567
    Passport Number - 5432198765
    passport number, AH123456789
    Passport Number: AB123456
    Passport/Travel Document Number: AZ0912345

    I'm only interested in capturing the letter/number part, but the number must be in close proximity to the Pass/Passport marker, because the email contains other numbers that could be mistaken for passport numbers.

    I'm using Microsoft VBScript Regular Expressions 5.5 from VBA in Outlook and Word 2010.

    Any help would be appreciated.
  2. #2
  3. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Location
    spaceBAR Central
    Posts
    229
    Rep Power
    42
    Try this:
    Code:
    ([A-Z]?[A-Z]?[0-9]+)
    
    Match the character " " literally
    Match the regular expression below and capture its match into backreference number 1
    Match a single character in the range between "A" and "Z"
       Between zero and one times, as many times as possible, giving back as needed (greedy)
    Match a single character in the range between "A" and "Z"
       Between zero and one times, as many times as possible, giving back as needed (greedy)
    Match a single character in the range between "0" and "9"
       Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    6
    Rep Power
    0

    Passport Tag ignored


    Hi,
    thanks for the reply. That's great for capturing the number but as I mentioned, it needs to be in close proximity to the Passport tag because the regex below will also match:

    Application No.: 12345678

    and that's not a passport number but it still matches.
    It needs to be in relation to the (varying) "Passport" tag.

    cheers


    Originally Posted by spacebar208
    Try this:
    Code:
    ([A-Z]?[A-Z]?[0-9]+)
    
    Match the character " " literally
    Match the regular expression below and capture its match into backreference number 1
    Match a single character in the range between "A" and "Z"
       Between zero and one times, as many times as possible, giving back as needed (greedy)
    Match a single character in the range between "A" and "Z"
       Between zero and one times, as many times as possible, giving back as needed (greedy)
    Match a single character in the range between "0" and "9"
       Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Location
    spaceBAR Central
    Posts
    229
    Rep Power
    42
    OK, try this:
    Code:
    [Pp]assport.+([A-Z]?[A-Z]?[0-9]+)
  8. #5
  9. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    Hi,

    that regex doesn't find the right passport numbers, and it's very inefficient.

    It will first search for [Pp]assport. Then it will read all the rest of the line (because "." by default matches any character except line breaks -- depending on the regex engine and modifiers). And then it tries to find any digit from the right, because a single digit is already enough to fulfil the last pattern. Since we're at the end of the line, the regex won't find a digit, so it will reduce the ".+" match by one character and try again. If it now finds a digit, that single digit will be the "passport number". If not, it will again and again reduce the ".+" match and try to find a digit, leading to a huge amount of backtracking (aka trial-and-error).

    Bottom line: The "." pattern is evil, because it's simply too unspecific. It can easily break the whole regex and also make it extremely inefficient.

    If you have to use it, be very, very careful. In this case, you want a non-greedy quantifier:

    Code:
    /pass.*? ([a-z]*\d+)/i
    Not pretty, but the input isn't pretty, either.
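
To make the difference concrete, here is a minimal Python sketch (Python is only a stand-in here; the thread itself targets VBScript Regular Expressions 5.5) that runs the greedy pattern from post #4 and the lazy pattern above against a few of the sample lines. The names `LINES`, `GREEDY`, `LAZY` and `first_group` are made up for this illustration.

```python
import re

# Sample lines taken from the first post in this thread.
LINES = [
    "Pass 1234567",
    "Passport # is HA12345678.",
    "Passport: 1234567 (Nepal)",
    "Passport No 876543210",
    "Passport No: TG1234567",
    "Passport/Travel Document Number: AZ0912345",
    "Application No.: 12345678",  # must NOT be reported as a passport number
]

# Greedy version from post #4: ".+" swallows the rest of the line first,
# then gives characters back until a single digit can match.
GREEDY = re.compile(r"[Pp]assport.+([A-Z]?[A-Z]?[0-9]+)")

# Lazy version from this post: ".*?" grows one character at a time.
LAZY = re.compile(r"pass.*? ([a-z]*\d+)", re.IGNORECASE)


def first_group(pattern, line):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(line)
    return m.group(1) if m else None


if __name__ == "__main__":
    for line in LINES:
        greedy = first_group(GREEDY, line)
        lazy = first_group(LAZY, line)
        print(f"{line!r:48} greedy={greedy!r:8} lazy={lazy!r}")
```

With the greedy pattern, group 1 ends up holding only the last digit of the number (and "Pass 1234567" is not matched at all, because the pattern requires the full word "passport"), while the lazy pattern captures the whole number and ignores the "Application No." line.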
    Last edited by Jacques1; June 16th, 2013 at 06:29 AM.
    The 6 worst sins of security • How to (properly) access a MySQL database with PHP

    Why can't I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    6
    Rep Power
    0

    what am I doing wrong?


    Hi again,
    I've tried these regexes in RegexBuddy, RegexMagic and a VBA program I've written exclusively for testing regexes, and I can't seem to get any results.

    I've got Global turned on and case insensitive and still nothing. Where are you people using regexes? What software are you using to test this?

    What am I missing?
  12. #7
  13. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    I have no idea what you've tried, but the regex I wrote down does exactly what it's supposed to do.

    The regex by spacebar208 is wrong (as I already said).
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    Originally Posted by badmagic00
    Hi again,
    I've tried these regexes in RegexBuddy, RegexMagic and a VBA program I've written exclusively for testing regexes, and I can't seem to get any results.

    I've got Global turned on and case insensitive and still nothing. Where are you people using regexes? What software are you using to test this?

    What am I missing?
    Please explain what you did, because Jacques1's solution should work in most (probably all) of the cases you have shown. If you did not get any of them right, then there must be something wrong in your testing procedure.

    In case you have doubts, here is one example in Perl, using the Perl debugger (which I use quite regularly to test my regexes):

    Perl Code:
      DB<1> $c = "Passport No 876543210";
     
      DB<2> print $1 if $c =~ /pass.*? ([a-z]*\d+)/i;
    876543210
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    6
    Rep Power
    0

    I am doing something wrong


    I was using RegexBuddy but had it set to JavaScript. It did work with another setting I tried (must have been JGsoft), but at work I've only got Microsoft VBScript Regular Expressions 5.5 under VBA.

    Apparently, I have no lookahead, lookbehind or other lookaround capabilities.

    Should I be setting RegexBuddy to JGsoft for testing rather than JavaScript?


    P.S.
    I appreciate all the help

    Cheers,
    Steve

  18. #10
  19. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    There are no lookahead or lookbehind assertions in the regex. Not sure what you mean.

    I'm only using the most basic regex features, which work in every standard regex engine, regardless of whether it's "JGsoft" (whatever that means), JavaScript, Perl, PHP, Ruby, you name it. And I'm sure it also works in this VB stuff (at least that's what Google tells me).

    Double check your code to make sure there's no typo or something. If the backslash is used as an escaping character in VB strings, you may have to double it when using it in regex strings.

    If it still "doesn't work", then start debugging it. Can you find "pass"? Can you find "pass.*?"? etc.
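
The step-by-step debugging suggested above could be sketched like this in Python (again only a stand-in for the VBA environment; `SAMPLE`, `STEPS` and `probe` are hypothetical names). Each partial pattern is tried in turn, so the first step that returns nothing points at the piece that breaks.

```python
import re

SAMPLE = "Passport No: TG1234567"

# The full pattern, rebuilt one piece at a time.
STEPS = [
    r"pass",
    r"pass.*?",
    r"pass.*? ",
    r"pass.*? ([a-z]*\d+)",
]


def probe(sample):
    """Return (pattern, matched text or None) for each partial pattern."""
    results = []
    for step in STEPS:
        m = re.search(step, sample, re.IGNORECASE)
        results.append((step, m.group(0) if m else None))
    return results


if __name__ == "__main__":
    for step, text in probe(SAMPLE):
        print(f"{step!r:28} -> {text!r}")
```

Note that the second step matches no more than the first: a lazy ".*?" with nothing after it is satisfied by zero characters, which is expected rather than a sign of a bug.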
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    I agree that the regex suggested by Jacques1 uses completely standard features and it should even work with older tools such as awk, grep, egrep, sed, ed, Emacs, vi, expr, etc.

    Although I can't test that now, I am almost sure that it would work even with the very first regex packages developed by Ken Thompson in the 1970s and Henry Spencer in the 1980s.

    If you don't succeed, then you must be doing something else wrong. Please explain exactly what you are doing, and copy and paste your attempts.
  22. #12
  23. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    6
    Rep Power
    0

    Ah-Huh


    Ahhhh, now I get it! You guys are UNIX guys. If only I could do this on my Solaris or FreeBSD machine, I'd be a happy chappy - writing all of those beautifully crafted statements using commands like awk, egrep, sed etc. My editor of choice is Vi.

    Alas, I'm stuck in the world of Windows, but you were all right: I was doing something wrong. Now I'm getting matches, but they also include the tag.
  24. #13
  25. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    By "tag", do you mean the "Passport:" prefix? The actual number is in the first capturing group, so that's what you need to refer to.

    According to regular-expressions.info, you have to do something like

    Code:
    <match object>.SubMatches(0)
    to get the first group.
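
The same distinction between the whole match and the first group can be shown in Python, as an analogy only: `m.group(0)` plays the role of the VBA match object's value, and `m.group(1)` corresponds to `SubMatches(0)`. The helper name `split_match` is invented for this sketch.

```python
import re

PATTERN = re.compile(r"pass.*? ([a-z]*\d+)", re.IGNORECASE)


def split_match(line):
    """Return (whole match, first capture group), or None if there is no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # group(0) is everything the pattern consumed, including the "Passport" tag;
    # group(1) is only the part inside the parentheses, i.e. the number.
    return m.group(0), m.group(1)


if __name__ == "__main__":
    print(split_match("Passport Number: AB123456"))
    print(split_match("Application No.: 12345678"))
```

Reading the whole match instead of the group is the likely reason the tag shows up in the result.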
  26. #14
  27. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    I am using Unix, Windows and VMS, so I am not specifically a Unix guy. The reason I mentioned awk, grep and the like is only because practical regexes were born in the Unix world.

    But it should work the same way under Windows. Can you explain what you mean when you say that your matched capture includes the tag?
  28. #15
  29. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    6
    Rep Power
    0

    Got It Working


    WoooHooo,

    I finally worked out what I was doing wrong. It works for me now.

    Jacques1 was correct. Thank you Jacques1 and thank you all very much for your time. I appreciate it.

    Cheers,
    SteveL (BM)
