#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2013
    Posts
    3
    Rep Power
    0

    Optional character in look-around


    Hi people..

    I'm not a programmer, nor will I ever be. Still, I need to learn enough of regex to solve some issues I'm working on. I have learned about optional characters and look-around. Now I need to combine these but I'm running into problems and I can't figure it out. I'm sure an experienced programmer could help me along in a minute....?

    I'm testing with regex powertoy: (java regex syntax (match only))
    (can't post the url though, but it is a link I found somewhere on these forums)

    My target text is:
    wordleft STRING. Wordright

    I need to match:
    STRING

    My regex:
    (?<=wordleft\s).+(?=\.\sWordright)

    And all is well...
    Until I found out that in some cases my regex did not return anything. Turns out that the . (dot) is sometimes omitted in my target text.

    So I changed my regex to:
    (?<=wordleft\s).+(?=\.?\sWordright)

    In my target text without the . this results in: STRING (hurray)
    In my target text with the . this results in: STRING. (booo)
    I need to match STRING whether the . is there or not.
    So I need the . to be optional in my look-around, but when it is there, I don't want to match it...

    Now I tried all sorts of quantifiers but without any luck. What did I miss? Any help would be appreciated!
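
For anyone who wants to reproduce this outside regex powertoy, here is a minimal Java sketch. The class name `GreedyLookaheadDemo` and the helper `extract` are made up for illustration; only the pattern and the two target texts come from the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLookaheadDemo {
    // Pattern from the post: greedy .+ followed by a lookahead with an optional dot.
    static final String GREEDY = "(?<=wordleft\\s).+(?=\\.?\\sWordright)";

    // Hypothetical helper: returns the first match, or null if there is none.
    static String extract(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(GREEDY, "wordleft STRING Wordright"));  // STRING
        System.out.println(extract(GREEDY, "wordleft STRING. Wordright")); // STRING. (dot included)
    }
}
```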
  2. #2
  3. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,938
    Rep Power
    1045
    Hi,

    this is a typical problem, and it's also the reason why your regex is extremely inefficient.

    The dot matches any character, and the pattern .+ swallows the whole string until the very end (or at least until the end of the line).

    So what happens is that the regex first reads in the whole string. Then it realizes the lookahead still has to match after this, so it shortens the match by one character and tries again. The lookahead still doesn't match, so it again reduces the match and tries again. This trial-and-error goes on until the regex has finally arrived just before the whitespace in front of "Wordright". The lookahead never gets to use the optional dot, because in this backwards search the match can already stop one step earlier, with the dot still inside the matched text.
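
The trial-and-error described above can be imitated by hand. This is only a rough sketch of the idea, not how the Java regex engine is actually implemented: the class `BacktrackTrace` and the helper `greedyEnd` are hypothetical, and the start index is hard-coded to the position right after "wordleft ". It starts with the longest candidate and gives back one character per step until the lookahead body fits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackTrace {
    // The lookahead body from the question, checked by hand at each candidate end position.
    static final Pattern AHEAD = Pattern.compile("\\.?\\sWordright");

    // Hypothetical helper: tries the longest candidate first and shrinks it by one
    // character per step, returning the first end index where the lookahead fits.
    static int greedyEnd(String text, int start) {
        for (int end = text.length(); end > start; end--) {
            Matcher m = AHEAD.matcher(text).region(end, text.length());
            if (m.lookingAt()) {
                return end;
            }
        }
        return -1; // no position satisfies the lookahead
    }

    public static void main(String[] args) {
        String text = "wordleft STRING. Wordright";
        int start = "wordleft ".length(); // where the lookbehind is satisfied
        int end = greedyEnd(text, start);
        System.out.println(text.substring(start, end)); // STRING.
    }
}
```

Coming from the right, the first position that works is the one directly after the dot, where `\.?` matches nothing and `\s` matches the space. That is why the dot ends up inside the match.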

    This is obviously extremely inefficient and not what you want. As a rule of thumb: Never use the dot unless you really, truly know what you're doing.

    In this case, you can fix the problem by replacing the greedy + quantifier (which reads as much as it can) with the non-greedy +? quantifier (which reads as little as it can):

    Code:
    (?<=wordleft\s).+?(?=\.\sWordright)
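
A small Java sketch of the greedy versus non-greedy difference, using the pattern exactly as posted. The class `LazyQuantifierDemo`, the helper `extract` and the second target text (with two possible end points) are invented for illustration and are not from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuantifierDemo {
    // Same lookaround in both patterns; only the quantifier differs.
    static final String GREEDY = "(?<=wordleft\\s).+(?=\\.\\sWordright)";
    static final String LAZY   = "(?<=wordleft\\s).+?(?=\\.\\sWordright)";

    // Hypothetical helper: returns the first match, or null if there is none.
    static String extract(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Target text from the question.
        System.out.println(extract(LAZY, "wordleft STRING. Wordright")); // STRING

        // Invented text with two places where the lookahead could succeed.
        String twice = "wordleft A. Wordright B. Wordright";
        System.out.println(extract(GREEDY, twice)); // A. Wordright B
        System.out.println(extract(LAZY, twice));   // A
    }
}
```

The greedy version stops at the last position where the lookahead fits, the non-greedy version at the first.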

  4. #3
    Hi Jacques,

    thank you for the tip. It is frustrating when you have the idea that you are close to a solution but can't find the glitch.. I see now that I was trying to fix my boundary, while I needed to fix my match...

    I did realize that .+ is 'all consuming'. At this point, STRING can be literally anything so I really need to use the .

    I now see that .+? changes from 'get everything you can' to 'get as little as you can'.

    Thanks a bunch
  6. #4
    I still have a question though....

    Code:
    (?<=wordleft\s).+?(?=\.\sWordright)
    matches: STRING in
    wordleft STRING. Wordright

    But does not match anything in
    wordleft STRING Wordright


    Can I match STRING in both targets with one regex?


    EDIT: AARgh..
    Code:
    (?<=wordleft\s).+?(?=\.?\sWordright)
    will do just that. Sorry for the extra post...
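
A quick Java check of that last pattern against both target texts. The class name `OptionalDotDemo` and the helper `extract` are hypothetical; the pattern and the texts are the ones from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalDotDemo {
    // Non-greedy .+? combined with an optional dot inside the lookahead.
    static final String LAZY_OPTIONAL = "(?<=wordleft\\s).+?(?=\\.?\\sWordright)";

    // Hypothetical helper: returns the first match, or null if there is none.
    static String extract(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(LAZY_OPTIONAL, "wordleft STRING. Wordright")); // STRING
        System.out.println(extract(LAZY_OPTIONAL, "wordleft STRING Wordright"));  // STRING
    }
}
```

Because .+? grows one character at a time, it stops at the first position where the lookahead fits. With the dot present, that position is just before the dot, so the dot is covered by the lookahead and stays out of the match.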
