1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2009
    Rep Power

    Need help in regular expression: My first post in this forum

    I need some help with the regex below. I am working through a site to learn regular expressions. So far it has been going well,
    but I am stuck on the example below. (I want to know how the regex works on a token-by-token basis.)

    The site says:

    Let's take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has failed to match each time .*? matched one more character.

    Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now matched bo, that is what is stored into the capturing group, overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold<. \1 fails again.

    The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give up another character, causing it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again matches >bold<. \1 now succeeds, as does > and an overall match is found. But not the one we wanted.
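    The outcome described in the quoted passage can be checked directly. What follows is a minimal sketch in Python, assuming case-insensitive matching (the pattern uses [A-Z] while the sample tag is lowercase) and assuming Python's re module backtracks in the same way as the engine the site describes. The names loose and strict are illustrative, and the placement of \b right after the capturing group is an assumption about the "word boundary" version the site refers to.

```python
import re

text = "<boo>bold</b>"

# Pattern from the quoted explanation, without the word boundary.
loose = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", re.IGNORECASE)

# Assumed variant: \b after the group, so the tag name cannot shrink
# to a prefix of "boo" when the engine backtracks.
strict = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", re.IGNORECASE)

loose_match = loose.search(text)
strict_match = strict.search(text)

print(loose_match.group(0))  # <boo>bold</b>
print(loose_match.group(1))  # b
print(strict_match)          # None
```

    Without the word boundary the whole string matches with only b in the capturing group; with it, neither bo nor b is followed by a word boundary, so no match is found.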

    The string is <boo>bold</b> and the regex is <([A-Z][A-Z0-9]*)[^>]*>.*?</\1>

    I know I am getting it wrong. However, according to my understanding and the articles I have read on regex so far, I felt the regex should have worked as below. Please correct me.

    Regex --- String
    <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> --- <boo>bold</b>

    1) < consumes <

    2) [A-Z] inside the parentheses consumes b

    3) [A-Z0-9]* inside the parentheses consumes oo

    Therefore, the first backreference stores boo.

    4) [^>]* does not match >

    Since this token has a star, matching zero characters is fine and we proceed to the next token of the regex. The position in the string remains the same.

    5) > consumes > (which is first one in the string)

    6) .*? is lazy, so the regex engine first skips this token (it matches zero characters)

    7) < does not match b

    So the engine backtracks to step 6, and . consumes b. This backtracking repeats over and over until . has consumed "bold".

    8) < consumes < (the second one in the string), and / then consumes /

    9) \1, which I think must hold boo as mentioned in step 3

    boo does not match b

    10) So the engine will backtrack to step 6, and now . will consume "bold<"

    11) < does not match /

    So the engine backtracks, and I guess . will now consume "bold</b"

    But it gets confusing from here. Could anyone please help? The site quoted above explains something else, and I am unable to follow it. Thanks in advance for your patience.
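    The three attempts at the opening tag name can be imitated by hand. This is only a rough sketch, not how a regex engine is implemented: it fixes the capturing group to each candidate in the order a greedy [A-Z0-9]* would give characters back (boo, then bo, then b) and asks whether the rest of the pattern can still match. The helper rest_matches and the variable attempts are hypothetical names made up for this illustration.

```python
import re

text = "<boo>bold</b>"

def rest_matches(name: str) -> bool:
    """Return True if the pattern matches once group 1 is pinned to `name`."""
    # The backreference \1 is replaced by the literal candidate name.
    fixed = "<" + re.escape(name) + r"[^>]*>.*?</" + re.escape(name) + ">"
    return re.match(fixed, text) is not None

# Greedy [A-Z0-9]* tries the longest tag name first, then gives back
# one character per backtracking step.
attempts = [(name, rest_matches(name)) for name in ("boo", "bo", "b")]

for name, ok in attempts:
    print(f"group 1 = {name!r}: {'match' if ok else 'fail'}")
```

    The point this is meant to show is that the backreference is not frozen at boo: when the closing tag fails, the engine goes back into the capturing group, and the group is stored again with a shorter value.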
  2. #2
  3. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Pennsylvania, USA
    Rep Power
    The RegExp Power Toy will show you precisely how the steps are laid out.

    Read the rest of the "resources" thread at the top of this forum for more.


    Comments on this post

    • drgroove agrees : Regex Power Toy... nice ref :D thx!
