December 30th, 2009, 01:31 AM
Need help in regular expression: My first post in this forum
I need some help with the regex below. I am working through a site to learn regular expressions. So far it has been going well,
but I am stuck on the example below. (I want to know how the regex works on a token-by-token basis.)
The site says:
Let's take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has failed to match each time .*? matched one more character.
Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now matched bo, that is what is stored into the capturing group, overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold<. \1 fails again.
The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give up another character, causing it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again matches >bold<. \1 now succeeds, as does > and an overall match is found. But not the one we wanted.
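The backtracking into the capturing group described above can be imitated in a few lines of Python. This is only a rough sketch, not how a regex engine is implemented: it assumes the subject string is <boo>bold</b> and case-insensitive matching, and the helper name trace_group_backtracking is made up for illustration. It shortens the captured tag name one character at a time and checks whether the rest of the pattern can then match.

```python
import re

SUBJECT = "<boo>bold</b>"  # assumed subject string from the tutorial's example


def trace_group_backtracking(subject):
    """Return (captured_text, rest_matched) pairs, longest capture first."""
    # Greedy first attempt: the longest tag name the group can take.
    longest = re.match(r"<([A-Z][A-Z0-9]*)", subject, re.IGNORECASE).group(1)
    attempts = []
    for length in range(len(longest), 0, -1):
        captured = longest[:length]
        rest = subject[1 + length:]
        # Remainder of the pattern, with \1 replaced by the current capture.
        tail = re.compile(
            r"[^>]*>.*?</" + re.escape(captured) + r">", re.IGNORECASE
        )
        matched = tail.match(rest) is not None
        attempts.append((captured, matched))
        if matched:
            break  # the engine stops at the first overall match
    return attempts


for captured, matched in trace_group_backtracking(SUBJECT):
    print(f"group 1 = {captured!r}: {'match' if matched else 'fail'}")
```

The group is tried as boo, then bo, then b, and only the last attempt lets the rest of the pattern succeed.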
The string is <boo>bold</b> and the regex is <([A-Z][A-Z0-9]*)[^>]*>.*?</\1>
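A quick way to check the end result is to run the regex in Python. This is a minimal sketch that assumes case-insensitive matching (the pattern uses [A-Z] while the string is lowercase). The second pattern, with \b placed right after the capturing group, is my assumption about what the "word boundary" version referred to in the quote looks like.

```python
import re

subject = "<boo>bold</b>"

# The regex from the post, without the word boundary.
without_boundary = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", re.IGNORECASE)

# Assumed variant: \b after the group stops the group from backtracking
# to a partial tag name such as "bo" or "b".
with_boundary = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", re.IGNORECASE)

m = without_boundary.search(subject)
print(m.group(0))  # the whole match
print(m.group(1))  # what the capturing group ended up holding

print(with_boundary.search(subject))  # None when there is no match
```

Without the boundary the whole string matches and group 1 holds only b, which is the unwanted match the site describes. With the boundary there is no match at all.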
I know I am getting it wrong. However, according to my understanding and the articles I have read on regex so far, I felt the regex should work as shown below. Please correct me.
Regex token --- String
1) < consumes <
2) [A-Z] in the round brackets consumes b
3) [A-Z0-9]* in the round brackets consumes oo
Therefore, the first backreference stores boo.
4) [^>]* does not match >
Since this token has a star, that is OK and we proceed to the next regex token. The position in the string remains the same.
5) > consumes > (the first one in the string)
6) .*? is lazy, so the regex engine skips this token at first
7) < does not match b
So the engine backtracks to point 6, and . consumes b. Backtracking occurs over and over in the same way until . has consumed "bold".
8) < consumes < (the second one)
9) \1, which I think must have the value boo, as mentioned in point 3
boo does not match b
10) So the engine will backtrack to point 6, and now . will consume "bold<"
11) < does not match /b
So the engine backtracks, and I guess . will now consume "bold</b"
But somehow it gets confusing from here. Could anyone please help? The site explains something else, and I am unable to follow it. Thanks in advance for your patience.
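For the part after steps 10 and 11, it may help to list every length the lazy .*? tries while group 1 still holds boo. The sketch below is only an illustration under that assumption. The helper name lazy_expansion_steps is hypothetical, and it compares plain lowercase text instead of running a real engine.

```python
subject = "<boo>bold</b>"
start = subject.index(">") + 1  # position right after the opening tag's '>'
captured = "boo"                # what group 1 holds on the first attempt


def lazy_expansion_steps(subject, start, captured):
    """List what .*? has consumed at each step and whether </\\1> follows."""
    steps = []
    for end in range(start, len(subject) + 1):
        consumed = subject[start:end]
        # After each one-character expansion the engine retries </\1>.
        closing_follows = subject.startswith("</" + captured + ">", end)
        steps.append((consumed, closing_follows))
    return steps


for consumed, ok in lazy_expansion_steps(subject, start, captured):
    print(f".*? = {consumed!r:12} </{captured}> follows: {ok}")
```

Every step fails, including the last one where .*? has consumed everything up to the end of the string. Only then does the engine give up on this path and backtrack further left, first into [^>]* (which has nothing to give back) and then into the capturing group.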
December 30th, 2009, 10:03 AM
The RegExp Power Toy will show you precisely how the steps are laid out.
Read the rest of the "resources" thread at the top of this forum for more.