The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Programming Languages
> Python Programming
|
A question about the proper use of {m,n} and {m,n}? in regular expressions
Discuss A question about the proper use of {m,n} and {m,n}? in regular expressions in the Python Programming forum on Dev Shed. A question about the proper use of {m,n} and {m,n}? in regular expressions Python Programming forum discussing coding techniques, tips and tricks, and Zope related information. Python was designed from the ground up to be a completely object-oriented programming language.
|
|
 |
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

February 16th, 2013, 06:34 AM
|
 |
Contributing User
|
|
Join Date: Nov 2012
Location: Iran / France
|
|
A question about the proper use of {m,n} and {m,n}? in regular expressions
Dear all,
OS: Windows Vista (32 bits)
Python version: 3.2.3
Currently I'm reading the online Python 3 library reference, chapter of regular expressions (module re) and there is something that I think I'm not able to understand properly, so I would appreciate if you could kindly give me a hand.
According to the manual: http://docs.python.org/release/3.2.3/library/re.html#module-re
Quote:
{m,n}
... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible ... | Just to see how it works, I created the following test script. We suppose that we wish to search for a string in which between 1 to 3 digits precede the letter 'b' (for example 012b, 00b, 458b, 7b, ... will be valid entries)
Code:
import re
def main():
text = "123456b"
prog = re.compile("(?P<group1>[0-9]{1,3})b")
matchObject = prog.search(text)
if matchObject:
print("the text matches.")
print(matchObject.group("group1"))
else:
print("the text does not match.")
main()
And the output of the script is the following
Code:
C:\> python -tt myscript.py
the text matches.
456
C:\>
Now, according to the manual, {m,n} i greedy (= matches as much text as possible) and this is exactly what we see in the output of the script. We searched between 1 and 3 occurrences of digits before the letter b, and the script printed the longest match (3 occurrences before b) that is 456. After this test I changed the code in the following way: instead of writing
Code:
prog = re.compile("(?P<group1>[0-9]{1,3})b")
I wrote
Code:
prog = re.compile("(?P<group1>[0-9]{1,3}?)b")
According to the manual: Quote: {m,n}?
... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible... | So based on this definition, I expect the output of the script be 6 that is only one digit occurrence before the letter 'b'. Yet, I saw that the script produces the same output as the first test that is 456 (three digit occurrences before the letter 'b'). Obviously, I'm confused, could somebody kindly make some clarification? Why {1,3}? doesn't act based on the inferior born in order to find only one digit occurrence before the letter 'b'?
Thanks in advance,
Dariyoosh
|

February 16th, 2013, 07:24 AM
|
 |
Lord of the Dance
|
|
|
|
|
Interesting case. If you remove b from the RE, then the format {m,n}? does work.
Unfortunately, I don't have any idea why this happens.
|

February 16th, 2013, 10:20 AM
|
 |
Contributing User
|
|
|
|
|
If you think you've found an error please report it to
bugs.python.org
Last I knew, the python re module is supposed to be equivalent to perl regular expressions. How does perl behave? I'm not a native perl programmer, this would be an easier experiment for someone else.
__________________
[code] Code tags[/code] are essential for python code!
Last edited by b49P23TIvg : February 16th, 2013 at 10:22 AM.
|

February 17th, 2013, 12:36 PM
|
 |
Contributing User
|
|
Join Date: Nov 2012
Location: Iran / France
|
|
Well, after a bit googling I found an interesting article here:
http://www.regular-expressions.info/repeat.html
Quote:
... Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match ... | Also after reading again the online python RE module document and doing some more tests, I arrived at the following conclusions: - Search is always done from left to right (I mean the initial pattern search and not the backtracking process)
- The search function stops as soon as the first match has been found
- By default, the longest leftmost pattern (from left to right) is chosen, unless the ? operator being specified (for example {m,n}? instead of {m,n}) but again this is true only if we consider patterns from left to right
Also I created a more elaborated test which seems to me confirming the above rules
Code:
import re
def main():
text1 = "123b456b"
text2 = "b123b456"
# First test with text1 when the letter 'b' comes at the end of the text
prog = re.compile("(?P<group1>[0-9]{1,3}?)b")
matchObject = prog.search(text1)
allMatchObjects = prog.findall(text1)
print("************ Result for the text ", text1, "***********")
if matchObject:
print("The text ", text1, " matches.")
print("group of digits before the letter 'b' = ",
matchObject.group("group1"))
else:
print("The text ", text1, " does not match.")
print("finall = ", end="")
for token in allMatchObjects:
print(token, ", ", end="")
# Second test with test2 when the letter 'b' is at the beginning of the text
prog = re.compile("b(?P<group1>[0-9]{1,3}?)")
matchObject = prog.search(text2)
allMatchObjects = prog.findall(text2)
print("\n")
print("************ Result for the text ", text2, "***********")
if matchObject:
print("The text ", text2, " matches.")
print("group of digits before the letter 'b' = ",
matchObject.group("group1"))
else:
print("The text ", text2, " does not match.")
print("finall = ", end="")
for token in allMatchObjects:
print(token, ", ", end="")
main()
And the output is
Code:
************ Result for the text 123b456b ***********
The text 123b456b matches.
group of digits before the letter 'b' = 123
finall = 123 , 456 ,
************ Result for the text b123b456 ***********
The text b123b456 matches.
group of digits before the letter 'b' = 1
finall = 1 , 4 ,
Comparing both results, we can see that each time the search() function returns the first longest leftmost match ( group of digits before the letter 'b') except that in the case of test2 because of the non greedy regular expression (based on the ? operator) only the first smallest leftmost match is returned.
Likewise in the original post I was trying to search a match between 1 to three digits before the letter 'b'
Code:
text = "123456b"
prog = re.compile("(?P<group1>[0-9]{1,3})b")
I think we can say, that the search starts in the following way
from left to right
1 - the digit 1 is read the search() function continues
2 - the digit 2 is read the search() function continues
3 - the digit 3 is read the search() function continues
4 - the digit 4 is read. Before reading the digit 4 we had already read three digits "123" and according to {1,3} this means that we have already reached the max number of digits before a letter 'b'. As a result, so far, there is no match, but we have not yet arrived to the end of the string, therefore the search() function continues a new search start point (that is the digit 4 that we've just read) by reading the next element in the string
5 - the digit 5 is read the search() function continues
6 - the digit 6 is read the search() function continues
7 - the letter b is read, the engine knows that it has already read three digits "456" , there is a match, this is the first match from left to right and it is returned therefore the program returns " 456" as the output of the script.
This is how I think RE engine functions on the problem that I wrote in the OP
Any remarks/correction will be welcome!
Regards,
Dariyoosh
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|