Python Programming
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming LanguagesPython Programming

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old February 16th, 2013, 06:34 AM
dariyoosh's Avatar
dariyoosh dariyoosh is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2012
Location: Iran / France
Posts: 132 dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level) 
Time spent in forums: 2 Days 6 h 25 m 17 sec
Reputation Power: 133
Question A question about the proper use of {m,n} and {m,n}? in regular expressions

Dear all,


OS: Windows Vista (32 bits)
Python version: 3.2.3


Currently I'm reading the online Python 3 library reference, chapter of regular expressions (module re) and there is something that I think I'm not able to understand properly, so I would appreciate if you could kindly give me a hand.

According to the manual: http://docs.python.org/release/3.2.3/library/re.html#module-re

Quote:
{m,n}
... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible ...
Just to see how it works, I created the following test script. We suppose that we wish to search for a string in which between 1 to 3 digits precede the letter 'b' (for example 012b, 00b, 458b, 7b, ... will be valid entries)

Code:
import re

def main():
    text = "123456b"
    prog = re.compile("(?P<group1>[0-9]{1,3})b")
    matchObject = prog.search(text)
    if matchObject:
        print("the text matches.")
        print(matchObject.group("group1"))
    else:
        print("the text does not match.")
    
main()
And the output of the script is the following
Code:
 
C:\> python -tt myscript.py
the text matches.
456
C:\>
Now, according to the manual, {m,n} i greedy (= matches as much text as possible) and this is exactly what we see in the output of the script. We searched between 1 and 3 occurrences of digits before the letter b, and the script printed the longest match (3 occurrences before b) that is 456. After this test I changed the code in the following way: instead of writing
Code:
 prog = re.compile("(?P<group1>[0-9]{1,3})b") 
I wrote
Code:
 prog = re.compile("(?P<group1>[0-9]{1,3}?)b") 
According to the manual:
Quote:
{m,n}?
... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible...
So based on this definition, I expect the output of the script be 6 that is only one digit occurrence before the letter 'b'. Yet, I saw that the script produces the same output as the first test that is 456 (three digit occurrences before the letter 'b'). Obviously, I'm confused, could somebody kindly make some clarification? Why {1,3}? doesn't act based on the inferior born in order to find only one digit occurrence before the letter 'b'?

Thanks in advance,
Dariyoosh

Reply With Quote
  #2  
Old February 16th, 2013, 07:24 AM
MrFujin's Avatar
MrFujin MrFujin is offline
Lord of the Dance
Dev Shed Loyal (3000 - 3499 posts)
 
Join Date: Oct 2003
Posts: 3,130 MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level)MrFujin User rank is General 11st Grade (Above 100000 Reputation Level) 
Time spent in forums: 2 Months 2 Weeks 23 h 46 m 38 sec
Reputation Power: 1736
Interesting case. If you remove b from the RE, then the format {m,n}? does work.
Unfortunately, I don't have any idea why this happens.

Reply With Quote
  #3  
Old February 16th, 2013, 10:20 AM
b49P23TIvg's Avatar
b49P23TIvg b49P23TIvg is offline
Contributing User
Dev Shed Loyal (3000 - 3499 posts)
 
Join Date: Aug 2011
Posts: 3,383 b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level)b49P23TIvg User rank is Major (30000 - 40000 Reputation Level) 
Time spent in forums: 1 Month 2 Weeks 3 Days 13 h 44 m 42 sec
Reputation Power: 383
If you think you've found an error please report it to
bugs.python.org

Last I knew, the python re module is supposed to be equivalent to perl regular expressions. How does perl behave? I'm not a native perl programmer, this would be an easier experiment for someone else.
__________________
[code]Code tags[/code] are essential for python code!

Last edited by b49P23TIvg : February 16th, 2013 at 10:22 AM.

Reply With Quote
  #4  
Old February 17th, 2013, 12:36 PM
dariyoosh's Avatar
dariyoosh dariyoosh is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2012
Location: Iran / France
Posts: 132 dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level)dariyoosh User rank is First Lieutenant (10000 - 20000 Reputation Level) 
Time spent in forums: 2 Days 6 h 25 m 17 sec
Reputation Power: 133
Well, after a bit googling I found an interesting article here:

http://www.regular-expressions.info/repeat.html
Quote:
... Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match ...
Also after reading again the online python RE module document and doing some more tests, I arrived at the following conclusions:
  • Search is always done from left to right (I mean the initial pattern search and not the backtracking process)
  • The search function stops as soon as the first match has been found
  • By default, the longest leftmost pattern (from left to right) is chosen, unless the ? operator being specified (for example {m,n}? instead of {m,n}) but again this is true only if we consider patterns from left to right

Also I created a more elaborated test which seems to me confirming the above rules

Code:
import re

def main():
    text1 = "123b456b"
    text2 = "b123b456"

    # First test with text1 when the letter 'b' comes at the end of the text
    prog = re.compile("(?P<group1>[0-9]{1,3}?)b")
    matchObject = prog.search(text1)
    allMatchObjects = prog.findall(text1)
    print("************ Result for the text ", text1, "***********")
    if matchObject:
        print("The text ", text1, " matches.")
        print("group of digits before the letter 'b' = ",
            matchObject.group("group1"))
    else:
        print("The text ", text1, " does not match.")
    print("finall = ", end="")
    for token in allMatchObjects:
        print(token, ", ", end="")
   
    # Second test with test2 when the letter 'b' is at the beginning of the text
    prog = re.compile("b(?P<group1>[0-9]{1,3}?)")
    matchObject = prog.search(text2)
    allMatchObjects = prog.findall(text2)
    print("\n")
    print("************ Result for the text ", text2, "***********")
    if matchObject:
        print("The text ", text2, " matches.")
        print("group of digits before the letter 'b' = ",
            matchObject.group("group1"))
    else:
        print("The text ", text2, " does not match.")
    print("finall = ", end="")
    for token in allMatchObjects:
        print(token, ", ", end="")
   
main()


And the output is

Code:
************ Result for the text  123b456b ***********
The text  123b456b  matches.
group of digits before the letter 'b' =  123
finall = 123 , 456 ,

************ Result for the text  b123b456 ***********
The text  b123b456  matches.
group of digits before the letter 'b' =  1
finall = 1 , 4 ,


Comparing both results, we can see that each time the search() function returns the first longest leftmost match (group of digits before the letter 'b') except that in the case of test2 because of the non greedy regular expression (based on the ? operator) only the first smallest leftmost match is returned.

Likewise in the original post I was trying to search a match between 1 to three digits before the letter 'b'
Code:
text = "123456b"
prog = re.compile("(?P<group1>[0-9]{1,3})b")

I think we can say, that the search starts in the following way
from left to right

1 - the digit 1 is read the search() function continues

2 - the digit 2 is read the search() function continues

3 - the digit 3 is read the search() function continues

4 - the digit 4 is read. Before reading the digit 4 we had already read three digits "123" and according to {1,3} this means that we have already reached the max number of digits before a letter 'b'. As a result, so far, there is no match, but we have not yet arrived to the end of the string, therefore the search() function continues a new search start point (that is the digit 4 that we've just read) by reading the next element in the string

5 - the digit 5 is read the search() function continues

6 - the digit 6 is read the search() function continues

7 - the letter b is read, the engine knows that it has already read three digits "456" , there is a match, this is the first match from left to right and it is returned therefore the program returns "456" as the output of the script.


This is how I think RE engine functions on the problem that I wrote in the OP

Any remarks/correction will be welcome!


Regards,
Dariyoosh

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming LanguagesPython Programming > A question about the proper use of {m,n} and {m,n}? in regular expressions

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap