#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139

    Question A question about the proper use of {m,n} and {m,n}? in regular expressions


    Dear all,


    OS: Windows Vista (32 bits)
    Python version: 3.2.3


    Currently I'm reading the online Python 3 library reference, chapter of regular expressions (module re) and there is something that I think I'm not able to understand properly, so I would appreciate if you could kindly give me a hand.

    According to the manual: http://docs.python.org/release/3.2.3/library/re.html#module-re

    {m,n}
    ... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible ...
    Just to see how it works, I created the following test script. We suppose that we wish to search for a string in which between 1 to 3 digits precede the letter 'b' (for example 012b, 00b, 458b, 7b, ... will be valid entries)

    Code:
    import re
    
    def main():
        text = "123456b"
        prog = re.compile("(?P<group1>[0-9]{1,3})b")
        matchObject = prog.search(text)
        if matchObject:
            print("the text matches.")
            print(matchObject.group("group1"))
        else:
            print("the text does not match.")
        
    main()
    And the output of the script is the following
    Code:
     
    C:\> python -tt myscript.py
    the text matches.
    456
    C:\>
    Now, according to the manual, {m,n} i greedy (= matches as much text as possible) and this is exactly what we see in the output of the script. We searched between 1 and 3 occurrences of digits before the letter b, and the script printed the longest match (3 occurrences before b) that is 456. After this test I changed the code in the following way: instead of writing
    Code:
     prog = re.compile("(?P<group1>[0-9]{1,3})b")
    I wrote
    Code:
     prog = re.compile("(?P<group1>[0-9]{1,3}?)b")
    According to the manual:
    {m,n}?
    ... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible...
    So based on this definition, I expect the output of the script be 6 that is only one digit occurrence before the letter 'b'. Yet, I saw that the script produces the same output as the first test that is 456 (three digit occurrences before the letter 'b'). Obviously, I'm confused, could somebody kindly make some clarification? Why {1,3}? doesn't act based on the inferior born in order to find only one digit occurrence before the letter 'b'?

    Thanks in advance,
    Dariyoosh
  2. #2
  3. Lord of the Dance
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Oct 2003
    Posts
    3,612
    Rep Power
    1945
    Interesting case. If you remove b from the RE, then the format {m,n}? does work.
    Unfortunately, I don't have any idea why this happens.
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,841
    Rep Power
    480
    If you think you've found an error please report it to
    bugs.python.org

    Last I knew, the python re module is supposed to be equivalent to perl regular expressions. How does perl behave? I'm not a native perl programmer, this would be an easier experiment for someone else.
    Last edited by b49P23TIvg; February 16th, 2013 at 10:22 AM.
    [code]Code tags[/code] are essential for python code and Makefiles!
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    139
    Well, after a bit googling I found an interesting article here:

    http://www.regular-expressions.info/repeat.html
    ... Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match ...
    Also after reading again the online python RE module document and doing some more tests, I arrived at the following conclusions:
    • Search is always done from left to right (I mean the initial pattern search and not the backtracking process)

    • The search function stops as soon as the first match has been found

    • By default, the longest leftmost pattern (from left to right) is chosen, unless the ? operator being specified (for example {m,n}? instead of {m,n}) but again this is true only if we consider patterns from left to right


    Also I created a more elaborated test which seems to me confirming the above rules

    Code:
    import re
    
    def main():
        text1 = "123b456b"
        text2 = "b123b456"
    
        # First test with text1 when the letter 'b' comes at the end of the text
        prog = re.compile("(?P<group1>[0-9]{1,3}?)b")
        matchObject = prog.search(text1)
        allMatchObjects = prog.findall(text1)
        print("************ Result for the text ", text1, "***********")
        if matchObject:
            print("The text ", text1, " matches.")
            print("group of digits before the letter 'b' = ",
                matchObject.group("group1"))
        else:
            print("The text ", text1, " does not match.")
        print("finall = ", end="")
        for token in allMatchObjects:
            print(token, ", ", end="")
       
        # Second test with test2 when the letter 'b' is at the beginning of the text
        prog = re.compile("b(?P<group1>[0-9]{1,3}?)")
        matchObject = prog.search(text2)
        allMatchObjects = prog.findall(text2)
        print("\n")
        print("************ Result for the text ", text2, "***********")
        if matchObject:
            print("The text ", text2, " matches.")
            print("group of digits before the letter 'b' = ",
                matchObject.group("group1"))
        else:
            print("The text ", text2, " does not match.")
        print("finall = ", end="")
        for token in allMatchObjects:
            print(token, ", ", end="")
       
    main()
    And the output is

    Code:
    ************ Result for the text  123b456b ***********
    The text  123b456b  matches.
    group of digits before the letter 'b' =  123
    finall = 123 , 456 ,
    
    ************ Result for the text  b123b456 ***********
    The text  b123b456  matches.
    group of digits before the letter 'b' =  1
    finall = 1 , 4 ,
    Comparing both results, we can see that each time the search() function returns the first longest leftmost match (group of digits before the letter 'b') except that in the case of test2 because of the non greedy regular expression (based on the ? operator) only the first smallest leftmost match is returned.

    Likewise in the original post I was trying to search a match between 1 to three digits before the letter 'b'
    Code:
    text = "123456b"
    prog = re.compile("(?P<group1>[0-9]{1,3})b")
    I think we can say, that the search starts in the following way
    from left to right

    1 - the digit 1 is read the search() function continues

    2 - the digit 2 is read the search() function continues

    3 - the digit 3 is read the search() function continues

    4 - the digit 4 is read. Before reading the digit 4 we had already read three digits "123" and according to {1,3} this means that we have already reached the max number of digits before a letter 'b'. As a result, so far, there is no match, but we have not yet arrived to the end of the string, therefore the search() function continues a new search start point (that is the digit 4 that we've just read) by reading the next element in the string

    5 - the digit 5 is read the search() function continues

    6 - the digit 6 is read the search() function continues

    7 - the letter b is read, the engine knows that it has already read three digits "456" , there is a match, this is the first match from left to right and it is returned therefore the program returns "456" as the output of the script.


    This is how I think RE engine functions on the problem that I wrote in the OP

    Any remarks/correction will be welcome!


    Regards,
    Dariyoosh

IMN logo majestic logo threadwatch logo seochat tools logo