#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Location
    Iran
    Posts
    149
    Rep Power
    140

    RegExp: Question about the caret operator in multiline strings with the re.MULTILINE flag


    Hello everybody


    OS: Windows Vista (32 bits)
    Python version: 3.2.3
    Text editor: Notepad++ with Windows End Of Line


    I have a question about the caret operator within python regular expressions. According to the online documentation (Library reference):
    http://docs.python.org/release/3.2.3/library/re.html
    ... (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline ...
    I wanted to see how this works, as it could be really helpful in some cases, for example to check the first character of each line of a multiline string. I created the following script:

    Code:
    import re
    
    def main():
        text = """line1
    line2
    line3"""
    
        prog = re.compile(r"^l", re.MULTILINE)
        if prog.match(text):
            print ("Text does match the regular expression")
        else:
            print ("Text does not match the regular expression")
        
        
    main()
    In this example, let's say we want to check whether the string matches a pattern in which the first character of each line (right after the newline character) is the letter 'l' (lowercase 'L'). This example obviously works, as 'line1', 'line2' and 'line3' all start with 'l'. Here is the output of the script:

    Code:
    C:\> python -tt myscript.py
    Text does match the regular expression
    C:\>
    Just to see the impact of re.MULTILINE, I changed the first letter of the second line, putting 'D' instead of 'l'. I expected the text to be rejected this time, yet I got the very same output. I checked the online documentation again, this time for the MULTILINE flag in the re module:

    http://docs.python.org/release/3.2.3/library/re.html#re.MULTILINE
    When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
    So if the pattern matches at the beginning of each line, why is the multiline string not rejected when the second line does not start with 'l'?

    Could someone kindly clarify?

    Thanks in advance,
    Dariyoosh
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,997
    Rep Power
    481
    I haven't found anyone yet who understands why there is a re match functionality. Your program might work as you expect if, in place of the nasty match, you use

    prog.search
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. Jealous Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,303
    Rep Power
    9400
    Originally Posted by dariyoosh
    So if the pattern matches at the beginning of each line, why is the multiline string not rejected when the second line does not start with 'l'?
    Because it does match on the first line. All it requires is one match somewhere.

    A better test would be changing the first letter of the first line. With MULTILINE it will still match (on the second line) and without it will not match.
  6. #4
  7. Contributing User
    Originally Posted by b49P23TIvg
    ... Your program might work as you expect if in place of the nasty match use prog.search ...
    The problem with search() is that, as I understand it, it looks for at least one occurrence of the pattern and does not require every line to match (I want to make sure that each line starts with the letter 'l'). For example, if the first two lines don't start with 'l' and only the third line does, search() will still validate the string, because at least one occurrence (in the third line) was found.
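One way to get an "every line" check out of search() is to invert the question: look for a line start that is not followed by 'l', and accept the text only when no such position exists. The following is a minimal sketch of that idea, not something proposed in the thread. The helper name every_line_starts_with_l is hypothetical, and the sketch assumes the text has no blank lines and no trailing newline (either one would count as a failing line here).

```python
import re

# With MULTILINE, '^' matches at the start of the string and after each newline.
# The negative lookahead (?!l) succeeds only where the next character is not 'l'.
BAD_LINE_START = re.compile(r"^(?!l)", re.MULTILINE)


def every_line_starts_with_l(text):
    """Return True when no line start is followed by something other than 'l'."""
    return BAD_LINE_START.search(text) is None


print(every_line_starts_with_l("line1\nline2\nline3"))  # True
print(every_line_starts_with_l("line1\nDine2\nline3"))  # False
```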
  8. #5
  9. Contributing User
    Originally Posted by requinix
    ... Because it does match on the first line. All it requires is one match somewhere.
    I think that is actually, according to the online documentation, the behaviour of the search() function, not the match() function:
    http://docs.python.org/release/3.2.3/library/re.html#re.match
    ...
    Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

    If you want to locate a match anywhere in string, use search() instead.
    ...
    So what I understand is that the MULTILINE flag is irrelevant when using the match() function in the context of my problem.
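A minimal sketch of that difference, using a three-line string like the ones in this thread (the variable names are only for illustration):

```python
import re

text = "sine1\nline2\nline3"  # the first line deliberately starts with 's'
pattern = re.compile(r"^l", re.MULTILINE)

# match() only tries position 0, so MULTILINE does not help it here.
match_result = pattern.match(text)

# search() scans the whole string; with MULTILINE, '^' also matches after a newline.
search_result = pattern.search(text)

# Without MULTILINE, '^' matches only at position 0, so search() fails as well.
plain_search_result = re.search(r"^l", text)

print(match_result)           # None
print(search_result.start())  # 6
print(plain_search_result)    # None
```

In other words, the flag changes where '^' is allowed to match, while the choice between match() and search() decides which positions are tried at all.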

    Originally Posted by requinix
    ... A better test would be changing the first letter of the first line. With MULTILINE it will still match (on the second line) and without it will not match. ...
    Not really, because I changed the code in the following way:
    Code:
    import re
    
    def main():
        # Here the first line starts with 's' instead of 'l'
        text = """sine1
    line2
    line3"""
    
        prog = re.compile(r"^l", re.MULTILINE)
        if prog.match(text):
            print ("Text does match the regular expression")
        else:
            print ("Text does not match the regular expression")
        
        
    main()
    And we can see that even with the MULTILINE flag it doesn't match when we
    change the first letter of the first line.
    Code:
    C:\> python -tt myscript.py
    Text does not match the regular expression
    C:\>
  10. #6
  11. Contributing User
    So here is finally how I managed to solve the problem (using both findall() and match())

    Code:
    import re
    import sys
    
    def main():
        text = """line1
    line2
    line3"""
    
        entireLineProg = re.compile(r"[^\r\n]+")
        lines = entireLineProg.findall(text)
        firstOfLineProg = re.compile("^l")
        for token in lines:
            if firstOfLineProg.match(token):
                continue
            else:
                print ("Text does not match the regular expression")
                print ("bad token = " + token)
                sys.exit(-1)
                
        print ("The text was validated successfully according to the pattern")
        
    main()
    and this time it worked as I expected.
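The same per-line check can also be sketched with str.splitlines() instead of a findall() pattern, returning the offending line rather than exiting. This is only an alternative sketch. The helper name first_bad_line is hypothetical, and splitlines() treats a few more characters as line breaks than just \r and \n. Empty lines are skipped to mirror what the [^\r\n]+ pattern does.

```python
import re


def first_bad_line(text, prefix_pattern=r"l"):
    """Return the first non-empty line that does not match the pattern at its start, or None."""
    prog = re.compile(prefix_pattern)
    for line in text.splitlines():
        # match() is already anchored at the start of each line, so no '^' is needed.
        if line and not prog.match(line):
            return line
    return None


print(first_bad_line("line1\nline2\nline3"))      # None
print(first_bad_line("line1\r\nDine2\r\nline3"))  # Dine2
```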

    Thank you very much both of you for your time and your attention to my problem.


    Regards,
    Dariyoosh
  12. #7
  13. Contributing User
    When using match why trouble yourself with the "start of line caret"?
  14. #8
  15. Contributing User
    Originally Posted by b49P23TIvg
    When using match why trouble yourself with the "start of line caret"?
    This was just one particular example. I'm currently reading the re module documentation in order to learn and better understand regular expressions (which, I should admit, can sometimes become tricky!), and the document explains the different operators, including the caret. I encountered this problem in one of several test scripts that I wrote while reading the document.

    So the purpose of the question was just learning, and in fact there is not necessarily a need to use the caret each time we use match().
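As a small illustration of that last point, for a simple pattern like this one match() should give the same answer with or without the leading caret, because match() only tries the start of the string anyway. A minimal sketch (the sample strings are made up for the comparison):

```python
import re

with_caret = re.compile(r"^l")
without_caret = re.compile(r"l")

samples = ["line1", "xline", ""]

# For each sample, record whether both patterns agree on match / no match.
results = [
    bool(with_caret.match(s)) == bool(without_caret.match(s))
    for s in samples
]

print(results)  # [True, True, True]
```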

    Originally Posted by b49P23TIvg
    I haven't found anyone yet who understands why there is a re match functionality.
    Well, I think my question proved that match() can be useful in some cases.

    Thanks again,
