Thread: Reg Ex Help

    #1

    Hello everyone. I have been trying to write a regex to find web URLs but have not been able to. Fortunately, I have found one that does it for me.

    Code:
    import re
    re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)
    However, although I have my answer, I would still like to know exactly how this works. Is there anyone who can break it down? I've been googling, and on the docs.python site I have found explanations of what each piece does, but I can't seem to put it all together. Any help will be greatly appreciated. Thank you.
    #2
    Code:
    h         Match an "h".
    http[s]   The part in [square brackets] means to match any of these characters.  An "s".
    ...p[s]?  ? means to match zero or one of the preceding regular expression (re).
              The preceding re is [s].  So far we can find http or https.
              Although the brackets aren't needed, they're visually nice.
    ...://    Match :// exactly.  As in, http://
    
    
    ...(?:...)
              (?: starts a group that won't be saved for later substitutions or pattern reuse.
              It doesn't match anything by itself.
    [a-zA-Z]
              match 1 lower or upper case letter
    |[0-9]
              or a digit
    |[$-_@.&+]
              or any of these characters.
    
              WHICH CHARACTERS?  Look up $ and _ in an ASCII table.
              Since $ precedes _ the hyphen indicates the inclusive range of all characters
              between $ and _ (ASCII 36 through 95).  @ . & +, the uppercase letters, and the
              digits are all in this range, so listing them again is redundant.
              Oh well.  The redundancy is not an error.
    |[!*\(\),]
              or any of these characters
    
    |(?:%[0-9a-fA-F][0-9a-fA-F])
              or a group.  The group is a percent sign followed by two hexadecimal digits.
    
    ...)+     close the group.  And finally, the + sign means "match 1 or more of the preceding re".
              The preceding re is the group in parentheses
              (?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))
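To see the breakdown above in action, here is a minimal sketch. It prints every character covered by the range $-_ and rewrites the same pattern in verbose mode, with one comment per piece. The sample text and the example.com / example.org addresses are made up for illustration; they are not from the original post.

```python
import re

# The pattern from post #1, as a raw string so the backslashes reach the
# regex engine unchanged.
URL_RE = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# Every character in the inclusive range "$" to "_" (ASCII 36 through 95).
dollar_to_underscore = ''.join(chr(c) for c in range(ord('$'), ord('_') + 1))
print(dollar_to_underscore)

# The same pattern in verbose mode: whitespace outside character classes is
# ignored and "#" starts a comment, so each piece can be annotated.
URL_RE_VERBOSE = re.compile(r'''
    http[s]?                          # "http" followed by an optional "s"
    ://                               # the literal "://"
    (?:                               # non-capturing group
        [a-zA-Z]                      # a letter
      | [0-9]                         # or a digit
      | [$-_@.&+]                     # or anything from "$" to "_", or @ . & +
      | [!*\(\),]                     # or one of ! * ( ) ,
      | (?:%[0-9a-fA-F][0-9a-fA-F])   # or "%" followed by two hex digits
    )+                                # the group, one or more times
''', re.VERBOSE)

# Hypothetical sample text, assumed for this example only.
sample = 'see http://example.com/a%20b?x=1 and https://example.org/path'
print(URL_RE.findall(sample))
print(URL_RE_VERBOSE.findall(sample))
```

Because the pattern has no capturing groups, findall returns the full text of each match, and both spellings of the pattern should return the same list.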
    [code]Code tags[/code] are essential for python code and Makefiles!
    #3
    Thank you very much!!!! You've helped out a lot. One question: how do I know that the re stops at .com or .org and doesn't keep matching different characters? Sorry, it's not completely clear yet.
    #4
    Let's view a web page's source. A few of the URLs end right before a " character, a > character, or a ) character. Or a space. Or a '.

    Of these, Space and " are the only ones not in the set
    [$-_]

    Therefore, the pattern works by luck where you've tried it. It doesn't, in my opinion, identify the right end of the URL.
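A quick way to check this claim is to test each of those five characters against the pattern and then try it on two tiny pieces of HTML. This is only a sketch: the HTML snippets and the example.com address are invented for illustration, and TIGHTER_RE is one possible tightening that I am assuming here, not something from the thread and not a full URL parser.

```python
import re

# The pattern from post #1.
URL_RE = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# Characters that often follow a URL in page source, as listed in the post.
terminators = ['"', '>', ')', ' ', "'"]

# A character stops the match only if the pattern cannot consume it.
stops = [ch for ch in terminators if URL_RE.fullmatch('http://x' + ch) is None]
print(stops)

# Hypothetical HTML: the same link, quoted two different ways.
double_quoted = '<a href="http://example.com/page">link</a>'
single_quoted = "<a href='http://example.com/page'>link</a>"

print(URL_RE.findall(double_quoted))  # the double quote ends the match
print(URL_RE.findall(single_quoted))  # ' > < and / are all in [$-_], so it runs on

# Assumption: a tighter pattern that ends at whitespace, quotes, angle
# brackets, or a closing parenthesis. It will cut short any URL that
# legitimately contains ")".
TIGHTER_RE = re.compile(r'''https?://[^\s"'<>)]+''')
print(TIGHTER_RE.findall(single_quoted))
```

With the single-quoted link the original pattern keeps going past the end of the URL, which is the "works by luck" problem described above.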