#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2012
    Posts
    4
    Rep Power
    0

    Custom Lenient URL Regex


    I need a very lenient URL regular expression that works with Python's re module. Before you say it: I have Googled and tried many of the ones available on the web.

    Writing a robust regular expression like this myself would be way beyond my current abilities.

    If you Google for "improved_regex_for_matching_urls" you will find the most popular one, but it has a couple of flaws:
    * It cannot cope with subdomains
    * It matches www . google . com (ignore the spaces; I added them to get past the forum's new-user link block) but not google.com

    Perhaps someone could modify the above one or write another one from scratch. I believe other people might find this solution useful, perhaps for making a wiki/CMS.
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,709
    Rep Power
    480
    I avoid writing complicated regular expressions, preferring instead a divide-and-conquer approach.
    Another good thing to avoid: the ill-defined problem.

    Perhaps you could write in Backus-Naur form what you consider a valid address?

    Perhaps you could use the popular regular expression and, if it fails, prepend www. and try again.
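
    A minimal sketch of that retry idea. The popular pattern itself is not quoted in this thread, so STRICT_URL below is a hypothetical stand-in that, like it, only accepts hosts starting with www. The names STRICT_URL and match_lenient are assumptions for illustration only.

```python
import re

# Hypothetical stand-in for "the popular regular expression"; the real one is
# not reproduced in this thread. It only accepts hosts that begin with "www.".
STRICT_URL = re.compile(r"(?i)^(?:https?://)?www\.[0-9a-z-]+(?:\.[0-9a-z-]+)+$")


def match_lenient(candidate, pattern=STRICT_URL):
    """Try the pattern as-is; if it fails, prepend 'www.' and try once more."""
    match = pattern.search(candidate)
    if match is None:
        match = pattern.search("www." + candidate)
    return match


for text in ("www.google.com", "google.com", "not a url"):
    print(text, "->", bool(match_lenient(text)))
```

    Note that this simple version does not handle a candidate that already carries a scheme (http://google.com), since the prefix would land in front of the scheme; that case would need the www. inserted after the ://.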
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2012
    Posts
    4
    Rep Power
    0
    I remembered that I have an old friend who uses regexes, and he wrote me this wonderful little one:
    Code:
    (?i)([^a-z0-9]|^)((http|https)://)?(?P<domain>([0-9a-z]+\.)*[^\.][0-9a-z]*\.[a-z]{2,5})([^a-z0-9]|$)
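
    For anyone who wants to try it, here is a minimal sketch of using the pattern above with Python's re module and pulling out the named group domain. The helper name find_domain and the sample strings are assumptions for illustration, not part of the original pattern.

```python
import re

# The pattern from the post above, unchanged, split across lines for
# readability. The leading (?i) makes the whole pattern case-insensitive.
LENIENT_URL = re.compile(
    r"(?i)([^a-z0-9]|^)((http|https)://)?"
    r"(?P<domain>([0-9a-z]+\.)*[^\.][0-9a-z]*\.[a-z]{2,5})"
    r"([^a-z0-9]|$)"
)


def find_domain(text):
    """Return the first domain-like substring in text, or None (hypothetical helper)."""
    match = LENIENT_URL.search(text)
    return match.group("domain") if match else None


print(find_domain("google.com"))                     # google.com
print(find_domain("www.google.com"))                 # www.google.com
print(find_domain("http://mail.example.org/inbox"))  # mail.example.org
print(find_domain("see google.com for details"))     # google.com
print(find_domain("localhost"))                      # None
```

    The pattern only captures the host part; any path after the domain is ignored, and the top-level domain is limited to 2 to 5 letters.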
