Thread: URL filtering

    #1
  1. Permanently Banned
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2007
    Location
    Tacoma, WA
    Posts
    199
    Rep Power
    0

    I've been using this to filter out bad URLs, but it rejects URLs that contain #, !, or ?. Is there a better regex for this job?
    php Code:
    <?php
    // Random "real world" URL
    $offsiteURL = "http://www.facebook.com/people/Eco-Chic/1365032959#!/profile.php?id=1365032959";
    $pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';
    if ((strlen(trim($offsiteURL)) > 0) && (!preg_match($pattern, $offsiteURL))) {  
    	echo "BAD";
    }
     
    ?>
  2. #2
  3. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,119
    Rep Power
    9398
    What constitutes a "bad URL"? Being invalid? Are you sure it's a URL or are you picking them out of some block of text?
    Last edited by requinix; February 13th, 2010 at 05:04 PM.
  4. #3
  5. Permanently Banned
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2007
    Location
    Tacoma, WA
    Posts
    199
    Rep Power
    0
    It's part of input validation.

    The regex I'm using doesn't like #, !, or ? in the URL, that's all that makes it "bad".

    So really, it's a "bad" regex, because #, !, and ? are certainly "OK" in a URL. At least they are widely used.
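    To see where the #, !, and ? actually sit in the sample URL from the first post, it can help to split it into components. The sketch below uses PHP's built-in parse_url(); the variable names are only illustrative. Because the "?" comes after the "#", everything from "!" onward ends up in the fragment rather than in a query string, which is the part the original pattern handles most strictly.

```php
<?php
// Sample URL from the first post.
$offsiteURL = 'http://www.facebook.com/people/Eco-Chic/1365032959#!/profile.php?id=1365032959';

// parse_url() splits a URL into its components without validating it.
$parts = parse_url($offsiteURL);

// Print each component that was found (scheme, host, path, fragment).
foreach ($parts as $name => $value) {
    echo $name, ': ', $value, PHP_EOL;
}
```

    Note that parse_url() only splits the string; it does not decide whether the URL is acceptable.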
  6. #4
  7. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,119
    Rep Power
    9398
  8. #5
  9. Permanently Banned
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2007
    Location
    Tacoma, WA
    Posts
    199
    Rep Power
    0
    Well, I rewrote it myself, and managed to hack out something that works:
    $pattern = '/^(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?$/';
    But I'm unsure how to make the protocol part (ftp|http|https):\/\/ optional.

    I thought of skipping regex altogether and using filter_var with FILTER_VALIDATE_URL, but amazingly it doesn't accept all of the characters that are allowed in a URL.
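    For comparison, here is a minimal sketch of the filter_var route with an optional protocol. The helper name isAcceptableUrl and the choice to prepend "http://" are assumptions for illustration, not something from this thread. FILTER_VALIDATE_URL requires a scheme, so the sketch adds one when it is missing and then limits the result to ftp, http, and https. Which characters the filter accepts can differ between PHP versions, so it is worth testing against your own inputs.

```php
<?php
// Hypothetical helper (not from the thread): accept URLs with or without a protocol.
function isAcceptableUrl(string $url): bool
{
    $url = trim($url);
    if ($url === '') {
        return false;
    }

    // FILTER_VALIDATE_URL rejects URLs without a scheme, so prepend one if absent.
    if (!preg_match('/^[a-z][a-z0-9+.\-]*:\/\//i', $url)) {
        $url = 'http://' . $url;
    }

    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Only allow the protocols mentioned in the thread.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['ftp', 'http', 'https'], true);
}
```

    Usage would be along the lines of: if (!isAcceptableUrl($offsiteURL)) { echo "BAD"; }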
  10. #6
  11. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,127
    Rep Power
    4304
    Try this:
    Code:
    $pattern = '/^((?:ftp|http|https):\/\/)?(\w+\:?\w*@)?(\S+(?:\:\d+)?)(\/|\/(?:[\w#!:.;?+=&%@!\-\/]+))?$/';
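    A "?" after a character or group is short for "{0,1}", so putting it after the first group makes the protocol optional.
    The "?:" at the start of a group makes it non-capturing: the RegExp still groups the alternatives but does not save a back-reference for that sub-group.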
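    A short sketch of both points, using the pattern from this post unchanged. The example.com URLs and the variable names are made up for illustration. The first two calls check that the pattern matches with and without a protocol; the last two compare a capturing group with a non-capturing one.

```php
<?php
// Pattern from the post above, unchanged.
$pattern = '/^((?:ftp|http|https):\/\/)?(\w+\:?\w*@)?(\S+(?:\:\d+)?)(\/|\/(?:[\w#!:.;?+=&%@!\-\/]+))?$/';

// The "?" after the first group makes the protocol optional, so both forms match.
$withProtocol    = preg_match($pattern, 'http://www.example.com/a?b=1#!c');
$withoutProtocol = preg_match($pattern, 'www.example.com/a?b=1#!c');

// Capturing group: the protocol is saved as its own entry in $capturing.
preg_match('/^(ftp|http|https):\/\/(\S+)$/', 'http://example.com/a', $capturing);

// Non-capturing group: the protocol is matched but not saved.
preg_match('/^(?:ftp|http|https):\/\/(\S+)$/', 'http://example.com/a', $nonCapturing);

// $capturing holds the full match, "http", and "example.com/a";
// $nonCapturing holds only the full match and "example.com/a".
```

    Keep in mind that (\S+) accepts almost any run of non-whitespace characters, so this pattern is permissive: it mostly rejects strings that contain whitespace.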

    Comments on this post

    • Weekend Coder agrees : Thank you, sir!
