#1
  1. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2006
    Location
    England
    Posts
    634
    Rep Power
    57

    Link parsing, bit buggy


    I currently have this
    Code:
    /(http:\\/\\/)?((www\\.)?[a-z][a-z0-9_\\.-]*\\.[a-z]{2,6}[a-zA-Z0-9\\/\\.\\?&%-]*)/i
    It works almost perfectly, except that it turns "this...this" into links that obviously go nowhere. I have also seen it turn "good.For" into a non-working link.

    Is something to do with periods wrong in my code above?
  2. #2
  3. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Well, if you don't want "good.for" to be considered a URL, you should end your regex with a "list" of top-level domain names:

    Code:
    /your regex here(com|org|us|uk|nl|de|...)/
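A minimal sketch of the TLD-list idea. The list in `$tlds` is deliberately short and hypothetical, and the host part is a simplified stand-in (one or more dot-terminated labels), not the expression from the first post:

```php
<?php
// Hypothetical, deliberately short TLD list; extend it for real use.
$tlds = 'com|org|us|uk|nl|de';

// One or more "label." groups followed by a listed TLD.
// The trailing \b stops a TLD from matching inside a longer word.
$pattern = '#\b(?:http://)?(?:[a-z0-9-]+\.)+(?:' . $tlds . ')\b#i';

$samples = ['this...this', 'good.For', 'www.foo.com', 'http://www.bl.uk/'];
$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($pattern, $sample);
    echo $sample, ' => ', $results[$sample] ? 'link' : 'plain text', "\n";
}
```

With a list like this, a sentence such as "good.de" would still be linked, so the approach reduces false positives rather than eliminating them.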
  4. #3
  5. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2006
    Location
    England
    Posts
    634
    Rep Power
    57
    Ah, right, okay (the forum didn't email me about the new reply, heh). I will give it a go. Where you have "uk", should that not be "co.uk"?
  6. #4
  7. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Although many websites end with ".co.uk", the TLD name is ".uk". So if you require the address to end with ".co.uk", the website of the British Library (http://www.bl.uk/) would be rejected! ;)
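To illustrate the difference, here is a small sketch that compares the two suffixes. The `$host` fragment is a simplified assumption for illustration, and `www.example.co.uk` is a made-up name:

```php
<?php
// Simplified host fragment for illustration only (not the thread's expression).
$host = '(?:[a-z0-9-]+\.)+';

$onlyCoUk = '#^' . $host . 'co\.uk$#i'; // requires the ".co.uk" suffix
$anyUk    = '#^' . $host . 'uk$#i';     // accepts any name under ".uk"

$names = ['www.bl.uk', 'www.example.co.uk'];
$table = [];
foreach ($names as $name) {
    $table[$name] = [
        'co.uk' => preg_match($onlyCoUk, $name),
        'uk'    => preg_match($anyUk, $name),
    ];
    printf("%-18s co.uk: %d  uk: %d\n", $name, $table[$name]['co.uk'], $table[$name]['uk']);
}
```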
  8. #5
  9. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2006
    Location
    England
    Posts
    634
    Rep Power
    57
    Wow, I actually never knew that, lmao. Okay, thanks, I will give it a go in a sec.

    Edit >
    PHP Code:
            $pattern = "/(http:\\/\\/)?((www\\.)?[a-z][a-z0-9_\\.-]*\\.[a-z]{2,6}[a-zA-Z0-9\\/\\.\\?&%-]*)\\.(com|org|us|uk|nl|de)/i";
    Seems to do nothing lol
    Last edited by liamdawe; February 9th, 2009 at 08:22 AM.
  10. #6
  11. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    When you post on a public forum that something "doesn't work", always show how you are testing it so that people can reproduce the (unexpected) behaviour.

    After removing some unnecessary escape characters, "it works" just fine:

    PHP Code:
    <?php
    // Same pattern with the unnecessary escapes removed and "#" as the delimiter.
    $pattern = "#(http://)?((www\.)?[a-z][a-z0-9_.-]*\.[a-z]{2,6}[a-zA-Z0-9/.?&%-]*)\.(com|org|us|uk|nl|de)#i";
    if (preg_match($pattern, 'http://www.foo.com')) {
      echo "OK";
    }
    ?>
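A reproducible test along these lines could feed several inputs through the same pattern and print what was actually matched. The sample inputs below are assumptions chosen for illustration; swap in whatever text misbehaves:

```php
<?php
// Pattern copied from the post above.
$pattern = "#(http://)?((www\.)?[a-z][a-z0-9_.-]*\.[a-z]{2,6}[a-zA-Z0-9/.?&%-]*)\.(com|org|us|uk|nl|de)#i";

// Hypothetical sample inputs; replace them with the text that misbehaves.
$samples = [
    'http://www.foo.com',
    'this...this',
    'http://www.foo.com/forum/blah.php?45345',
];

$found = [];
foreach ($samples as $sample) {
    // $m[0] holds the full matched text when preg_match() returns 1.
    $found[$sample] = preg_match($pattern, $sample, $m) ? $m[0] : null;
    var_dump($sample, $found[$sample]);
}
```

Printing the matched text, not just "OK", makes it easier to see where a match stops short of what was expected.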
  12. #7
  13. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2006
    Location
    England
    Posts
    634
    Rep Power
    57
    Right, it mostly works, but now I need it to match things after the actual domain, like "/forum/blah.php?45345". It did that before but not now?

    Edit > Woo, I did it on my own. How does it look?
    PHP Code:
    $pattern = "#(http://)?((www\.)?[a-z][a-z0-9_.-]*\.[a-z]{2,6}[a-zA-Z0-9/.?&%-]*)(\.com|org|us|uk|nl|de|info)([a-zA-Z0-9\\/\\.\\?&%-]*)#i";
    Last edited by liamdawe; February 10th, 2009 at 06:34 AM.
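One detail worth checking in a group like `(\.com|org|...)`: alternation binds loosely, so the `\.` belongs to the first alternative only. A minimal sketch, using a hypothetical `foo` prefix and a two-entry list to keep it short:

```php
<?php
// In "(\.com|org)" the "\." is part of the first alternative only,
// so "org" may follow "foo" with no dot in between.
$loose = '#^foo(\.com|org)$#';

// Moving the dot outside the group requires it before every alternative.
$strict = '#^foo\.(com|org)$#';

$looseNoDot   = preg_match($loose, 'fooorg');
$looseWithDot = preg_match($loose, 'foo.org');
$strictNoDot  = preg_match($strict, 'fooorg');
$strictDot    = preg_match($strict, 'foo.org');

var_dump($looseNoDot, $looseWithDot, $strictNoDot, $strictDot);
```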
  14. #8
  15. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2004
    Location
    Northern Ireland
    Posts
    59
    Rep Power
    11
    Looks good. Maybe a few things to simplify your expression?
    You don't need the (www\.)? as it is already covered by the pattern "[a-z][a-z0-9_.-]*".
    You also don't need the "\." before the "com", as it is already covered by the preceding expression.

    Just one concern with listing all the TLDs: there may be some that you have not included.
    There is a list here if that helps: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

    However, another way might be to match an expression that starts with "http://" or "www."
    You could then assume the entire string after this match is an address (up to a whitespace character):
    PHP Code:
    $pattern = "#(http://|www\.)[^\s]*#i";
    Just a suggestion.
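As a rough sketch of how that last pattern could be used, `preg_match_all()` collects every match so the results can be printed. The `$text` value is a made-up input for illustration:

```php
<?php
// Pattern from the post: "http://" or "www." followed by any run of non-whitespace.
$pattern = '#(http://|www\.)[^\s]*#i';

// Hypothetical input text for illustration.
$text = 'See http://www.foo.com/forum/blah.php?id=45345 or www.bl.uk for details, this...this is not a link.';

// preg_match_all() returns the number of matches and
// stores every full match in $matches[0].
$count = preg_match_all($pattern, $text, $matches);

print_r($matches[0]);
```

Because the match runs up to the next whitespace, punctuation that directly follows an address (a trailing comma or full stop, say) would be captured as part of it, so some trimming may be needed.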
    "True Power Lies Within The Blood Of Your Peoples Revenge... The Devils Fruit Can Lead Me There..." - Uchiha Sasuke
  16. #9
  17. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2006
    Location
    England
    Posts
    634
    Rep Power
    57
    Thanks for your post. I will have a play around with what you posted when I get to my dev machine tomorrow evening.

    I don't get what you meant in the last part, though, mate?
  18. #10
  19. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    May 2006
    Location
    England
    Posts
    634
    Rep Power
    57
    I have re-read what you posted, jedi_ralf. For the pattern you gave, how do I output what it finds?

    Also, my current pattern is this:
    PHP Code:
    $pattern = "#(http://)?((www\.)?[a-z][a-z0-9_.-]*\.[a-z]{2,6}[a-zA-Z0-9/.?&%-]*)(\.com|org|us|uk|nl|de|info)([a-zA-Z0-9\\/\\.\\?&%-]*)#i";
    It seems to crap out on "=". How can I get it to include an = in links?
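One possible approach to the "=" question is to add the character to the trailing character class, keeping "-" last so that it stays a literal hyphen. The sketch below uses a hypothetical helper, `linkPattern()`, and a simplified host part rather than the full expression above, purely to compare the two classes:

```php
<?php
// Hypothetical helper: builds a simplified link pattern around a given
// trailing character class (not the thread's full expression).
function linkPattern(string $pathClass): string
{
    return '#(?:http://)?(?:www\.)?[a-z][a-z0-9-]*\.(?:com|org|us|uk|nl|de|info)'
        . $pathClass . '*#i';
}

// Made-up URL for illustration.
$url = 'http://www.foo.com/forum/blah.php?id=45345';

// Class from the thread: the match stops at the first "=".
preg_match(linkPattern('[a-zA-Z0-9/.?&%-]'), $url, $before);

// Same class with "=" added; "-" stays last so it remains a literal hyphen.
preg_match(linkPattern('[a-zA-Z0-9/.?&%=-]'), $url, $after);

// Element 0 of the match array is the full matched text.
echo $before[0], "\n";
echo $after[0], "\n";
```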
