Thread: Regex help !

    #1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    16
    Rep Power
    0

    Regex help !


    Hey guys, I am writing a web crawler to pick out URLs on a page. The page is http://www.CENSORED.com/416/tag/linux.html and the URLs I am looking for are the download URLs, such as http://www.CENSORED.com/software/CENSORED.html. I already have a regex, which is:
    Code:
     $expr = '!href="http://www.CENSORED.com/(.*?)/(\d+)-(.*?).html!i';
    but that's picking out too much. Any help would be greatly appreciated.

    Thank you

    james
    Last edited by requinix; September 14th, 2011 at 05:41 PM.
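
Kept generic, here is a minimal sketch of why a pattern shaped like the one above can match too much, and one way to tighten it. Everything in it is an assumption for illustration: example.com stands in for any site, and the sample markup and the names $html, $loose and $strict are hypothetical. The unescaped dots match any character, and the lazy (.*?) groups can still run across quotes and tags until something matches, so negated character classes are used instead.

```php
<?php
// Hypothetical sample markup on a single line; example.com is a placeholder.
$html = '<a href="http://www.example.com/software/123-some-tool.html">A</a> '
      . '<a href="http://www.example.com/about.html">B</a> '
      . '<a href="http://www.example.com/news/77-update.html">C</a>';

// Same shape as the pattern in the post: unescaped dots, lazy wildcards,
// and nothing that stops a group at the closing quote.
$loose = '!href="http://www.example.com/(.*?)/(\d+)-(.*?).html!i';

// Tighter variant: dots escaped, each path segment limited to characters
// that are neither "/" nor a double quote, and the closing quote required.
$strict = '!href="(http://www\.example\.com/([^/"]+)/(\d+)-([^/"]+)\.html)"!i';

preg_match_all($loose, $html, $looseMatches, PREG_SET_ORDER);
preg_match_all($strict, $html, $strictMatches, PREG_SET_ORDER);

// The second loose match starts at the about.html link and runs on
// into the following link, which is the over-matching behaviour.
echo "Loose:\n";
foreach ($looseMatches as $m) {
    echo '  ', $m[0], "\n";
}

// The strict pattern captures only the two URLs with a numeric prefix.
echo "Strict:\n";
foreach ($strictMatches as $m) {
    echo '  ', $m[1], "\n";
}
```

For anything beyond a quick script, parsing the page with DOMDocument and then testing each href attribute against a pattern is usually more robust than running a regex over raw HTML.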
  2. #2
  3. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,295
    Rep Power
    9400
    URLs partially censored. No linking to warez sites.

    Also, locked. If it was just a generic web crawler that's one thing, but you're specifically coding for one site - and even for a specific URL structure - and it's all very obviously for the purpose of indexing the illegal materials they reference.

    I'm fine with your 400 Bad Request thread in PHP because that can be a problem regardless of what sites your bot crawls. But anything having to do in any way whatsoever with the afore-censored site or its subject matter is off-limits. Keep it very generic.
