September 14th, 2011, 05:01 PM
Regex help!
Hey guys, I am writing a web crawler to pick out URLs on a page. The page is http://www.CENSORED.com/416/tag/linux.html and the URLs I am looking for are the download URLs, such as http://www.CENSORED.com/software/CENSORED.html. I already have a regex, which is:
$expr = '!href="http://www.CENSORED.com/(.*?)/(\d+)-(.*?).html!i';
but that's picking out too much. Any help would be greatly appreciated.
Last edited by requinix; September 14th, 2011 at 05:41 PM.
September 14th, 2011, 05:41 PM
URLs partially censored. No linking to warez sites.
Also, locked. If it was just a generic web crawler that's one thing, but you're specifically coding for one site - and even for a specific URL structure - and it's all very obviously for the purpose of indexing the illegal materials they reference.
I'm fine with your 400 Bad Request thread in PHP because that can be a problem regardless of what sites your bot crawls. But anything at all to do with the afore-censored site or its subject matter is off-limits. Keep it very generic.