#1
    Having a hard time with a URL parser using curl


    PHP Code:
    <?php

    function isValidUrl($url){
        // first do some quick sanity checks:
        if(!$url || !is_string($url)){
            return false;
        }

        // quick check url is roughly a valid http request: ( http://blah/... )
        // (the dot between host labels is escaped so it only matches a literal ".")
        if( ! preg_match('/^http(s)?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(\/.*)?$/i', $url) ){
            return false;
        }

        $url = get_furl($url);

        // the next bit could be slow, so request the status code only once:
        $code = getHttpResponseCode_using_curl($url);
        if($code != 200 && $code != 404){
            return false;
        }

        // all good!
        return true;
    }
        
    function get_furl($url) {
        $furl = false;

        // First check response headers (get_headers() returns false on failure)
        $headers = get_headers($url);

        // Test for 301 or 302
        if(is_array($headers) && preg_match('/^HTTP\/\d\.\d\s+(301|302)/', $headers[0])) {
            foreach($headers as $value) {
                if(substr(strtolower($value), 0, 9) == "location:") {
                    $furl = trim(substr($value, 9, strlen($value)));
                }
            }
        }

        // Set final URL
        $furl = ($furl) ? $furl : $url;
        return $furl;
    }

    function getHttpResponseCode_using_curl($url, $followredirects = true){
        // returns int responsecode, or false (if url does not exist or connection timeout occurs)
        // NOTE: could potentially take up to 0-30 seconds, blocking further code execution (more or less depending on connection, target site, and local timeout settings)
        // if $followredirects == false: return the FIRST known httpcode (ignore redirects)
        // if $followredirects == true : return the LAST  known httpcode (when redirected)
        if(! $url || ! is_string($url)){
            return false;
        }

        $ch = @curl_init($url);
        if($ch === false){
            return false;
        }
        @curl_setopt($ch, CURLOPT_HEADER         ,true);    // we want headers
        @curl_setopt($ch, CURLOPT_NOBODY         ,true);    // dont need body
        @curl_setopt($ch, CURLOPT_RETURNTRANSFER ,true);    // catch output (do NOT print!)
        if($followredirects){
            @curl_setopt($ch, CURLOPT_FOLLOWLOCATION ,true);
            @curl_setopt($ch, CURLOPT_MAXREDIRS      ,10);  // fairly random number, but could prevent unwanted endless redirects with followlocation=true
        }else{
            @curl_setopt($ch, CURLOPT_FOLLOWLOCATION ,false);
        }
    //      @curl_setopt($ch, CURLOPT_CONNECTTIMEOUT ,5);   // fairly random number (seconds)... but could prevent waiting forever to get a result
    //      @curl_setopt($ch, CURLOPT_TIMEOUT        ,6);   // fairly random number (seconds)... but could prevent waiting forever to get a result
    //      @curl_setopt($ch, CURLOPT_USERAGENT      ,"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1");   // pretend we're a regular browser

        @curl_exec($ch);
        if(@curl_errno($ch)){   // should be 0
            @curl_close($ch);
            return false;
        }

        $code = @curl_getinfo($ch, CURLINFO_HTTP_CODE); // note: php.net documentation shows this returns a string, but really it returns an int
        @curl_close($ch);
        return $code;
    }

    function getUrls(){
        $urls = array(
            "http://www.google.com/",
            "http://b.com/",
            "http://google.com/",
            "http://website-styles.net46.net/",
            "http://images.website-styles.net46.net/",
            "http://hhh.website-styles.net46.net/",
            "http://the-forum.net78.net/",
            "http://www.the-forum.net78.net/"
        );
        return $urls;
    }

    $urls = getUrls(); // some function getting say 10 or more external links

    foreach($urls as $k => $url){
        // this could potentially take 0-30 seconds each
        // (more or less depending on connection, target site, timeout settings...)
        if( ! isValidUrl($url) ){
            unset($urls[$k]);
        }
    }

    echo "yay all done! now show my site";
    foreach($urls as $url){
        echo "<a href=\"{$url}\">{$url}</a><br/>";
    }
    $test_url = get_furl("http://1.com");
    echo "<br />".getHttpResponseCode_using_curl($test_url);

    ?>
    This code works well for parsing URLs and checking whether they exist, except for a small problem I ran into.

    For some reason my code reports that 1.com and b.com return a 200 status, which they shouldn't, because they don't even exist. I have no clue why it's doing this.

    My last problem is that the URL http://hhh.website-styles.net46.net/ doesn't exist, but it gets redirected to the hosting company's 404 page. How would I detect this so my code stops reporting a 200 status for it?
    Last edited by jack13580; April 24th, 2013 at 03:26 AM.
#2
    For one thing you've turned off all the error reporting. Get rid of all the '@' and then see what errors might be occurring.
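
A minimal sketch of what that could look like, assuming the cURL extension is loaded. It makes the same body-less request as the original function, but without any '@', and it returns libcurl's own error number and message together with the final URL and the IP address that answered. The names probeUrl and describeProbe are hypothetical (they are not in the original script), and the timeout values are arbitrary.

```php
<?php
// Show every warning and notice instead of hiding them.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: one request, nothing silenced, diagnostics returned.
function probeUrl(string $url): array
{
    $ch = curl_init($url);
    if ($ch === false) {
        return ['code' => 0, 'errno' => -1, 'error' => 'curl_init failed',
                'final_url' => '', 'ip' => ''];
    }
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // do not print the response
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // arbitrary values (seconds)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);

    return [
        'code'      => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'errno'     => curl_errno($ch),               // 0 means no transport error
        'error'     => curl_error($ch),               // human-readable message
        'final_url' => (string) curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'ip'        => (string) curl_getinfo($ch, CURLINFO_PRIMARY_IP),
    ];
}

// Hypothetical formatter: turns a probeUrl() result into one log line.
function describeProbe(array $r): string
{
    if ($r['errno'] !== 0) {
        return sprintf('cURL error %d: %s', $r['errno'], $r['error']);
    }
    return sprintf('HTTP %d from %s (%s)', $r['code'], $r['final_url'], $r['ip']);
}
```

Calling `echo describeProbe(probeUrl("http://b.com/"));` should then show either a cURL error or the status code along with where the response actually came from. If a name you believe does not exist reports 200, the final URL and IP may point to whatever is answering for it, for example a parked-domain page or a DNS resolver that redirects unknown names.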
#3
    Sarcky
    Quote: "My last problem is that the http://hhh.website-styles.net46.net/ url doesn't exist but it is getting redirected to the hosting companies 404 page, how would I parse this so my code stops showing a 200 header for it"
    You can't. The page does exist, it's just that the page you see is a "page doesn't exist" message. That's a page, and it exists. It's informing you, the human being, that something is wrong, but the computer thinks everything is right.

    You can detect 302 redirects like this one as being invalid, but that will generate a LOT of false positives.
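
A rough sketch of that strict policy, assuming the cURL extension is loaded: request the URL without following redirects and accept it only if the first status code is a 2xx. The names classifyFirstStatus and isStrictlyValid are hypothetical, and the timeout is arbitrary.

```php
<?php
// Hypothetical classifier for the FIRST status code, i.e. the one returned
// when redirects are not followed.
function classifyFirstStatus(int $code): string
{
    if ($code >= 200 && $code < 300) {
        return 'valid';
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';   // rejected under the strict policy
    }
    return 'invalid';        // 0 (no response), 4xx, 5xx
}

// Hypothetical strict check: any redirect counts as "not valid".
function isStrictlyValid(string $url): bool
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }
    curl_setopt($ch, CURLOPT_NOBODY, true);           // headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // do not print the response
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);  // stop at the first response
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // arbitrary value (seconds)
    curl_exec($ch);

    // Treat a transport error as "no response".
    $code = curl_errno($ch) ? 0 : (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return classifyFirstStatus($code) === 'valid';
}
```

The false positives come from the 'redirect' branch: any working site that redirects for ordinary reasons (for instance a bare domain that forwards to its www host, or http forwarding to https) would be thrown out along with the hosting company's error page.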