#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2015
    Posts
    4
    Rep Power
    0

    Stealthy Web Bots


    I'm trying to learn more about stealthy web bots and web scraping using PHP / cURL. I am using the code below and just testing how stealthy it is. cURL has a referrer option (CURLOPT_REFERER) that is supposed to tell the server which page the request is coming from. This works for the target page, but when the images and CSS style sheets are pulled, the referrer is the actual page that is pulling them. Is there a workaround for this? That kind of defeats the purpose of having a REFERRER ADDRESS.

    I have tried this on my own server just to see whether I can tell if the requests appear to come from a bot or an actual person. With the exception of that referrer, it appears to be working, but that basically removes the ability to appear like a person browsing rather than an automated bot.

    Any thoughts on this? Nothing urgent here; this is more for educational purposes at this point.

    PHP Code:
        $curl_handle = curl_init();
        curl_setopt($curl_handle, CURLOPT_HEADER, true);
        curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0.....");
        curl_setopt($curl_handle, CURLOPT_REFERER, "http://theReferralWebsiteHere.com"); // <-- REFERRER ADDRESS
        curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 120);
        curl_setopt($curl_handle, CURLOPT_TIMEOUT, 120);
        curl_setopt($curl_handle, CURLOPT_MAXREDIRS, 4);
        curl_setopt($curl_handle, CURLOPT_URL, "http://theTargetWebPageHere.html"); // <-- TARGET PAGE BEING SCRAPED
        $buffer = curl_exec($curl_handle);
        curl_close($curl_handle);
  2. #2
  3. Anemic Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,861
    Rep Power
    9433
    Originally Posted by Timothy Park
    Is there a workaround for this?
    Workaround for what?

    Originally Posted by Timothy Park
    That kind of defeats the purpose of having a REFERRER ADDRESS.
    Uh, no? The referrer for the stylesheet is the originating page that told the browser to get it, same way the referrer for a page is the originating page that told the browser to get it.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    So my script is located at something like this..... 302 Found
    I set the referrer with something like this.....
    PHP Code:
     curl_setopt($curl_handle, CURLOPT_REFERER, "http://theReferralWebsiteHere.com");
    When I look at my server logs, the request for the page shows the referrer as ..... theReferralWebsiteHere.com. But the images and CSS files show up in the logs with my original script page as the referrer .... myWebSite.com/myScript.php.

    So an admin looking at the server logs can clearly identify where you are coming from by looking at the referrer on the CSS and image requests, making your bot not so stealthy any longer.

    So my real question is: can you make it appear as if the CSS / images are being pulled from the .... theReferralWebsiteHere.com .... page set via curl_setopt()? If this is not possible, it tells me two things: 1) you can't make your bot truly stealthy, and 2) you can easily identify bots that are trying to hide.

    Hope my question is clear. Thanks again for your help.
  6. #4
  7. Anemic Moderator
    Devshed Supreme Being (6500+ posts)

    Originally Posted by Timothy Park
    So my script is located at something like this..... 302 Found
    When I look at my server logs, the request for the page shows the referrer as ..... theReferralWebsiteHere.com. But the images and CSS files show up in the logs with my original script page as the referrer .... myWebSite.com/myScript.php.
    Because apparently you output the result to the browser, so the browser then starts requesting the stylesheets and everything else itself. Your proxying happened completely behind the curtain - the browser has no idea that you didn't create the content yourself.

    If you want to hide the referrer for those things then you have to actually proxy them too. Which is what you're starting to make now: not a bot but a proxy. Except you'll have to find a way to keep track of which page the resources were requested through - which the browser can't help you with (it'll say myScript.php for everything).
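
    As a rough illustration of "keeping track" of the originating page, one option is to have the proxy script receive both the resource URL and the page that linked to it as query parameters. This is only a sketch: the `url` and `ref` parameter names and the proxied_request_options() helper are made up for this example and are not from the thread or from cURL itself.

```php
<?php
// Hypothetical proxy-side helper: turn ?url=...&ref=... into cURL options.
// 'url' is the resource to fetch, 'ref' is the page that linked to it.
// Both parameter names are assumptions made for this sketch.
function proxied_request_options(array $query): ?array
{
    $url = is_string($query['url'] ?? null) ? $query['url'] : '';
    $ref = is_string($query['ref'] ?? null) ? $query['ref'] : '';

    // Refuse anything that is not a plain http(s) URL.
    if (!preg_match('#^https?://#i', $url)) {
        return null;
    }

    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 4,
    ];

    // Send a Referer only when a plausible originating page was supplied.
    if (preg_match('#^https?://#i', $ref)) {
        $options[CURLOPT_REFERER] = $ref;
    }

    return $options;
}
```

    The script would then call something like proxied_request_options($_GET), pass the result to curl_setopt_array(), and send back a 400 response when the helper returns null.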
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Ok, thanks, I have to do some more homework on this. Thanks for the explanation.
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    OK, so after a little experimenting using a cron job, what I realized is that I would have to pull each file (CSS, images, etc.) with a separate cURL call if I want to mimic a browser rather than look like a bot. Since a human using a browser would pull all those files when visiting a website, I assume an administrator would look for their absence as a sign of a crawler rather than a real person browsing. So pulling all the files on that page and changing the user agent to a browser's seems to be the best approach, at least from what I have figured out so far. Does that sound about right?
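
    A minimal sketch of that idea, assuming the page's asset URLs are already absolute (relative URLs would first have to be resolved against the page URL). The helper names extract_asset_urls() and asset_curl_options() are hypothetical. Each asset request gets the page that linked to it as its referrer, which is what a browser would send.

```php
<?php
// Hypothetical helper: collect the stylesheet, script and image URLs
// that a browser would request after loading the page.
function extract_asset_urls(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML rarely validates, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'stylesheet'
            && $link->getAttribute('href') !== '') {
            $urls[] = $link->getAttribute('href');
        }
    }
    foreach (['script', 'img'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            if ($node->getAttribute('src') !== '') {
                $urls[] = $node->getAttribute('src');
            }
        }
    }

    return array_values(array_unique($urls));
}

// Build the cURL options for one asset. The referrer is the page that
// linked to the asset, not the page named in the original page request.
function asset_curl_options(string $assetUrl, string $pageUrl, string $userAgent): array
{
    return [
        CURLOPT_URL            => $assetUrl,
        CURLOPT_REFERER        => $pageUrl,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 4,
        CURLOPT_TIMEOUT        => 120,
    ];
}
```

    In use, the script would loop over extract_asset_urls($buffer), and for each URL call curl_init(), curl_setopt_array() with the options above, then curl_exec(). Note that $buffer would need CURLOPT_HEADER turned off (or the headers stripped) before parsing.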
  12. #7
  13. Anemic Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,861
    Rep Power
    9433
    Yes, except that you'd "need" to mimic the referrer too, which means knowing which page you just proxied that caused the browser to send you a request for another file.
    And you'll need to rewrite assorted links and URLs to go through your site. That includes HTML and CSS. You'd also need to rewrite JavaScript, which will be basically impossible, so for educational purposes you can skip it.
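
    For the CSS part, the rewriting could look roughly like the sketch below. It assumes absolute URLs inside the stylesheet, and the helper names proxy_url() and rewrite_css_urls() as well as the `url` / `ref` query parameters are invented for illustration. For resources referenced from a stylesheet, the referrer a browser would send is the stylesheet's own URL, so that is what gets recorded.

```php
<?php
// Hypothetical helper: build a link that goes through the proxy script and
// remembers which document referenced the resource.
function proxy_url(string $targetUrl, string $referrerUrl, string $proxyBase): string
{
    return $proxyBase . '?' . http_build_query(
        ['url' => $targetUrl, 'ref' => $referrerUrl],
        '',
        '&'
    );
}

// Rewrite every url(...) reference in a stylesheet to go through the proxy.
// $cssUrl is the URL the stylesheet itself was fetched from.
function rewrite_css_urls(string $css, string $cssUrl, string $proxyBase): string
{
    return preg_replace_callback(
        '/url\(\s*([\'"]?)(.*?)\1\s*\)/i',
        function (array $m) use ($cssUrl, $proxyBase): string {
            // Inline data: URIs never hit the network, so leave them alone.
            if (stripos($m[2], 'data:') === 0) {
                return $m[0];
            }
            return 'url("' . proxy_url($m[2], $cssUrl, $proxyBase) . '")';
        },
        $css
    );
}
```

    The same proxy_url() helper could be applied to href and src attributes in the HTML, with the page URL as the referrer. JavaScript that builds URLs at runtime is not covered by this kind of static rewriting.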
