    #1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2003
    Location
    London
    Posts
    48
    Rep Power
    16

    Screen Scraping multiple pages


    I recently read a tutorial about screen scraping, but it only covered grabbing the data from one page. I want my script to go to certain pages (e.g. http://sharetv.org/shows/family_guy/episodes/199721/01x01 and http://sharetv.org/shows/family_guy/episodes/199723/01x03), only the show episode pages, and grab the show info: Episode, Title, Type, Production Code, First Aired, Summary, and the top-left image. Any ideas? Here's the link to the tutorial for grabbing info from one page: http://www.bradino.com/php/screen-scraping/

    Cheers
    Tom
  2. #2
  3. Banned

    Join Date
    Jul 2004
    Location
    The Mews At Windsor Heights
    Posts
    5,326
    Rep Power
    0
    That's not a good way to do screen scraping.

    Use [PHPNET="book.curl"]CURL[/PHPNET] to grab the page and then load it into a [PHPNET="domdocument"]DOMDocument[/PHPNET].
  4. #3
  5. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2003
    Location
    London
    Posts
    48
    Rep Power
    16
    Originally Posted by b3n
    That's not a good way to do screen scraping.

    Use [PHPNET="book.curl"]CURL[/PHPNET] to grab the page and then load it into a [PHPNET="domdocument"]DOMDocument[/PHPNET].
    How would I go about fetching only certain pages of the website (the episode info pages), and how do I grab the info from them?
  6. #4
  7. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,884
    Rep Power
    6356
    Actually, this is a perfectly acceptable way to do screen scraping, since the DOM document dies on any malformed HTML content, and 8 of the top 10 most visited sites in the world can't be parsed by the DOM. I just ran this script:
    PHP Code:
    $sites = array(
        'http://www.google.com',
        'http://www.facebook.com',
        'http://www.yahoo.com',
        'http://www.youtube.com',
        'http://www.live.com',
        'http://www.wikipedia.org',
        'http://www.blogger.com',
        'http://www.baidu.com',
        'http://www.msn.com',
        'http://www.qq.com',
    );

    foreach ( $sites as $site ) {
        $xmlDoc = new DOMDocument();
        if ( @$xmlDoc->load($site) ) {
            echo "{$site} successful<br />\n";
        } else {
            echo "{$site} invalid<br />\n";
        }
    }
    die();
    Only live.com and msn.com can be parsed by the DOM. It's nice to think about doing it "right" and using the DOM, but that means trusting the other webmaster to write better HTML than yahoo, google, and facebook.

    As for your question: do one, then do the other. You can store the URLs to fetch in an array and loop through them, if the pages are exactly the same, or you can just write the code twice if they're different.
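    A minimal sketch of that array-and-loop approach, in the string-function style of the tutorial linked in the first post (the two URLs are the OP's example episode pages; the grabBetween() helper and the markers passed to it are made up for illustration, since ShareTV's actual markup isn't shown in this thread):
    PHP Code:
    <?php
    // The two example episode pages from the first post.
    $urls = array(
        'http://sharetv.org/shows/family_guy/episodes/199721/01x01',
        'http://sharetv.org/shows/family_guy/episodes/199723/01x03',
    );

    // Grab the text between two markers, tutorial-style (strpos/substr).
    // Returns false if either marker is missing.
    function grabBetween($haystack, $start, $end)
    {
        $s = strpos($haystack, $start);
        if ($s === false) { return false; }
        $s += strlen($start);
        $e = strpos($haystack, $end, $s);
        if ($e === false) { return false; }
        return substr($haystack, $s, $e - $s);
    }

    foreach ($urls as $url) {
        $html = file_get_contents($url);
        if ($html === false) {
            echo "Could not fetch {$url}<br />\n";
            continue;
        }
        // <title> is safe to demo with; the markers for Episode, Summary,
        // etc. would have to be copied from the real page source.
        $title = grabBetween($html, '<title>', '</title>');
        echo "{$url}: " . htmlspecialchars(trim($title)) . "<br />\n";
    }
    ?>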

    -Dan
    HEY! YOU! Read the New User Guide and Forum Rules

    "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin

    "The greatest tragedy of this changing society is that people who never knew what it was like before will simply assume that this is the way things are supposed to be." -2600 Magazine, Fall 2002

    Think we're being rude? Maybe you asked a bad question or you're a Help Vampire. Trying to argue intelligently? Please read this.
  8. #5
  9. I fail at spelling
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Sep 2003
    Location
    NDAuNjIxMTExLC03OS4xNTU=
    Posts
    3,219
    Rep Power
    1779
    Hey,

    Here's what I've been using and it seems to work pretty well: http://simplehtmldom.sourceforge.net/manual.htm
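    For what it's worth, a short sketch of the library's documented usage (file_get_html() and find() are from the simplehtmldom manual linked above; the URL is just the OP's first example page):
    PHP Code:
    <?php
    // Requires simple_html_dom.php from the project linked above.
    include_once 'simple_html_dom.php';

    // One of the episode pages from the first post.
    $html = file_get_html('http://sharetv.org/shows/family_guy/episodes/199721/01x01');

    if ($html) {
        // find() takes CSS-style selectors; list every image on the page.
        foreach ($html->find('img') as $img) {
            echo $img->src . "<br />\n";
        }
        $html->clear(); // free the parsed tree
    }
    ?>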
    I am working now with Symfony2, Twig, Doctrine, Composer, Assetic, and HTML5. Enjoying doing what I do everyday!
  10. #6
  11. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2003
    Location
    London
    Posts
    48
    Rep Power
    16
    I'm looking to copy basically the whole of the TV show episodes site, so it would take too long to add all the links manually. Is there a way to grab all the info from http://sharetv.org/shows/*/episodes/, where * is any TV show, so it grabs all the episode info?

    Cheers
    Tom
  12. #7
  13. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,884
    Rep Power
    6356
    You'll have to write a site spider that crawls some kind of index page or search page, searches for all the TV show links, then crawls each one sequentially.

    -Dan
  14. #8
  15. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2003
    Location
    London
    Posts
    48
    Rep Power
    16
    Originally Posted by ManiacDan
    You'll have to write a site spider that crawls some kind of index page or search page, searches for all the TV show links, then crawls each one sequentially.

    -Dan
    Any ideas where I can find out how to do that?

    Cheers
    Tom
  16. #9
  17. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,884
    Rep Power
    6356
    1) Fetch the index page.
    2) preg_match_all all the links from the page you want to spider.
    3) foreach link, spider the resulting page.
    4) Paginate the index page as necessary.

    You're already following a screen-scraping tutorial; add a loop to that and you're done.
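    A rough sketch of those four steps (the index URL is an assumption, and the href pattern just guesses at the URL shape visible in the first post; both would need checking against ShareTV's real markup):
    PHP Code:
    <?php
    // 1) Fetch the index page (assumed URL).
    $index = file_get_contents('http://sharetv.org/shows/family_guy/episodes/');

    // 2) preg_match_all the episode links. The pattern guesses at the shape
    //    seen in the first post: /shows/<show>/episodes/<id>/<SSxEE>
    preg_match_all('#href="(/shows/[^"]+/episodes/\d+/\d+x\d+)"#i', $index, $matches);

    // 3) foreach link, spider the resulting page.
    foreach (array_unique($matches[1]) as $path) {
        $html = file_get_contents('http://sharetv.org' . $path);
        // ...scrape Episode, Title, Summary, etc. from $html here...
        echo "Fetched {$path}<br />\n";
    }

    // 4) Pagination: repeat the above for each index page, following the
    //    "next page" link until there isn't one.
    ?>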

    -Dan
  18. #10
  19. Banned

    Join Date
    Jul 2004
    Location
    The Mews At Windsor Heights
    Posts
    5,326
    Rep Power
    0
    Originally Posted by ManiacDan
    > Since the DOM document dies on any malformed HTML content
    This is incorrect; you're confusing it with XML. You need to use [PHPNET="domdocument.loadhtml"]loadHTML()[/PHPNET], not [PHPNET="domdocument.load"]load()[/PHPNET].

    The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.
    Of course most websites are invalid, so suppress the errors about invalid markup.
    PHP Code:
    <?php
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729)');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_REFERER, $url);
    curl_setopt($ch, CURLOPT_ENCODING, ''); // all supported encoding types (identity,deflate,gzip)
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_NOBODY, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 45);

    if ( ! $html = curl_exec($ch) )
    {
        echo '<p>'.curl_error($ch).'</p>';
    }
    else
    {
        $dom = new DOMDocument;

        if ( @$dom->loadHTML($html) ) // suppress warning errors
        {
            $imgs = $dom->getElementsByTagName('img');
        }
    }
    ?>
    preg_ functions are very slow. Using preg_ functions on an (invalid) HTML file is EVEN slower.
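    Once the page is in a DOMDocument, DOMXPath is the usual way to pull out specific fields. A hypothetical standalone sketch (the XPath queries use invented class/id names; the real ones would have to be read from ShareTV's page source):
    PHP Code:
    <?php
    // Fetch one episode page (file_get_contents for brevity; the cURL
    // code above is the more robust way to fetch).
    $html = file_get_contents('http://sharetv.org/shows/family_guy/episodes/199721/01x01');

    $dom = new DOMDocument;
    @$dom->loadHTML($html); // suppress warnings about invalid markup

    $xpath = new DOMXPath($dom);

    // Invented selectors -- inspect the real page source to find the actual
    // elements wrapping Episode, Title, Summary and the top-left image.
    $titles = $xpath->query('//div[@class="episode-title"]');
    $images = $xpath->query('//div[@id="content"]//img');

    if ($titles->length) {
        echo trim($titles->item(0)->textContent) . "\n";
    }
    if ($images->length) {
        echo $images->item(0)->getAttribute('src') . "\n";
    }
    ?>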
  20. #11
  21. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,884
    Rep Power
    6356
    Hmm, you're right, it does pass the top 10 with "loadHTML()".

    Still though, is it faster to treat an entire HTML document as a tree of objects rather than just running string parsing functions on them? I think a preg_match_all would be faster than loading it into the DOM and running getElement*. Not going to bother speed-testing it now though.

    -Dan
  22. #12
  23. Banned

    Join Date
    Jul 2004
    Location
    The Mews At Windsor Heights
    Posts
    5,326
    Rep Power
    0
    Originally Posted by ManiacDan
    Still though, is it faster to treat an entire HTML document as a tree of objects rather than just running string parsing functions on them?
    Yes. Trust me. I tested it recently - DOMDocument is MUCH faster.
  24. #13
  25. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,884
    Rep Power
    6356
    Interesting. I did a speed test:
    PHP Code:
    $sites = array(
        'http://www.google.com',
        'http://www.facebook.com',
        'http://www.yahoo.com',
        'http://www.youtube.com',
        'http://www.live.com',
        'http://www.wikipedia.org',
        'http://www.blogger.com',
        'http://www.baidu.com',
        'http://www.msn.com',
        'http://www.qq.com',
    );

    foreach ( $sites as $site ) {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729)');
        curl_setopt($ch, CURLOPT_URL, $site);
        curl_setopt($ch, CURLOPT_REFERER, $site);
        curl_setopt($ch, CURLOPT_ENCODING, ''); // all supported encoding types (identity,deflate,gzip)
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, false);
        curl_setopt($ch, CURLOPT_NOBODY, false);
        curl_setopt($ch, CURLOPT_TIMEOUT, 45);

        if ( ! $html = curl_exec($ch) )
        {
            echo '<p>'.curl_error($ch).'</p>';
        }
        else
        {
            $xmlDoc = new DOMDocument();
            if ( @$xmlDoc->loadHTML($html) ) {
                echo "{$site} successful.  Found " . count($xmlDoc->getElementsByTagName('img')) . " images.<br />\n";
            } else {
                echo "{$site} invalid<br />\n";
            }
        }
    }
    die();
    Script executed in 4.57 seconds.
    PHP Code:
    $sites = array(
        'http://www.google.com',
        'http://www.facebook.com',
        'http://www.yahoo.com',
        'http://www.youtube.com',
        'http://www.live.com',
        'http://www.wikipedia.org',
        'http://www.blogger.com',
        'http://www.baidu.com',
        'http://www.msn.com',
        'http://www.qq.com',
    );

    foreach ( $sites as $site ) {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729)');
        curl_setopt($ch, CURLOPT_URL, $site);
        curl_setopt($ch, CURLOPT_REFERER, $site);
        curl_setopt($ch, CURLOPT_ENCODING, ''); // all supported encoding types (identity,deflate,gzip)
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, false);
        curl_setopt($ch, CURLOPT_NOBODY, false);
        curl_setopt($ch, CURLOPT_TIMEOUT, 45);

        if ( ! $html = curl_exec($ch) )
        {
            echo '<p>'.curl_error($ch).'</p>';
        }
        else
        {
            if ( preg_match_all("#<img[^>]+>#i", $html, $matches) ) {
                echo "{$site} successful.  Found " . count($matches[0]) . " images.<br />\n";
            } else {
                echo "{$site} invalid<br />\n";
            }
        }
    }
    die();
    Script executed in 4.41 seconds.

    It's also worth noting that the DOMDocument method didn't actually return the correct number of images, but that's because count() doesn't work on a DOMNodeList; its ->length property holds the real count. I never really used DOM.

    If you're doing anything more complex than "find the occurrences of this tag," you should probably use DOM, but for things like "find this section of a page" or "grab all the images," I still recommend string functions. Native string functions (like substr, from the example given by the OP) are faster than preg, and preg seems to be about as fast as DOM.
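    For completeness, the string-function version of "grab all the images" might look like this (a sketch using only strpos/substr, no preg and no DOM):
    PHP Code:
    <?php
    // Find every <img ...> tag with plain string functions.
    $html   = file_get_contents('http://www.google.com');
    $images = array();
    $offset = 0;

    while (($start = strpos($html, '<img', $offset)) !== false) {
        $end = strpos($html, '>', $start);
        if ($end === false) {
            break; // unterminated tag; give up
        }
        $images[] = substr($html, $start, $end - $start + 1);
        $offset = $end + 1;
    }

    echo 'Found ' . count($images) . " images\n";
    ?>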

    -Dan
  26. #14
  27. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2003
    Location
    London
    Posts
    48
    Rep Power
    16
    You can see the code I'm using at http://www.e2tv.tv/p.txt and http://www.e2tv.tv/cScrape.txt, but when you go to p.php it just shows a blank page. Please tell me where I'm going wrong and how to fix it.

    Cheers
    Tom
  28. #15
  29. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2005
    Location
    Internet
    Posts
    7,625
    Rep Power
    6088
    Paste the code here...
    Chat Server Project & Tutorial | WiFi-remote-control sailboat (building) | Joke Thread
    “Rational thinkers deplore the excesses of democracy; it abuses the individual and elevates the mob. The death of Socrates was its finest fruit.”
    Use XXX in a comment to flag something that is bogus but works. Use FIXME to flag something that is bogus and broken. Use TODO to leave yourself reminders. Calling a program finished before all these points are checked off is lazy.
    -Partial Credit: Sun

    If I ask you to redescribe your problem, it's because when you describe issues in detail, you often get a *click* and you suddenly know the solutions.
    Ches Koblents