Thread: Scraping HTML

    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2008
    Posts
    22
    Rep Power
    0

    Scraping HTML


    Hi,

    I would like to create a script that takes the URL of a web page and the ID of a chosen element, and returns the content of that element.

    I did a little digging but couldn't get anything like that to work. I hope someone can help.

    Thanks
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,378
    Rep Power
    594
    You probably need to use DOM.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2008
    Posts
    22
    Rep Power
    0
    Thanks. Can anyone give me more specific info?
  6. #4
  7. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,850
    Rep Power
    6351
    ...can you? The DOM parses an HTML document and can return elements by their ID. The word was underlined because it was a link to the manual.
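
    As a rough illustration of the point above, here is a minimal sketch of parsing HTML with PHP's DOM extension and looking an element up by its ID. The helper name elementTextById and the sample markup are made up for this example; only DOMDocument, loadHTML() and getElementById() come from PHP itself.

```php
<?php

// Hypothetical helper: returns the trimmed text of the element with the
// given id, or null when no such element exists in the markup.
function elementTextById(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // loadHTML() uses the HTML parser, which treats "id" attributes as IDs,
    // so getElementById() works without a DTD or validation.
    $doc->loadHTML($html);

    $element = $doc->getElementById($id);

    return $element !== null ? trim($element->textContent) : null;
}

// Sample markup (an assumption for illustration, not a real product page).
$html = '<html><body><div id="wirelessPriceFromPrice">$199.99</div></body></html>';

$price = elementTextById($html, 'wirelessPriceFromPrice');
echo $price ?? 'not found', "\n";
```

    The helper returns null rather than failing when the ID is missing, so the caller can decide how to handle a page whose layout has changed.
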
    HEY! YOU! Read the New User Guide and Forum Rules

    "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin

    "The greatest tragedy of this changing society is that people who never knew what it was like before will simply assume that this is the way things are supposed to be." -2600 Magazine, Fall 2002

    Think we're being rude? Maybe you asked a bad question or you're a Help Vampire. Trying to argue intelligently? Please read this.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2008
    Posts
    22
    Rep Power
    0
    Thanks,

    I followed the DOMDocument::getElementById instructions to extract the price of a product from Amazon. This is my code:

    PHP Code:
    <?php

    $doc = new DomDocument;

    // We need to validate our document before referring to the id
    $doc->validateOnParse = true;
    $doc->Load('http://www.amazon.com/gp/product/B008HTJMY6/ref=s9_ri_gw_g107_ir07?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-3&pf_rd_r=13GCRBKJPG8DHF0PSR94&pf_rd_t=101&pf_rd_p=1339748982&pf_rd_i=507846');

    echo "The element whose id is 'wirelessPriceFromPrice' is: " . $doc->getElementById('wirelessPriceFromPrice') . "\n";

    ?>
    I get this error:
    Code:
    Warning: DOMDocument::load() [domdocument.load]: Validation failed: no DTD found !AttValue: " or ' expected in http://www.amazon.com/gp/product/B008HTJMY6/ref=s9_ri_gw_g107_ir07?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-3&pf_rd_r=13GCRBKJPG8DHF0PSR94&pf_rd_t=101&pf_rd_p=1339748982&pf_rd_i=507846, line: 383
    Any idea why and how to make this work?

    Thanks
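
    The warning above comes from two things: load() parses its input as XML, and validateOnParse asks for validation against a DTD, which a typical web page does not supply. A possible alternative, sketched below, is to parse with loadHTML(), which tolerates sloppy markup, and to collect parser messages through libxml instead of having them printed as warnings. The function name loadLenient and the sample markup are assumptions for illustration only.

```php
<?php

// Hypothetical helper: parses HTML leniently and returns the document
// together with any libxml messages that were collected while parsing.
function loadLenient(string $html): array
{
    // Collect parser messages internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Restore whatever error handling mode was active before.
    libxml_use_internal_errors($previous);

    return [$doc, $errors];
}

// Sample markup with an unquoted attribute and a stray closing tag
// (an assumption for illustration, not a real product page).
$html = '<html><body><div><span id=price>$19.99</span></div></div></body></html>';

[$doc, $errors] = loadLenient($html);

$node  = $doc->getElementById('price');
$price = $node !== null ? trim($node->textContent) : 'not found';

echo 'Parser messages collected: ', count($errors), "\n";
echo 'Price: ', $price, "\n";
```

    No validation is requested here, since the HTML parser already recognises id attributes. The collected messages can be logged or ignored as needed.
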
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,378
    Rep Power
    594
    The error is quite explicit. Check here.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2008
    Posts
    22
    Rep Power
    0
    Originally Posted by gw1500se
    The error is quite explicit. Check here.
    OK, so how can I scrape this data from Amazon even though the page doesn't validate?

    Thanks again..
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,378
    Rep Power
    594
    Why do you think you need to validate? Perhaps if you explain in more detail what you are trying to accomplish we could better help.
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2008
    Posts
    22
    Rep Power
    0
    Originally Posted by gw1500se
    Why do you think you need to validate? Perhaps if you explain in more detail what you are trying to accomplish we could better help.
    OK, sure. I would like to be able to scrape prices from the product pages of e-commerce sites.

    For each site, I will find the XPath of the element that contains the price. I need a PHP script that takes the URL of a product page and that XPath, and returns the price.

    I hope you can help me find a simple way to do this.
    Cheers
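
    A minimal sketch of the XPath half of that request, assuming the page source has already been fetched into a string and that fetching is allowed by the site's terms. The function name queryText, the sample markup and the XPath expression are all made up for this example; DOMDocument and DOMXPath are PHP's own classes.

```php
<?php

// Hypothetical helper: returns the trimmed text of the first node matching
// the XPath expression, or null when nothing matches.
function queryText(string $html, string $xpath): ?string
{
    // Keep parser complaints about sloppy markup out of the output.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $finder = new DOMXPath($doc);
    $nodes  = $finder->query($xpath);

    // query() returns false for an invalid expression.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

// Sample markup (an assumption for illustration, not a real product page).
$html = '<html><body><div class="product"><span class="price">$24.50</span></div></body></html>';

$price = queryText($html, '//span[@class="price"]');
echo $price ?? 'not found', "\n";
```

    Keeping the XPath as a parameter means one function can serve several sites, with a per-site expression stored alongside each URL.
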
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,378
    Rep Power
    594
    You still did not explain why you think you need to validate.
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2008
    Posts
    22
    Rep Power
    0
    I don't need to validate - that's something I found on the DOM resource and it's probably not the way to scrape the data I need.

    You can ignore that part - any idea on how to do this?
  22. #12
  23. No Profile Picture
    I haz teh codez!
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Dec 2003
    Posts
    2,548
    Rep Power
    2337
    You're trying to break Amazon's rules?

    Link
    I ♥ ManiacDan & requinix

    This is a sig, and not necessarily a comment on the OP:
    Please don't be a help vampire!
  24. #13
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2008
    Posts
    22
    Rep Power
    0
    Originally Posted by ptr2void
    You're trying to break Amazon's rules?

    Link
    The prices were just an example. I would like to scrape other info too, and not necessarily from Amazon, so not to worry, Amazon is safe for now.


    Anyone??
  26. #14
  27. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,378
    Rep Power
    594
    Being wary of ptr2void's warning (other sites will likely have similar TOS), what are you getting when you don't validate?
  28. #15
  29. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,850
    Rep Power
    6351
    Give an actual example of what you want. We can't continue to help you maybe spider a potential div which may be on a webserver somewhere.