#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Location
    USA
    Posts
    4
    Rep Power
    0

    PHP extract 'img src' from HTML


    Hey all,

    I have a pet project with a working script that finds articles across a few sites (CNN, Geek, Huff Post, Wired) and stores each article URL in a MySQL DB for later use. I also extract the title of each article and store it in the DB. That part works with no issues. The problem is that I would also like to extract the 'img src' URL associated with each article. For example, an article on Wired about WordPress will typically include a WordPress-related image.

    My initial solution was to write another script that later goes through my DB of article URLs, extracts the img URL, and stores it alongside the article URL I am already storing. I have run into some issues working out the best way to do that.

    I have done a bit of research on my own and seen a few ideas, such as using HTML DOM Parser to load those pages and extract the image URL, but since there may be several images on each page, I figured that could be a problem. Also, since I have about 200 URLs from roughly 8 different domains, I am not sure what the best path would be.

    I am looking for any input, such as a PHP library you could recommend for this task or any code to point me in the right direction. If you could recommend something that isn't PHP, that would be fine as well; maybe a Python solution could work. I just assumed PHP would be easiest since all of the article URLs are stored in MySQL.

    EDIT: FYI, I am currently using PHP 5.3 on an Apache web server, so there are no issues with using additional languages or libraries.

    Thanks in advance for all the help!
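Since the article URLs come from roughly 8 domains, one way to handle "which image is the right one" is to keep a small per-site lookup table and pick the rule by host name. The sketch below only shows that lookup step. The function name `selectorForUrl`, the `example.com` entry, and its XPath expression are made up for illustration; only the CNN class name comes from this thread.

```php
<?php
// Hypothetical host => XPath map. The CNN class name is the one quoted in this
// thread; the example.com entry is a made-up placeholder.
$selectors = array(
    'cnn.com'     => '//div[contains(concat(" ", normalize-space(@class), " "), " cnn_stryimg640captioned ")]//img/@src',
    'example.com' => '//div[@id="article"]//img/@src',
);

/**
 * Return the XPath expression registered for the URL's host, or null if none.
 */
function selectorForUrl($url, array $selectors)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // malformed URL or no host part
    }
    $host = strtolower($host);
    // Drop a leading "www." so www.cnn.com and cnn.com share one entry.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return isset($selectors[$host]) ? $selectors[$host] : null;
}

// Usage: look up the rule for one stored article URL (URL is illustrative).
$rule = selectorForUrl('http://www.cnn.com/2012/12/07/some-article', $selectors);
```

URLs whose host has no entry return null, so the calling script can skip them or log them for a new rule. Other subdomains are not normalised here; only a leading "www." is stripped.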
  2. #2
    Registered User

    Started using HTML DOM Parser


    I had a go at using HTML DOM Parser to get the img URL. I got it to return all of the img URLs; I just don't know how to extract only one of them, and only the correct one.

    Here is what I have so far, which isn't much.
    PHP Code:
    include 'dom_parser.php';

    // Create DOM from URL
    $html = file_get_html(' URL address blocked ');

    // Find all images and print each src
    foreach ($html->find('img') as $element)
        echo $element->src . '<br>';
    Like I said before, I have about 200 URLs in the DB that I would need to get the img src for and then store alongside the article URL. For now I just picked a URL I already have in the DB to test on.
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,466
    Rep Power
    594
    You need to determine how that image is associated with the article in question and use that to figure out which is the right one. Without seeing the HTML it will be hard to help you much more.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  6. #4
    Registered User
    Originally Posted by gw1500se
    You need to determine how that image is associated with the article in question and use that to figure out which is the right one. Without seeing the HTML it will be hard to help you much more.
    Thanks for the response. Understandable without seeing the actual page. I am trying the following:

    PHP Code:
    $html = file_get_html('URL HERE');

    $ret = $html->find('.cnn_stryimg640captioned');

    $ret = $html->find('img');
    echo $ret->src . '<br>';
    Since you cannot see the link, I will try to replicate the relevant part of the DOM.
    It is:
    Code:
     <div class="cnn_stryimg640captioned">
    <img src="url of image">
    </div>
    When I run the script above, it just returns a blank page. I'm not sure if I am navigating the DOM incorrectly or if I messed up the script.

    Thanks!
    Last edited by ityler; December 7th, 2012 at 02:20 PM. Reason: spelled classes wrong
  8. #5
    Contributing User
    Instead of using 'find' you might have better luck using 'query'.
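For reference, 'query' here most likely means `DOMXPath::query()` from PHP's built-in DOM extension. A minimal sketch of that approach, assuming the page structure posted earlier in this thread: the function name `articleImageSrc`, the stand-in markup, and the example.com image URLs are made up, and the class name is assumed to contain no quote characters because it is placed directly into the XPath string.

```php
<?php
// Stand-in markup mirroring the structure described in the thread; the image URLs are made up.
$markup = '<html><body>'
    . '<img src="http://example.com/logo.png">'
    . '<div class="cnn_stryimg640captioned">'
    . '<img src="http://example.com/article.jpg">'
    . '</div>'
    . '</body></html>';

/**
 * Return the src of the first <img> inside a div carrying the given class, or null.
 */
function articleImageSrc($markup, $class)
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word within the class attribute.
    $expr = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]//img/@src',
        $class
    );
    $nodes = $xpath->query($expr);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // The expression selects attribute nodes, so nodeValue is the src string.
    return $nodes->item(0)->nodeValue;
}

echo articleImageSrc($markup, 'cnn_stryimg640captioned'), "\n";
```

Because the expression is scoped to the div, the unrelated logo image earlier in the page is ignored, which addresses the "several images per page" concern for sites where the article image has a recognisable wrapper.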
  10. #6
    Registered User
    Originally Posted by gw1500se
    Instead of using 'find' you might have better luck using 'query'.
    Thanks again! I just took a look at query, and it does seem better suited to what I am doing. Since I have already been working the other way, here is what I have at the moment, sticking with find rather than query.

    PHP Code:
    foreach ($html->find('div.cnn_stryimg640captioned') as $element)
    {
        echo $element; // for debug purp

        $query = $element->find('img');
        echo $query->src . '<br>';
    }
    This returns the image whose src I am looking for, but the last echo line still does not display the image src. Any idea why that last line wouldn't return the src, when the first echo line can still return the full div with the image?

    Thanks!
  12. #7
    Contributing User
    I don't use 'find' (actually I can't even locate documentation on it) but I don't think it returns what you expect. All the examples I've seen imply that it updates '$element' to point to the found node. I believe 'find' returns success/failure.
    PHP Code:
    foreach ($html->find('div.cnn_stryimg640captioned') as $element)
    {
        echo $element; // for debug purp

        if ($element->find('img')) {
            echo $element->src . '<br>';
        }
        else {
            echo "image not found<br />";
        }
    }

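A note on the empty src: as far as the Simple HTML DOM documentation goes, `find()` called with only a selector returns an array of matching elements rather than a single element, and a second integer argument (for example `find('img', 0)`) selects one match. Reading `->src` on the whole collection would therefore give nothing. The same list-versus-element distinction exists in PHP's built-in DOM extension, which the sketch below uses so that it runs without a third-party include. The markup and the example.com image URL are made up.

```php
<?php
// Made-up markup with the same shape as the snippet posted earlier in the thread.
$markup = '<div class="cnn_stryimg640captioned">'
    . '<img src="http://example.com/article.jpg">'
    . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Take the first (and here only) div in the document.
$div = $doc->getElementsByTagName('div')->item(0);

// getElementsByTagName() returns a DOMNodeList (a collection), not one element,
// so there is no src to read from it directly.
$images = $div->getElementsByTagName('img');

// Pick one element out of the list before reading its attribute.
$first = $images->item(0);
$src = ($first !== null) ? $first->getAttribute('src') : null;

echo $src, "\n";
```

With Simple HTML DOM the equivalent step would be to index into the result of `find('img')`, or to pass the index as the second argument, before reading `src`.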
