#1

    PHP Scrape With Preg_Match


    I'm trying to scrape HTML, but I can't figure out how to combine preg_match with a foreach or other loop. For example, I would like the PHP code to scan each class=paragraph_style_2 element for the price after the '$' and the text before the price, and build two separate arrays: one for prices and another for the types (i.e. Walk-in, 10 Class Pack, 20 Class Pack). I appreciate it so much! Even if you just point me in the right direction, I will try to figure it out.

    Here is the html:

    Code:
    <p class="paragraph_style_2"><br /></p>
    <p class="paragraph_style_2">Walk-in $18<br /></p>
    <p class="paragraph_style_2">10 Class Pack $160<br /></p>
    <p class="paragraph_style_2">20 Class Pack $300<br /></p>
    <p class="paragraph_style_2">Monthly Unlimited $149<br /></p>
    I tried using simple_html_dom.php from the Internet, but it's not separating the price from price type, and I think the code should be a lot simpler, perhaps without even using simple_html_dom.

    Here is what I have so far:

    PHP Code:
    <?php
    include('php/simple_html_dom.php');

    function scraping_even() {
        // create HTML DOM
        $html = file_get_html('(some URL)');

        foreach ($html->find('.graphic_textbox_layout_style_default') as $article) {
            // get price and price type
            $item['p']  = trim($article->find('.paragraph_style_2', 2)->plaintext);
            $item['p1'] = trim($article->find('.paragraph_style_2', 3)->plaintext);
            $item['p2'] = trim($article->find('.paragraph_style_2', 4)->plaintext);
            $item['p3'] = trim($article->find('.paragraph_style_2', 5)->plaintext);

            $ret[] = $item;
        }

        // clean up memory
        $html->clear();
        unset($html);

        return $ret;
    }

    // test it

    // check user_agent header...
    ini_set('user_agent', 'My-Application/2.5');
    ini_set('display_errors', 1);
    error_reporting(E_ALL);

    $ret = scraping_even();

    foreach ($ret as $v) {
        echo $v['p'] . '<br>';
        echo $v['p1'] . '<br>';
        echo $v['p2'] . '<br>';
        echo $v['p3'] . '<br>';
    }
    ?>

    The result is:

    Walk-in¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ $18
    10 Class Pack¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ $160
    20 Class Pack¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ $300
    Monthly Unlimited¬ ¬ ¬ ¬ $149

    which does not remove the spaces before the $, and does not create two arrays.
    Last edited by requinix; January 24th, 2013 at 07:35 PM. Reason: added a missing quote
#2
    It is not completely clear to me what you are trying to do, but I think you want to use split to extract your data. Something like:
    PHP Code:
    $str = split('\$', $article->find('.paragraph_style_2', 2)->plaintext);
    $item['p'] = $str[1];
#3
    Originally Posted by gw1500se
    I think you want to use split to extract your data.
    split() is deprecated and treats its pattern as a regular expression. explode(), like you linked to, is more appropriate for a plain delimiter such as '$'.
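
    As a rough illustration of the explode() suggestion, here is a minimal sketch that splits each line once on the literal '$' and collects the two halves into separate arrays. The sample strings stand in for the ->plaintext values from the first post, and the helper name strip_padding is hypothetical. It also assumes the padding shown in the posted result is made of non-breaking spaces (UTF-8 bytes or &nbsp; entities), which the thread does not confirm.

```php
<?php
// Hypothetical helper: turn non-breaking spaces into plain spaces, then trim.
// Assumes the padding is UTF-8 NBSP bytes or literal &nbsp; entities.
function strip_padding($text) {
    return trim(str_replace(array("\xC2\xA0", '&nbsp;'), ' ', $text));
}

// Sample values standing in for the ->plaintext strings (padding is assumed).
$lines = array(
    "Walk-in\xC2\xA0 \xC2\xA0 \$18",
    '10 Class Pack $160',
    '20 Class Pack $300',
    'Monthly Unlimited $149',
);

$types  = array();
$prices = array();

foreach ($lines as $line) {
    // Split once on the literal dollar sign: [text before, text after].
    $parts = explode('$', $line, 2);
    if (count($parts) < 2) {
        continue; // no price on this line, e.g. the empty first paragraph
    }
    $types[]  = strip_padding($parts[0]);
    $prices[] = strip_padding($parts[1]);
}

print_r($types);
print_r($prices);
```

    Passing 2 as the limit to explode() keeps everything after the first '$' together, so a line is never split into more than a type and a price.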

    Comments on this post

    • gw1500se agrees : Sorry, I know. I've been doing a lot of Perl programming lately and typed split out of habit. IMO, split should have been left alone.
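
    Since the original question asked about preg_match specifically, here is one possible sketch using preg_match_all directly on the markup from the first post, without simple_html_dom. The pattern is an assumption written for that exact sample: it expects ordinary spaces before the '$', so a page that pads with non-breaking spaces would need those replaced first.

```php
<?php
// Sample markup copied from the first post.
$html = '<p class="paragraph_style_2"><br /></p> '
      . '<p class="paragraph_style_2">Walk-in $18<br /></p> '
      . '<p class="paragraph_style_2">10 Class Pack $160<br /></p> '
      . '<p class="paragraph_style_2">20 Class Pack $300<br /></p> '
      . '<p class="paragraph_style_2">Monthly Unlimited $149<br /></p>';

// Group 1: label text before the "$" (no tags, no dollar signs).
// Group 2: the digits after the "$", with optional cents.
$pattern = '~<p class="paragraph_style_2">\s*([^<$]+?)\s*\$(\d+(?:\.\d{2})?)~';

// preg_match_all collects every match, so no explicit loop is needed.
preg_match_all($pattern, $html, $matches);

$types  = $matches[1];
$prices = $matches[2];

print_r($types);
print_r($prices);
```

    The empty first paragraph is skipped because the label group requires at least one character that is not '<'. The prices come back as strings, so cast them with (int) or (float) if arithmetic is needed.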
