1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2013

    PHP Scrape With Preg_Match

    I'm trying to scrape HTML, but I can't figure out how to combine preg_match with a foreach loop. For example, I would like the PHP code to scan each class="paragraph_style_2" paragraph for the price after the '$' and for the text before the price, and build two separate arrays: one for the prices and another for the types, i.e. Walk-in, 10 Class Pack, 20 Class Pack. I'd appreciate any help! Even if you just point me in the right direction, I'll try to figure it out.

    Here is the html:

    <p class="paragraph_style_2"><br /></p> <p class="paragraph_style_2">Walk-in $18<br /></p> <p class="paragraph_style_2">10 Class Pack $160<br /></p> <p class="paragraph_style_2">20 Class Pack $300<br /></p> <p class="paragraph_style_2">Monthly Unlimited $149<br /></p>
    I tried using simple_html_dom.php from the Internet, but it's not separating the price from the price type, and I think the code should be a lot simpler, perhaps without even using simple_html_dom.

    Here is what I have so far:

    PHP Code:
    <?php
    include('php/simple_html_dom.php');

    function scraping_even() {
        // create HTML DOM
        $html = file_get_html('(some URL)');

        $ret = array();
        foreach ($html->find('.graphic_textbox_layout_style_default') as $article) {
            // get price and price type (each paragraph's text still holds both)
            $item['p']  = trim($article->find('.paragraph_style_2', 2)->plaintext);
            $item['p1'] = trim($article->find('.paragraph_style_2', 3)->plaintext);
            $item['p2'] = trim($article->find('.paragraph_style_2', 4)->plaintext);
            $item['p3'] = trim($article->find('.paragraph_style_2', 5)->plaintext);

            $ret[] = $item;
        }

        // clean up memory
        $html->clear();
        unset($html);

        return $ret;
    }

    // test it
    // check user_agent header...
    $ret = scraping_even();

    foreach ($ret as $v) {
        // print each paragraph's text on its own line
        echo implode('<br>', $v) . '<br>';
    }

    The result is:

    Walk-in¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ $18
    10 Class Pack¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ $160
    20 Class Pack¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ $300
    Monthly Unlimited¬ ¬ ¬ ¬ $149

    which does not remove the spaces before the $, and does not create two arrays.
    Last edited by requinix; January 24th, 2013 at 08:35 PM. Reason: added a missing quote
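
For the markup quoted in the first post, one possible way to get the two arrays without simple_html_dom is a single preg_match_all() call. This is only a minimal sketch: it assumes every price is a plain number right after the '$', that the padding between the type and the price is ordinary whitespace or non-breaking spaces, and that the page is valid UTF-8 (the /u flag needs that). The variable names ($pattern, $types, $prices) are made up for illustration.

```php
<?php
// Sample markup copied from the first post.
$html = '<p class="paragraph_style_2"><br /></p> <p class="paragraph_style_2">Walk-in $18<br /></p> <p class="paragraph_style_2">10 Class Pack $160<br /></p> <p class="paragraph_style_2">20 Class Pack $300<br /></p> <p class="paragraph_style_2">Monthly Unlimited $149<br /></p>';

// Group 1: the text before the "$" (no tags, no "$").
// Group 2: the number after the "$".
// [\s\x{00A0}]* swallows spaces and non-breaking spaces in front of the "$".
$pattern = '~<p class="paragraph_style_2">\s*([^<$]+?)[\s\x{00A0}]*\$(\d+(?:\.\d+)?)~u';

$types  = array();
$prices = array();
if (preg_match_all($pattern, $html, $matches)) {
    $types  = array_map('trim', $matches[1]); // price types
    $prices = $matches[2];                    // prices, as strings
}

print_r($types);
print_r($prices);
```

With the sample markup, $types holds Walk-in, 10 Class Pack, 20 Class Pack and Monthly Unlimited, and $prices holds 18, 160, 300 and 149. The empty first paragraph is skipped because it contains no text before its `<br />`. If the real page pads with some other character, the character class in front of `\$` would need to be adjusted.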
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jul 2003
    It is not completely clear to me what you are trying to do, but I think you want to use split to extract your data. Something like:
    PHP Code:
  4. #3
  5. Backwards Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Washington, USA
    Originally Posted by gw1500se
    I think you want to use split to extract your data.
    split() is deprecated and supports simple regular expressions. explode(), like you linked to, is more appropriate.
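
To illustrate the explode() suggestion, here is a rough sketch that splits each line of text on the '$'. The $lines array is a hypothetical stand-in for the plaintext values pulled out of the page (for example by simple_html_dom); it is not code from the thread.

```php
<?php
// Hypothetical input: one string per paragraph, as plain text.
$lines = array(
    '',
    'Walk-in $18',
    '10 Class Pack $160',
    '20 Class Pack $300',
    'Monthly Unlimited $149',
);

$types  = array();
$prices = array();
foreach ($lines as $line) {
    // Split on the first "$" only: left part is the type, right part the price.
    $parts = explode('$', $line, 2);
    if (count($parts) !== 2) {
        continue; // no "$" here, e.g. the empty first paragraph
    }
    $types[]  = trim($parts[0]);
    $prices[] = trim($parts[1]);
}

print_r($types);
print_r($prices);
```

Note that trim() with its default character list strips spaces, tabs and newlines but not non-breaking spaces, so if the padding in front of the '$' turns out to be something else, pass that character in trim()'s second argument. If a regular expression really is needed for the split, preg_split() is the non-deprecated counterpart of split().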

    Comments on this post

    • gw1500se agrees : Sorry, I know. I've been doing a lot of Perl programming lately and typed split out of habit. IMO, split should have been left alone.
