Thread: PHP Import HTML

    #1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Posts
    74
    Rep Power
    11

    PHP Import HTML


    Hi

    Hopefully somebody can help me with a problem I have.

    I have a URL for a web page, and I want to print a certain part of the page that is surrounded by particular comment marks.

    ie

    blah blah blah blah blah blah blah
    <!-- BEGIN: Module - Main Article--> <p>
    An Israeli cabinet minister resigned this morning and called for Prime
    Minister Ehud Olmert to follow suit after a damn

    <!-- end -->
    blah blah blah blah blah blah

    What I want a PHP script to do is return just the text between the begin and end comments. I've tried using the import XML function, but this page is HTML, not valid XML. I've Googled importing a URL into a variable but am not having much luck. Can anyone please guide me in the right direction?

    Many thanks
  2. #2
  3. Prisoner of the Sun

    Join Date
    Jul 2004
    Location
    The Mews At Windsor Heights
    Posts
    5,309
    Rep Power
    2351
    Use [PHPNET="curl"]cURL[/PHPNET] or [PHPNET="fopen"]fopen()[/PHPNET] to get the page.

    Then use functions like [PHPNET="explode"]explode()[/PHPNET], [PHPNET="preg_match"]preg_match()[/PHPNET], [PHPNET="preg_replace"]preg_replace()[/PHPNET], etc to parse the data you want out of the file.
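
    As a rough sketch of the preg_match() route: the helper name extract_between_markers() and the sample page string below are made up for illustration, and the page is assumed to have been fetched into a string already (with cURL, fopen() or similar).

```php
<?php
// Hypothetical helper: returns the text between two marker strings, or null.
function extract_between_markers(string $html, string $begin, string $end): ?string
{
    // preg_quote() escapes regex metacharacters in the literal markers.
    // The "s" modifier lets "." match newlines; ".*?" stops at the first end marker.
    $pattern = '/' . preg_quote($begin, '/') . '(.*?)' . preg_quote($end, '/') . '/s';

    if (preg_match($pattern, $html, $matches) === 1) {
        return trim($matches[1]);
    }
    return null; // one of the markers was not found
}

// Sample input standing in for a fetched page (invented for this example).
$page = "blah blah\n<!-- BEGIN: Module - Main Article--> <p>\nArticle text here\n\n<!-- end -->\nblah blah";

echo extract_between_markers($page, '<!-- BEGIN: Module - Main Article-->', '<!-- end -->');
```

    The markers are matched literally, so a change in spacing inside the comment on the real page would make the function return null.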
    .
    :: My blip.fm tunes :: Web Design Feeds :: Web Dev Feeds :: CheatSheets :: PHP :: MySQL :: 13 Moon FB App.

    "All matter is merely energy condensed to a slow vibration. We are all one consciousness experiencing itself - subjectively. There is no such thing as death, life is only a dream. We are the imaginations of ourselves."
    - Bill Hicks


    "Truth is hidden in the subtle nature of the heart of everything, although it is invisible. One cannot see it from inside and neither from the surface. One can only live and experience it."
    - Heart Sutra
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Posts
    74
    Rep Power
    11
    Thanks for that b3n!

    I have been looking at getting this file to work, but get this error:
    Code:
    <?php
    function binary_search_in_file($filename, $search) {
    
        //Open the file
        $fp = fopen($filename, 'r');
    
        //Seek to the end
        fseek($fp, 0, SEEK_END);
    
        //Get the max value
        $high = ftell($fp);
        
        //Set the low value
        $low = 0;
    
        while ($low <= $high) { 
            $mid = floor(($low + $high) / 2);  // C floors for you
    
            //Seek to half way through
            fseek($fp, $mid);
    
            if($mid != 0){
                //Read a line to move to eol
                $line = fgets($fp);
            }
            
            //Read a line to get data
            $line = fgets($fp);
            
    
            if ($line == $search) {
                fclose($fp);
                return $line;
            }
            else {
                if ($search < $line) {
                    $high = $mid - 1;
                }
                else {
                    $low = $mid + 1;
                }
            }
        }
    
        //Close the pointer
        fclose($fp);
    
        return FALSE;
    
    } 
    $url="http://www.timesonline.co.uk/tol/news/world/middle_east/article1731131.ece";
    $strr="BEGIN: Module - Main Article";
    $value = binary_search_in_file($url, $strr);
    ?>
    Warning: fseek() [function.fseek]: stream does not support seeking in C:\localhost\content.php on line 8

    Warning: fseek() [function.fseek]: stream does not support seeking in C:\localhost\content.php on line 20

    I am not sure what alternative I can use, are there any tutorials on this kind of function anywhere?
  6. #4
  7. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2006
    Location
    India
    Posts
    857
    Rep Power
    548
    You can't use fseek() like this; an HTTP stream does not support seeking.

    1. Read all the content at once [ use [PHPNET="file_get_contents"]file_get_contents()[/PHPNET] ].
    2. Search the required string from the string retrieved.
    3. Return true/false accordingly.
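
    The three steps above could look roughly like this. The helper name content_contains() is hypothetical, and the sample string stands in for what file_get_contents() would return (fetching a URL that way also assumes allow_url_fopen is enabled).

```php
<?php
// Hypothetical helper: true if $needle occurs anywhere in $content.
function content_contains(string $content, string $needle): bool
{
    // strpos() returns an integer offset (possibly 0) or false,
    // so the comparison has to be strict.
    return strpos($content, $needle) !== false;
}

// In real use the content would come from something like:
//   $content = file_get_contents($url);   // returns false on failure
$content = "header\n<!-- BEGIN: Module - Main Article-->\nbody";

var_dump(content_contains($content, 'BEGIN: Module - Main Article'));
var_dump(content_contains($content, 'no such marker'));
```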
    Akash Dwivedi
    "Whatever the mind can conceive and believe, the mind can achieve."
    Feel good..


  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Posts
    74
    Rep Power
    11
    Thanks for that, how do I return all the content between certain tags?
  10. #6
  11. Prisoner of the Sun

    Join Date
    Jul 2004
    Location
    The Mews At Windsor Heights
    Posts
    5,309
    Rep Power
    2351
    Parsing can be confusing but it's not that tricky. And if the content on the page you're scraping changes, it may well break your parser - that's just the way it goes.

    It's usually a good start to do this:
    PHP Code:
    $lines = explode("\n", $page_content); 
    Then you should have an array of all the lines in the page.
    If you want the contents of a table you can use:
    PHP Code:
    explode('<table>', $page_content); 
    It depends how much content is between the tags, but you can keep on doing that sort of thing until you get down to the data you need.
    You could do the same kind of thing with [PHPNET="preg_replace"]preg_replace()[/PHPNET], which might take less code but will be slower.
    It all depends on the page content/structure.
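
    To make the "keep on doing that" idea concrete, here is a small sketch that narrows a page down with two explode() calls. The $page_content value is invented for the example, and a literal '<table>' will not match a tag that carries attributes.

```php
<?php
// Made-up page content for illustration.
$page_content = "<html><body><p>intro</p><table><tr><td>42</td></tr></table><p>footer</p></body></html>";

// Step 1: split on the opening tag; index 1 holds everything after it.
$parts = explode('<table>', $page_content);
$after_open = isset($parts[1]) ? $parts[1] : '';

// Step 2: split that on the closing tag; index 0 is everything before it.
$parts = explode('</table>', $after_open);
$table_inner = $parts[0];

echo $table_inner;
```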

    There's a really good function in the first example (example 1536) of preg_replace() in the PHP manual. It strips out all the structure and leaves you with just the data (you might need to modify it slightly).
    Last edited by b3n; May 2nd, 2007 at 07:56 AM.
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Posts
    74
    Rep Power
    11
    Hi B3n,

    I cannot find the example you are referring to.

    The first example here http://uk.php.net/preg_replace does not relate to the problem I am having? Please can you show me where it is.

    Thanks


    #2
    What I ideally want to do is apply a file to a variable. Then find the "bytes" position where it matches $string1. Then find the "bytes" position where it matches $string2.

    I suppose I could then use

    Code:
    <?
    $content=file_get_contents("http://www.google.com",FALSE,NULL,$position1,$position2);
    echo $content;
    ?>
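
    One way to sketch the "find both positions, then cut" idea is with strpos() and substr() on a string that has already been downloaded, rather than passing positions to file_get_contents() (its fourth and fifth arguments are a byte offset and a maximum length, and the manual notes that the offset is not supported for remote files). The helper name slice_between() and the sample string are assumptions made for this example.

```php
<?php
// Hypothetical helper: returns the text between $string1 and $string2, or false.
function slice_between(string $content, string $string1, string $string2)
{
    $position1 = strpos($content, $string1);
    if ($position1 === false) {
        return false; // first marker not found
    }
    $position1 += strlen($string1); // start just after the first marker

    // Look for the second marker only after the first one.
    $position2 = strpos($content, $string2, $position1);
    if ($position2 === false) {
        return false; // second marker not found
    }

    // substr() takes a start offset and a length, not an end position.
    return substr($content, $position1, $position2 - $position1);
}

$content = "aaa<!-- BEGIN -->wanted text<!-- end -->bbb";
echo slice_between($content, '<!-- BEGIN -->', '<!-- end -->');
```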
    Last edited by gurmukh; May 2nd, 2007 at 09:59 AM.
  14. #8
  15. Prisoner of the Sun

    Join Date
    Jul 2004
    Location
    The Mews At Windsor Heights
    Posts
    5,309
    Rep Power
    2351
    Sorry, the example I meant isn't on that page anymore! But there is a better-looking one in the user notes, tsk tsk

    Anyway here they both are. The more recent one looks good but I've never used it:
    PHP Code:
    <?php
    /**
    * strip_selected_tags ( string str [, string strip_tags[, strip_content flag]] )
    * ---------------------------------------------------------------------
    * Like strip_tags() but inverse; the strip_tags tags will be stripped, not kept.
    * strip_tags: string with tags to strip, ex: "<a><p><quote>" etc.
    * strip_content flag: TRUE will also strip everything between open and closed tag
    */
    function strip_selected_tags($str, $tags = "", $stripContent = false)
    {
        preg_match_all("/<([^>]+)>/i", $tags, $allTags, PREG_PATTERN_ORDER);
        foreach ($allTags[1] as $tag) {
            $replace = "%(<$tag.*?>)(.*?)(<\/$tag.*?>)%is";
            if ($stripContent) {
                $str = preg_replace($replace, '', $str);
            }
            $str = preg_replace($replace, '${2}', $str);
        }
        return $str;
    }
    ?>
    With this next one, you only really need the first few patterns, and you can delete the rest.
    PHP Code:
    <?php
    // $document should contain an HTML document.
    // This will remove HTML tags, javascript sections
    // and white space. It will also convert some
    // common HTML entities to their text equivalent.
    $search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                     '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                     '@([\r\n])[\s]+@',                // Strip out white space
                     '@&(quot|#34);@i',                // Replace HTML entities
                     '@&(amp|#38);@i',
                     '@&(lt|#60);@i',
                     '@&(gt|#62);@i',
                     '@&(nbsp|#160);@i',
                     '@&(iexcl|#161);@i',
                     '@&(cent|#162);@i',
                     '@&(pound|#163);@i',
                     '@&(copy|#169);@i',
                     '@&#(\d+);@e');                   // evaluate as php (the /e modifier was removed in PHP 7; use preg_replace_callback() there)

    $replace = array ('',
                      '',
                      '\1',
                      '"',
                      '&',
                      '<',
                      '>',
                      ' ',
                      chr(161),
                      chr(162),
                      chr(163),
                      chr(169),
                      'chr(\1)');

    $text = preg_replace($search, $replace, $document);
    ?>
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Posts
    74
    Rep Power
    11
    Thanks again for your reply, but I don't want to strip tags; I simply want to extract the content between two comment lines.

    I want to print only what is between

    "<!-- BEGIN: Module - Main Article-->"

    ...
    print this content
    ...


    "<!-- end -->"

    I am so stuck, I have seen this code

    <?
    $content=file_get_contents("http://www.google.com",FALSE,NULL,$position1,$position2);
    echo $content;
    ?>

    Which prints all characters between $position1 and $position2. I think what I need to do next is find out a way to find a match for "<!-- BEGIN: Module - Main Article-->" and assign that to a variable. Do you know what command I could use for that?

    Many thanks for your invaluable help!
  18. #10
  19. Prisoner of the Sun

    Join Date
    Jul 2004
    Location
    The Mews At Windsor Heights
    Posts
    5,309
    Rep Power
    2351
    What I was saying is that you sometimes have to gradually keep stripping things out until you are left with the bit you need.

    The fourth and fifth arguments to file_get_contents() are a byte offset and a maximum length, not strings to match, and the offset isn't supported for remote files, so I don't think that's a reliable way to get the data.

    If that comment only appears once then you can probably use [PHPNET="strstr"]strstr()[/PHPNET]
    PHP Code:
    <?php
    $content = file_get_contents('http://www.google.com', FALSE);
    $start = '<!-- BEGIN: Module - Main Article-->';
    $data_bit = strstr($content, $start);

    // now you need to strip off everything after "<!-- end -->"

    $array = explode('<!-- end -->', $data_bit);

    // The first comment will still be there so strip it off (36 chars)
    $data = substr($array[0], strlen($start));

    echo $data;
    ?>
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Posts
    74
    Rep Power
    11
    Thanks for that! I had to slightly tweak it

    Code:
    <?php 
    $content = file_get_contents('http://www.timesonline.co.uk/tol/news/world/middle_east/article1731131.ece', FALSE); 
    $start = '<!--  BEGIN: Module - Main Article -->'; 
    $data_bit = strstr($content, $start); 
    
    // now  you need to strip off everything after "<!-- end -->" 
    
    $array = explode('<!--#include file="m63-article-related-attachements.html"-->', $data_bit); 
    
    // The first comment will still be there so strip it off (36 chars) 
    $data = substr($array[0],245); 
    
    echo $data; 
    ?>
  22. #12
  23. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2006
    Location
    India
    Posts
    857
    Rep Power
    548
    use PHP tags instead of CODE for PHP code
    OR use language syntax highlighter


