Page 1 of 2
    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0

    Two Noob coding Q's on searching URL's


    I'm brand new to PHP coding (but was an old-time VB coder). I figured out how to search a URL for a string using file_get_contents. I also understand I can use DOMDocument, but that approach seems confusing to me at this early stage. What I'm looking to figure out is two-fold:

    1. I want to search a specific site and grab some content there that's date-specific. That date is listed in the site's title tag, so the title tag will say something like: <title>This weekend's stats for February 22-23, 2013 ---</title>.

    I can search for the title tag (and find it), but how do I go about actually turning just the key piece I need into a string? Essentially the "middle" piece of my string is dynamic. I'm sure regex is involved.

    2. Also on this page I need to find roughly 80 pieces of data that will all be the same except for one unknown string in the middle of 80 TD/URL/name tags (movie titles). How do I get the 80 items into an array? I guess I'll use a FOR loop or WHILE loop, but what does the code for this look like to get those lines into strings?

    Thanks. At this point I'm hoping someone would be kind enough to provide a code snippet for each of the above.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2003
    Posts
    3,572
    Rep Power
    595
    1) If you are just looking for the title part then that is simple enough with strpos. Find the starting title tag then locate the closing >. From that index find the closing title tag (</title>) and everything in between is what you want.
    2) Once you have the string you want extracted simply use $myarray[]=$mystring. That will append the string to the array as a new element. You can get the size of the array later with count.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  4. #3
  5. Mad Scientist
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Oct 2007
    Location
    North Yorkshire, UK
    Posts
    3,661
    Rep Power
    4124
    Unless I'm missing something you also need the URL of the page you are going to "scrape"?

    Have you thought about that yet?
    I said I didn't like ORM!!! <?php $this->model->update($this->request->resources[0])->set($this->request->getData())->getData('count'); ?>

    PDO vs mysql_* functions: Find a Migration Guide Here

    [ Xeneco - T'interweb Development ] - [ Are you a Help Vampire? ] - [ Read The manual! ] - [ W3 methods - GET, POST, etc ] - [ Web Design Hell ]
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0
    Yes. That's what gets passed to file_get_contents:

    $url="targeturl";

    $str=file_get_contents($url);
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0
    Originally Posted by gw1500se
    1) If you are just looking for the title part then that is simple enough with strpos. Find the starting title tag then locate the closing >. From that index find the closing title tag (</title>) and everything in between is what you want.
    As noted, I just want the dynamic date, not the rest. I need to find the <title> tag as that's where the date is reliably stored.

    2) Once you have the string you want extracted simply use $myarray[]=$mystring. That will append the string to the array as a new element. You can get the size of the array later with count.
    Can you provide the above in a code example as I'm not quite seeing this clearly?
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2003
    Posts
    3,572
    Rep Power
    595
    Once you find the title with strpos as I stated, you can extract whatever you need from the resulting string.

    Perhaps if you posted the code where you extract the string you want I can show you better.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0
    If I had code to post I wouldn't really need the help. hehe

    As I said, I'm brand new to PHP.

    What I have so far is setting up the URL string, calling file_get_contents and loading that into $str.

    I then ASSUME I need to set a regex pattern (no idea what that should look like) and then follow that with a ..... preg_match?
  14. #8
  15. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,316
    Rep Power
    7170
    I do not understand the response you received in this thread.

    This is cross posted on another board, but was not posted at the same time. The OP did not receive the help they were looking for on the other board, and so created a new thread here instead.

    Generally speaking there are three ways to tackle the type of problem you have:
    (1) Using DOMDocument
    (2) Using Regex
    (3) Using string handling functions

    I understand that DOMDocument is confusing at this point, but for the sake of completeness I will mention that it is the most correct approach to take here. However, there are situations in which DOMDocument cannot be applied (such as when you are not parsing XML or HTML) and you would need to use method (2) or (3) instead, so it is worth understanding those as well.

    What you're trying to do can be solved with a regular expression. If you feel DOMDocument is confusing, I suspect you will feel the same about regular expressions though. To extract the title, you can use preg_match. To extract the table cells, you can use preg_match_all. They function nearly the same, except that preg_match_all allows you to match multiple occurrences of the pattern.

    The difficult part of using a regex is writing the regex itself. Generally if you have an arbitrary unknown string between two known points, you can match it with the pattern: (.+?)
    For example:
    PHP Code:
    $title = "<title>abc " . rand(0,100) . "</title>";
    preg_match("#<title>abc (.+?)</title>#", $title, $match);
    print_r($match);
    For matching the tr/td you would do something similar, but without knowing the structure it's difficult to give a relevant example. Keep in mind that if the text you are trying to match with (.+?) contains newlines, you will need to pass the s pattern modifier to preg_match or preg_match_all; without it, the . character does not match newlines. The PHP manual has an entire page dedicated to pattern modifiers, which explains these in more detail.

    If you call preg_match_all, the third parameter will be filled with an array containing all of the matches found.
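
    A minimal sketch of the preg_match_all case, assuming plain <td> cells like the made-up $html below (the real page's markup will differ, so the pattern would need adjusting):

    ```php
    <?php
    // Made-up sample markup; the real table structure is an assumption.
    $html = "<tr><td>Alpha</td></tr>\n<tr><td>Beta</td></tr>";

    // (.+?) lazily captures whatever sits between the two known points.
    // The s modifier lets . match newlines in case a cell spans several lines.
    preg_match_all('#<td>(.+?)</td>#s', $html, $matches);

    // $matches[1] is the list of captured strings, one per cell.
    $cells = $matches[1];
    print_r($cells);
    ```

    Here $matches[0] holds the full matches and $matches[1] holds just the text captured by (.+?).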


    The simplest approach is to use string handling functions to extract pieces of the document. Generally you'll be using strpos and substr for this. To extract a specific piece of a document, generally you:
    (1) find the starting point
    (2) find the ending point
    (3) extract the text in between

    For example:
    PHP Code:
    $document = "abc\n<title>xyz</title>\nbca";

    $start = strpos($document, '<title>') + 7;
    $end = strpos($document, '</title>');
    $title = substr($document, $start, $end - $start);
    If you are trying to extract multiple similar values you can use a loop. The third parameter to strpos is an offset value. At the end of each loop iteration, you need to set a variable that indicates the point in the document at which the iteration ended. Then, at the start of the next iteration, you use that value as the offset. This way you do not enter an infinite loop (constantly processing the same segment of the document).

    strpos will return false if the pattern is not found. This is how you know you're at the end of the document. You can check for this using the === operator:
    PHP Code:
    if (strpos($document, $string) === FALSE) die("END");
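
    The loop described above could be sketched roughly like this. The sample $document and the <b> markers are made up for illustration; substitute whatever fixed text surrounds the values on the real page:

    ```php
    <?php
    // Made-up sample document; the markers are assumptions for illustration.
    $document = '<b>one</b> filler <b>two</b> filler <b>three</b>';
    $open  = '<b>';
    $close = '</b>';

    $found  = array();
    $offset = 0; // point in the document where the next search begins

    while (($start = strpos($document, $open, $offset)) !== false) {
        $start += strlen($open);                  // skip past the opening marker
        $end = strpos($document, $close, $start); // find the closing marker
        if ($end === false) {
            break;                                // unterminated item: stop
        }
        $found[] = substr($document, $start, $end - $start);
        $offset  = $end + strlen($close);         // resume after this item
    }

    print_r($found);
    ```

    Updating $offset at the end of each pass is what keeps the loop from finding the same item over and over.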

    Comments on this post

    • Jacques1 disagrees : So you delete all other replies, because you think only your own answer is the correct one? You're the mod of the day. And now go get your "counter-reps" from your friends.
    Last edited by E-Oreo; March 7th, 2013 at 01:29 AM.
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0
    Thank you so much for the reply. As noted, I did order the book that was mentioned (both digital and hard copy) and do acknowledge that I have a long road ahead of me. I greatly appreciate the time you took to help.

    What I plan to do now is take this information and immediately attempt to resolve the approach all three ways you mention so that I understand each of them.

    Interestingly, regex makes more sense to me than I expected. Yes, the verbiage is a bit foreign but with a legend it's easily "deciphered". hehe
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0
    On the DOMDocument approach (and this is my first test ever using it) I tried the snippet below. It's a double test to see if I can find the title piece basically and then try to load $title with the title tag.

    I commented the print command that shows me I'm seeing the whole page (and it works). However, the below code is not getting me the title tag. What am I missing?


    PHP Code:
    $url = "http://www.boxofficemojo.com/weekend/chart/";
    $urlcontents = file_get_contents($url);
    $needle = "<title>Weekend Box Office Results for ";

    if (strpos($urlcontents, $needle) == false) {
        echo "String not found";
    } else {
        echo "String found";
    }

    //print "$urlcontents";

    $dom = new DOMDocument;
    $dom->loadXML($url);
    $title = $dom->getElementsByTagName('title');

    print "$title";
  20. #11
  21. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,316
    Rep Power
    7170
    First, you should enable display errors and error reporting when you're writing code. In the FAQ, there are instructions for several methods of enabling those. When you do that, you'll see two error messages show up when you run your code:

    Code:
    Warning: DOMDocument::loadXML() [domdocument.loadxml]: Start tag expected, '<' not found in Entity, line: 1
    There are two problems here:
    (1) The document you're loading is HTML, not XML, so you need to use loadHTML instead of loadXML.
    (2) Both loadHTML and loadXML require you to pass the contents of the document, not the URL to it.

    The second error is:
    Code:
    Catchable fatal error: Object of class DOMNodeList could not be converted to string
    This occurs because getElementsByTagName returns a DOMNodeList object, not a string. A DOMNodeList object is similar to an array in that it's designed as a container for multiple elements (all of the matched tags).

    On the DOMNodeList documentation page, you can see that DOMNodeList has a method called 'item' that you can use to retrieve a matching DOMNode object (indexes start at 0).

    If you review the documentation page for DOMNode, you'll see that it has a nodeValue property, which is what will ultimately contain the string you're trying to print.

    PHP Code:
    $title->item(0)->nodeValue
    It's worth noting that DOMDocument is easily one of the most complicated and most poorly documented pieces of PHP.
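
    Putting those pieces together, a rough sketch might look like the following. The $urlcontents string is a made-up stand-in for what file_get_contents would return, so treat the markup as an assumption:

    ```php
    <?php
    // Made-up sample page standing in for the fetched contents.
    $urlcontents = '<html><head><title>Weekend Box Office Results for March 1-3, 2013 - Box Office Mojo</title></head>'
                 . '<body><p>chart goes here</p></body></html>';

    $dom = new DOMDocument;
    $dom->loadHTML($urlcontents);                  // pass the contents, not the URL

    $nodes = $dom->getElementsByTagName('title');  // a DOMNodeList, not a string
    // item(0) is the first match; guard against a page with no title tag.
    $title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

    echo $title;
    ```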
  22. #12
  23. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0
    Thanks. Lots of minor successes after your help. The main goal of the Title tag search was simply to grab the date in question.

    The date started 38 characters after the start of the title tag so now I have this:

    PHP Code:
    $url = "http://www.boxofficemojo.com/weekend/chart/";
    $urlcontents = file_get_contents($url);
    $start = strpos($urlcontents, '<title>') + 38;
    $end = strpos($urlcontents, ' - Box Office Mojo</title>');
    $title = substr($urlcontents, $start, $end - $start);
    Right now that gives me $title equal to "March 1-3, 2013".

    That's exactly what I was looking for. Of course if I posted my exact code you'd want to jump due to literally every line being heavily commented (which is what I do all along the learning curve).

    Another question for clarity on the DOM approach which, having seen it now, IS the way I want to go.

    You said loadHTML() wants the CONTENTS of the document. In this case wouldn't that be held in the string $urlcontents?

    Just adding the two lines below generates roughly 125 errors:

    PHP Code:
    $dom = new DOMDocument;
    $dom->loadHTML($urlcontents); 
  24. #13
  25. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,316
    Rep Power
    7170
    You said loadHTML() wants the CONTENTS of the document. In this case wouldn't that be held in the string $urlcontents?
    Yes

    Just adding the two lines below generates roughly 125 errors:
    That's because the HTML on the page is syntactically invalid. This is one of the few situations where it's appropriate to apply PHP's error suppression operator.
    PHP Code:
    @$dom->loadHTML($urlcontents); 
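
    If you would rather not use @, one alternative is libxml_use_internal_errors, which collects the parser's complaints instead of printing them. A sketch, using deliberately sloppy made-up markup in place of the real page:

    ```php
    <?php
    // Made-up, deliberately invalid markup standing in for the real page.
    $urlcontents = '<html><head><title>Sample</title></head>'
                 . '<body><p>unclosed paragraph<madeup>text</body></html>';

    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $dom = new DOMDocument;
    $dom->loadHTML($urlcontents);
    $errors = libxml_get_errors();                // array of LibXMLError objects
    libxml_clear_errors();                        // empty the error buffer
    libxml_use_internal_errors($previous);        // restore the earlier setting

    echo count($errors) . " parse problem(s) ignored\n";
    echo $dom->getElementsByTagName('title')->item(0)->nodeValue;
    ```

    How many problems get reported depends on the libxml version, so the count is only informational.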
  26. #14
  27. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    14
    Rep Power
    0
    Oreo,

    Hoping you might shed some light on something I'm encountering that's confusing.

    I'm looking at the source for the following URL:

    http://www.boxofficemojo.com/weekend/chart/

    Using DOM presents a small problem as the title's simple but grabbing roughly 100 table row tags is a bit more involved especially since the table isn't named. I actually suspect it'd be simpler to just grab this with a preg_match. Testing it I basically have the following code:

    PHP Code:
    //Set a new needle to find the movies for the week. 
    //This string is unique to just the entries I need.
    $needle = "<td><font size=\"2\"><a href=\"/movies/?id=";
    $test = strpos($urlcontents, $needle);
    echo $test;
    preg_match($needle, $urlcontents, $matches);
    $urlcontents is set up with a file_get_contents above that code.

    The issue is two-fold:

    1. It generates an odd error telling me:

    PHP Warning: preg_match() [<a href='function.preg-match'>function.preg-match</a>]: Unknown modifier '<' in searchurl.php on line 58

    2. The $test is returning a different number on most runs. The number is always similar but not the same.

    Any ideas?

    Thanks.
  28. #15
  29. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,316
    Rep Power
    7170
    1. It generates an odd error telling me:
    PHP Warning: preg_match() [<a href='function.preg-match'>function.preg-match</a>]: Unknown modifier '<' in searchurl.php on line 58
    The value in $needle isn't a useful regular expression. The first character of $needle is <, so that becomes the pattern delimiter. That means that the regular expression ends on the next <, which is the one right before font. The rest of the string is treated as pattern modifiers, which are mostly invalid. $needle isn't useful as a regular expression because it doesn't have any dynamic parts.
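
    As a sketch of one way to turn that needle into a working pattern: wrap it in explicit delimiters, escape the fixed text with preg_quote, and add capture groups for the dynamic parts. The sample rows below are made up, and the assumption that each link is followed by the title text and a closing </a> is mine:

    ```php
    <?php
    // Made-up sample rows; the real page's markup is an assumption.
    $urlcontents = '<td><font size="2"><a href="/movies/?id=first.htm">First Movie</a></font></td>'
                 . '<td><font size="2"><a href="/movies/?id=second.htm">Second Movie</a></font></td>';

    $needle = '<td><font size="2"><a href="/movies/?id=';

    // preg_quote escapes regex metacharacters in the fixed text, plus the # delimiter.
    // The two capture groups are the dynamic parts: the id value and the link text.
    $pattern = '#' . preg_quote($needle, '#') . '([^"]+)">(.+?)</a>#s';

    // PREG_SET_ORDER groups the results per match: $m[1] is the id, $m[2] the text.
    preg_match_all($pattern, $urlcontents, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        echo $m[1] . ' => ' . $m[2] . "\n";
    }
    ```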

    2. The $test is returning a different number on most runs. The number is always similar but not the same.
    The page probably has some dynamic content on it somewhere that changes every time you request it.