PHP-General - Two Noob coding Q's on searching URL's
Posts: 14
Time spent in forums: 2 h 53 m 9 sec
Reputation Power: 0
I'm brand new to PHP coding (but was an old-time VB coder). I figured out how to search a URL for a string using file_get_contents. I also understand I can use DOMDocument but that approach seems confusing to me at this early stage. What I'm looking to figure out is two-fold:
1. I want to search a specific site and grab some content there that's date-specific. That date is listed in the site's Title tag so the title tag will say something like: <title>This weekend's stats for February 22-23, 2013 ---</title>.
I can search for the title tag (and find it) but how do I go about actually turning just the key piece I need into a string? Essentially the "middle" piece of my string is dynamic. I'm sure regex is involved.
2. Also on this page I need to find roughly 80 pieces of data that will be all the same except for one unknown string in the middle of 80 TD/URL/name tags (movie titles). How do I get the 80 items into an array? I guess I'll use a FOR loop or WHILE loop but what does the code for this look like to get those lines into strings?
Thanks. At this point I'm hoping someone would be kind enough to provide a code snippet for each of the above.
Posts: 2,907
Time spent in forums: 1 Year 1 Month 1 Day 22 h 53 m 3 sec
Reputation Power: 581
1) If you are just looking for the title part then that is simple enough with strpos. Find the starting title tag then locate the closing >. From that index find the closing title tag (</title>) and everything in between is what you want.
2) Once you have the string you want extracted simply use $myarray[]=$mystring. That will append the string to the array as a new element. You can get the size of the array later with count.
__________________
There are 10 kinds of people in the world. Those that understand binary and those that don't.
Posts: 14
Time spent in forums: 2 h 53 m 9 sec
Reputation Power: 0
Quote:
Originally Posted by gw1500se
1) If you are just looking for the title part then that is simple enough with strpos. Find the starting title tag then locate the closing >. From that index find the closing title tag (</title>) and everything in between is what you want.
As noted, I just want the dynamic date, not the rest. I need to find the <title> tag as that's where the date is reliably stored.
Quote:
2) Once you have the string you want extracted simply use $myarray[]=$mystring. That will append the string to the array as a new element. You can get the size of the array later with count.
Can you provide the above in a code example as I'm not quite seeing this clearly?
Posts: 8,063
Time spent in forums: 2 Months 1 Day 6 h 49 m 31 sec
Reputation Power: 7104
I do not understand the response you received in this thread.
This is cross posted on another board, but was not posted at the same time. The OP did not receive the help they were looking for on the other board, and so created a new thread here instead.
Generally speaking there are three ways to tackle the type of problem you have:
(1) Using DOMDocument
(2) Using Regex
(3) Using string handling functions
I understand that DOMDocument is confusing at this point, but for the sake of completeness I will mention that it is the most correct approach to take here. However, there are situations in which DOMDocument cannot be applied (such as when you are not parsing XML or HTML) and you would need to use method (2) or (3) instead, so it is worth understanding those as well.
What you're trying to do can be solved with a regular expression. If you feel DOMDocument is confusing, I suspect you will feel the same about regular expressions though. To extract the title, you can use preg_match. To extract the table cells, you can use preg_match_all. They function nearly the same, except that preg_match_all allows you to match multiple occurrences of the pattern.
The difficult part of using a regex is writing the regex itself. Generally if you have an arbitrary unknown string between two known points, you can match it with the pattern: (.+?)
For example:
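A minimal sketch for the title, assuming the title always has the wording quoted in the first post. The sample $html string is made up for illustration; in practice it would be whatever file_get_contents returned.

```php
<?php
// Hypothetical sample; in practice $html would come from file_get_contents($url).
$html = "<html><head><title>This weekend's stats for February 22-23, 2013 ---</title></head></html>";

// Known text on both sides, with (.+?) capturing the unknown date in between.
// The s modifier lets . match newlines in case the title is wrapped.
$pattern = '/<title>This weekend\'s stats for (.+?) ---<\/title>/s';

$date = null;
if (preg_match($pattern, $html, $matches)) {
    // $matches[0] is the whole match, $matches[1] is the first capture group.
    $date = trim($matches[1]);
}

echo $date, "\n";
```

If the wording around the date ever changes, the pattern simply will not match and $date stays null, so it is worth checking for that before using it.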
For matching the tr/td you would do something similar, but without knowing the structure it's difficult to give a relevant example. Keep in mind that if the text you are trying to match with (.+?) contains newlines you will need to pass the s pattern modifier to preg_match or preg_match_all. This modifier is required for the . character to match newlines. The PHP manual has an entire page dedicated to pattern modifiers that explains these in more detail.
If you call preg_match_all, the third parameter will be filled with an array containing all of the matches found.
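As a rough illustration of how that third parameter is laid out, here is a sketch against made-up markup. The class="name" cells are an assumption for the example and not the structure of any real page.

```php
<?php
// Made-up markup standing in for the real page.
$html = <<<HTML
<table>
<tr><td class="name">Alpha</td></tr>
<tr><td class="name">Beta
Two</td></tr>
<tr><td class="name">Gamma</td></tr>
</table>
HTML;

// The s modifier lets . match the newline inside the second cell.
$count = preg_match_all('/<td class="name">(.+?)<\/td>/s', $html, $matches);

// With the default flags, $matches[0] holds the full matches and
// $matches[1] holds the text captured by the first (.+?) group.
$names = $matches[1];

print_r($names);
```

The return value ($count here) is the number of full matches, which is handy for a quick sanity check against the number of rows you expect.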
The simplest approach is to use string handling functions to extract pieces of the document. You'll mostly be using strpos and substr for this. To extract a specific piece of a document, you generally:
(1) find the starting point
(2) find the ending point
(3) extract the text in between
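The three steps above can be sketched like this for the title tag. The sample $html string is made up; the real one would come from file_get_contents.

```php
<?php
// Hypothetical sample; in practice $html would come from file_get_contents($url).
$html = "<html><head><title>This weekend's stats for February 22-23, 2013 ---</title></head></html>";

$open  = '<title>';
$close = '</title>';

$title = null;

// (1) find the starting point
$start = strpos($html, $open);
if ($start !== false) {
    // Skip past the opening tag itself so it is not part of the result.
    $start += strlen($open);

    // (2) find the ending point, searching from $start onward
    $end = strpos($html, $close, $start);
    if ($end !== false) {
        // (3) extract the text in between
        $title = substr($html, $start, $end - $start);
    }
}

echo $title, "\n";
```

This gives the whole title; narrowing it down to just the date would take a second pass with the same find-start, find-end, substr pattern.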
If you are trying to extract multiple similar values you can use a loop. The third parameter to strpos is an offset value. At the end of each loop iteration, you need to set a variable that indicates the point in the document at which the iteration ended. Then, at the start of the next iteration, you use that value as the offset. This way you do not enter an infinite loop (constantly processing the same segment of the document).
strpos will return false if the pattern is not found. This is how you know you're at the end of the document. You can check for this using the === operator:
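A minimal sketch of such a loop, assuming made-up markup and marker strings ($startMarker and $endMarker are hypothetical and would need to match the real page):

```php
<?php
// Made-up markup; the real page's structure is an assumption here.
$html = '<td><a href="/movies/?id=a.htm">Movie A</a></td>'
      . '<td><a href="/movies/?id=b.htm">Movie B</a></td>'
      . '<td><a href="/movies/?id=c.htm">Movie C</a></td>';

$startMarker = '.htm">';  // hypothetical text just before each title
$endMarker   = '</a>';    // hypothetical text just after each title

$titles = array();
$offset = 0;

while (true) {
    // Search from where the previous iteration stopped.
    $start = strpos($html, $startMarker, $offset);

    // === is required: a match at position 0 would be == false but not === false.
    if ($start === false) {
        break; // no more matches, end of document
    }
    $start += strlen($startMarker);

    $end = strpos($html, $endMarker, $start);
    if ($end === false) {
        break; // opening marker without a closing one, stop here
    }

    // Append the extracted piece as a new array element.
    $titles[] = substr($html, $start, $end - $start);

    // Remember where this iteration ended so the next one moves forward.
    $offset = $end + strlen($endMarker);
}

print_r($titles);
```

count($titles) then tells you how many items were collected.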
Posts: 14
Time spent in forums: 2 h 53 m 9 sec
Reputation Power: 0
Thank you so much for the reply. As noted, I did order the book that was mentioned (both digital and hard copy) and do acknowledge that I have a long road ahead of me. I greatly appreciate the time you took to help.
What I plan to do now is take this information and immediately attempt to resolve the approach all three ways you mention so that I understand each of them.
Interestingly, regex makes more sense to me than I expected. Yes, the verbiage is a bit foreign but with a legend it's easily "deciphered". hehe
Posts: 14
Time spent in forums: 2 h 53 m 9 sec
Reputation Power: 0
On the DOMDocument approach (and this is my first test ever using it) I tried the snippet below. It's a double test to see if I can find the title piece basically and then try to load $title with the title tag.
I commented the print command that shows me I'm seeing the whole page (and it works). However, the below code is not getting me the title tag. What am I missing?
Posts: 8,063
Time spent in forums: 2 Months 1 Day 6 h 49 m 31 sec
Reputation Power: 7104
First, you should enable display errors and error reporting when you're writing code. In the FAQ, there are instructions for several methods of enabling those. When you do that, you'll see two error messages show up when you run your code:
Code:
Warning: DOMDocument::loadXML() [domdocument.loadxml]: Start tag expected, '<' not found in Entity, line: 1
There are two problems here:
(1) The document you're loading is HTML, not XML, so you need to use loadHTML instead of loadXML.
(2) Both loadHTML and loadXML require you to pass the contents of the document, not the URL to it.
The second error is:
Code:
Catchable fatal error: Object of class DOMNodeList could not be converted to string
This occurs because getElementsByTagName returns a DOMNodeList object, not a string. A DOMNodeList object is similar to an array in that it's designed as a container for multiple elements (all of the matched tags).
On the DOMNodeList documentation page, you can see that DOMNodeList has a method called 'item' that you can use to retrieve a matching DOMNode object (indexes start at 0).
If you review the documentation page for DOMNode, you'll see that it has a nodeValue property, which is what will ultimately contain the string you're trying to print.
PHP Code:
$title->item(0)->nodeValue;
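Putting those pieces together, a minimal sketch might look like the following. The sample $urlcontents string is made up so the snippet runs on its own; in practice it would hold the result of file_get_contents.

```php
<?php
// Hypothetical sample; in practice: $urlcontents = file_get_contents($url);
$urlcontents = '<html><head><title>Weekend stats for March 1-3, 2013</title></head>'
             . '<body><p>Hi</p></body></html>';

$dom = new DOMDocument;

// loadHTML takes the document contents, not the URL.
$dom->loadHTML($urlcontents);

// getElementsByTagName returns a DOMNodeList, not a string.
$nodes = $dom->getElementsByTagName('title');

$title = null;
if ($nodes->length > 0) {
    // item(0) is the first matching DOMNode; nodeValue is its text.
    $title = $nodes->item(0)->nodeValue;
}

echo $title, "\n";
```

Checking $nodes->length first avoids calling nodeValue on null when the page has no title tag.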
It's worth noting that DOMDocument is easily one of the most complicated and most poorly documented pieces of PHP.
Right now that gives me $title being equal "March 1-3, 2013".
That's exactly what I was looking for. Of course if I posted my exact code you'd want to jump due to literally every line being heavily commented (which is what I do all along the learning curve).
Another question for clarity on the DOM approach which, having seen it now, IS the way I want to go.
You said loadHTML() wants the CONTENTS of the document. In this case wouldn't that be held in the string $urlcontents?
Just adding the two lines below generates roughly 125 errors:
PHP Code:
$dom = new DOMDocument;
$dom->loadHTML($urlcontents);
Posts: 8,063
Time spent in forums: 2 Months 1 Day 6 h 49 m 31 sec
Reputation Power: 7104
Quote:
You said loadHTML() wants the CONTENTS of the document. In this case wouldn't that be held in the string $urlcontents?
Yes
Quote:
Just adding the two lines below generates roughly 125 errors:
That's because the HTML on the page is syntactically invalid. This is one of the few situations where it's appropriate to apply PHP's error suppression operator.
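A small sketch of what that could look like. The broken sample markup is made up to trigger parser warnings; the real $urlcontents would come from file_get_contents.

```php
<?php
// Deliberately sloppy, made-up markup (unclosed <p>, stray </font>).
$urlcontents = '<html><head><title>March 1-3, 2013</title></head>'
             . '<body><p>One<p>Two</font></body></html>';

$dom = new DOMDocument;

// The @ operator silences the warnings the parser raises for invalid HTML.
// The document is still loaded; only the messages are hidden.
$loaded = @$dom->loadHTML($urlcontents);

$title = null;
$nodes = $dom->getElementsByTagName('title');
if ($loaded && $nodes->length > 0) {
    $title = $nodes->item(0)->nodeValue;
}

echo $title, "\n";
```

An alternative to @ is calling libxml_use_internal_errors(true) before loadHTML, which collects the parser errors instead of printing them so they can still be inspected with libxml_get_errors() if needed.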
Posts: 14
Time spent in forums: 2 h 53 m 9 sec
Reputation Power: 0
Oreo,
Hoping you might shed some light on something I'm encountering that's confusing.
I'm looking at the source for the following URL:
http://www.boxofficemojo.com/weekend/chart/
Using DOM presents a small problem as the title's simple but grabbing roughly 100 table row tags is a bit more involved especially since the table isn't named. I actually suspect it'd be simpler to just grab this with a preg_match. Testing it I basically have the following code:
PHP Code:
//Set a new needle to find the movies for the week.
//This string is unique to just the entries I need.
$needle = "<td><font size=\"2\"><a href=\"/movies/?id=";
$test = strpos($urlcontents, $needle);
echo $test;
preg_match($needle, $urlcontents, $matches);
$urlcontents is set up with a file_get_contents above that code.
The issue is two-fold:
1. It generates an odd error telling me:
PHP Warning: preg_match() [<a href='function.preg-match'>function.preg-match</a>]: Unknown modifier '<' in searchurl.php on line 58
2. The $test is returning a different number on most runs. The number is always similar but not the same.
Posts: 8,063
Time spent in forums: 2 Months 1 Day 6 h 49 m 31 sec
Reputation Power: 7104
Quote:
1. It generates an odd error telling me:
PHP Warning: preg_match() [<a href='function.preg-match'>function.preg-match</a>]: Unknown modifier '<' in searchurl.php on line 58
The value in $needle isn't a useful regular expression. The first character of $needle is <, so that becomes the opening pattern delimiter, and its matching closing delimiter is >. That means that the regular expression ends at the first >, so the pattern is just td. The rest of the string, starting with the < right before font, is treated as pattern modifiers, and < is not a valid modifier, which is what the warning is telling you. Even with proper delimiters, $needle isn't useful as a regular expression because it doesn't have any dynamic parts.
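A minimal sketch of a pattern built from that needle, with explicit delimiters and a dynamic part. The sample rows are made up and only modeled on $needle; what follows id= on the real page (the [^"]*"> part and the closing </a>) is an assumption.

```php
<?php
// Made-up rows modeled on $needle; the real page's markup may differ.
$urlcontents = '<tr><td><font size="2"><a href="/movies/?id=first.htm">First Movie</a></font></td></tr>'
             . '<tr><td><font size="2"><a href="/movies/?id=second.htm">Second Movie</a></font></td></tr>';

$needle = '<td><font size="2"><a href="/movies/?id=';

// # is used as the delimiter so the slashes in the HTML need no escaping.
// preg_quote escapes regex metacharacters in the literal prefix (such as ?).
// [^"]*"> skips the rest of the href, and (.+?) captures the link text.
$pattern = '#' . preg_quote($needle, '#') . '[^"]*">(.+?)</a>#s';

preg_match_all($pattern, $urlcontents, $matches);

// $matches[1] holds the captured movie titles.
$movies = $matches[1];

print_r($movies);
```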
Quote:
2. The $test is returning a different number on most runs. The number is always similar but not the same.
The page probably has some dynamic content on it somewhere that changes every time you request it.