#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    16
    Rep Power
    0

    Parsing URL string into (multiple) usable parts


    You guys were great in helping me out in a similar endeavor a few months ago, so I figured I would go back to the well for some guidance with my latest round of tinkering with URL strings.

    For reference, the URL string I am trying to parse out is here:
    http://soccernet.espn.go.com/bottoml...resSource=euro

    Running the simple code:
    PHP Code:
    $page = file_get_contents("http://soccernet.espn.go.com/bottomline/scores/scores?scoresSource=euro");
    preg_match_all("/&([^=]+)=([^&]+)/", urldecode($page), $foo);
    foreach ($foo[1] as $key => $value) {
      echo "{$value} = {$foo[2][$key]}\t<br />\n";
    }

    gives me these results:
    Code:
    EUROSOC_s_delay = 120
    EUROSOC_s_stamp = 20130926473074
    EUROSOC_s_left1 = Internazionale v Fiorentina (2:45 PM ET)
    EUROSOC_s_right1_1 = Italian Serie A
    EUROSOC_s_url1 = http://soccernet.espn.go.com/preview?id=377264
    EUROSOC_s_left2 = Athletic Bilbao v Real Betis (2:00 PM ET)
    EUROSOC_s_right2_1 = Spanish Primera Divisi
    #243;n&EUROSOC_s_url2 = http://soccernet.espn.go.com/preview?id=373162
    EUROSOC_s_left3 = Getafe v Celta Vigo (4:00 PM ET)
    EUROSOC_s_right3_1 = Spanish Primera Divisi
    #243;n&EUROSOC_s_url3 = http://soccernet.espn.go.com/preview?id=373158
    EUROSOC_s_left4 = Villarreal v Espanyol (4:00 PM ET)
    EUROSOC_s_right4_1 = Spanish Primera Divisi
    #243;n&EUROSOC_s_url4 = http://soccernet.espn.go.com/preview?id=373159
    EUROSOC_s_count = 4
    EUROSOC_s_loaded = true
    The first issue I have is that there appears to be an error in the url string itself. For some reason, the letter "o" is being represented by "#243;" in "Spanish Primera Division", which is jacking up the preg_match function. Don't know how much that is going to impact what I am looking to do as a whole. As it's a problem with the source data itself, there's nothing I can do about it, but thought I would mention it anyway.

    Here's what I can't figure out how to do... when the url string is fetched using file_get_contents, I want to break out the results further so that only results such as this:
    Code:
    EUROSOC_s_left1 = Internazionale v Fiorentina (2:45 PM ET)
    EUROSOC_s_right1_1 = Italian Serie A
    EUROSOC_s_url1 = http://soccernet.espn.go.com/preview?id=377264
    (as well as any others with a "_right" value of "Italian Serie A") are returned. I've used regex before on these type of url strings, but only to return things as a whole. Would the process for using the regex to do what I am looking to do be the same?

    Any help you all could give would be much appreciated!
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,383
    Rep Power
    594
    Why would you not use parse_url?
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    16
    Rep Power
    0
    Originally Posted by gw1500se
    Why would you not use parse_url?
    I'm not sure I follow you... Wouldn't parse_url give me somewhat the same result as using the file_get_contents and preg_match_all combination I described in my original post?
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 2003
    Posts
    3,383
    Rep Power
    594
    Apples and oranges. file_get_contents() returns the actual HTML, in which case I would use DOM to parse it. However, your post seems to imply you want to parse the URL itself, not the HTML to which it points.
  8. #5
  9. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,957
    Rep Power
    1046
    The URL string is more or less broken. The people who wrote this ran some (all?) parameters through an HTML escaping function (using the old ASCII encoding). This turns characters like "ó" (an "o" with an accent) into HTML entities like &#243;.

    When you decode the URL, you bring up those "&" characters from the HTML entities, and they clash with the "&" characters separating the query parameters.
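To make the clash concrete, here is a minimal sketch. The `$raw` string and the example.com URL are made up for illustration; they only mimic the shape of the feed quoted in the first post.

```php
<?php
// Hypothetical fragment of the feed, shaped like the output quoted in this thread.
$raw = '&EUROSOC_s_right2_1=Spanish%20Primera%20Divisi%26%23243%3Bn'
     . '&EUROSOC_s_url2=http://example.com/preview?id=1';

// Decoding the whole string first exposes the "&" of the entity "&#243;".
$decoded = urldecode($raw);

preg_match_all('/&([^=]+)=([^&]+)/', $decoded, $foo);

// The league name is cut off at the entity's "&" ...
echo $foo[2][0], "\n"; // Spanish Primera Divisi
// ... and the next key absorbs the rest of the entity.
echo $foo[1][1], "\n"; // #243;n&EUROSOC_s_url2
```

This reproduces the broken "#243;n&EUROSOC_s_url2" key from the original output: the separator "&" and the entity "&" become indistinguishable once the whole string is decoded before it is split.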

    Parsing the URL by hand is a bad idea, anyway. But parse_url() makes no sense either. What you want is parse_str(). This parses the query part of a URL.

    The procedure is this:

    1. Run the URL through parse_str()
    2. URL-decode the parameters (parse_str() already does this while parsing, so don't decode a second time)
    3. HTML-decode the parameters
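The steps above can be sketched roughly as follows. The `$body` string is a made-up sample standing in for what file_get_contents() would return, and the entity decoding assumes the output should be UTF-8. Since parse_str() URL-decodes names and values on its own, steps 1 and 2 collapse into a single call.

```php
<?php
// Hypothetical sample of the feed body; the real one would come from file_get_contents().
$body = '&EUROSOC_s_left2=Granada%20v%20Athletic%20Bilbao'
      . '&EUROSOC_s_right2_1=Spanish%20Primera%20Divisi%26%23243%3Bn'
      . '&EUROSOC_s_count=1';

// Steps 1 and 2: parse_str() splits the pairs and URL-decodes each name and value.
parse_str($body, $params);

// Step 3: turn entities such as "&#243;" back into real characters (UTF-8 assumed).
foreach ($params as $name => $value) {
    $params[$name] = html_entity_decode($value, ENT_QUOTES, 'UTF-8');
}

echo $params['EUROSOC_s_right2_1'], "\n"; // Spanish Primera División
```

Because the string is split into pairs before anything is decoded, the "&" inside the entity never gets a chance to act as a separator.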

    You might consider sending a bug report to that site telling them about the HTML entities problem.
    The 6 worst sins of security | How to (properly) access a MySQL database with PHP

    Why can’t I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    16
    Rep Power
    0
    So Jacques, your response about 1st attempting to fix the url string output set off an idea in my head... all I did was run things through a couple iterations of str_replace:
    PHP Code:
    $url = 'http://soccernet.espn.go.com/bottomline/scores/scores?scoresSource=euro';
    $content = file_get_contents($url);
    $fix = str_replace("#243;n", "o", $content);
    $fixed = str_replace("%26%23243%3B", "o", $fix);

    preg_match_all("/&([^=]+)=([^&]+)/", urldecode($fixed), $foo);

    foreach ($foo[1] as $key => $value) {
      echo "{$value} = {$foo[2][$key]}\t<br />\n";
    }

    and I end up with a much cleaner output:
    Code:
    EUROSOC_s_delay = 120
    EUROSOC_s_stamp = 20130926482647
    EUROSOC_s_left1 = Internazionale 0 - 0 Fiorentina (First Half)
    EUROSOC_s_right1_1 = Italian Serie A
    EUROSOC_s_url1 = http://soccernet.espn.go.com/match?id=377264
    EUROSOC_s_left2 = Athletic Bilbao 0 - 0 Real Betis (First Half)
    EUROSOC_s_right2_1 = Spanish Primera Division
    EUROSOC_s_url2 = http://soccernet.espn.go.com/match?id=373162
    EUROSOC_s_left3 = Getafe v Celta Vigo (4:00 PM ET)
    EUROSOC_s_right3_1 = Spanish Primera Division
    EUROSOC_s_url3 = http://soccernet.espn.go.com/preview?id=373158
    EUROSOC_s_left4 = Villarreal v Espanyol (4:00 PM ET)
    EUROSOC_s_right4_1 = Spanish Primera Division
    EUROSOC_s_url4 = http://soccernet.espn.go.com/preview?id=373159
    EUROSOC_s_count = 4
    EUROSOC_s_loaded = true
    Now, as for using parse_str to get the desired outcome, I am at a loss, because of the dependencies involved. The game scores are all indicated with an index of "_left(digit) = " . Thus, there are four scores. However, like I said in my original post, the goal here is to take the output above and pull out only those for a certain European football league. The league is indicated with an index of "_right(digit)_1" . Example.. say I only want scores returned for games in the Italian Serie A. With the current data, that would mean this would be the only score returned:

    Code:
    EUROSOC_s_left1 = Internazionale 0 - 0 Fiorentina (First Half)
    EUROSOC_s_right1_1 = Italian Serie A
    EUROSOC_s_url1 = http://soccernet.espn.go.com/match?id=377264
    I'm not sure how to set things up with parse_str so that it checks the "_right" index for "Italian Serie A", and returns the values from the corresponding "_left(digit) = " and "_url(digit) = " index values if true. I have a good enough grasp on the parse_str function to get the job done if I wanted to pull everything, but for what I am looking to do, it seems to require a bit of working backwards... Am I totally off with that assumption? Any hints, tips or otherwise?
  12. #7
  13. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,957
    Rep Power
    1046
    First of all, manually removing that "ó" stuff is a terrible idea. Who says that's the only special character that will ever appear in the data for the lifetime of your application? It most likely won't be the only one.

    This is a general problem, so it needs a general solution. The problem is not that some evil Spaniard injected a "ó" into the string to annoy you. The problem is that any character outside of the ASCII range is represented as an HTML entity. So forget about the "ó" and use the steps described above.

    Next thing is that you confuse two different things: Parsing and filtering the data have nothing to do with each other. Sure, you could do both at the same time using those ugly regexes. But I recommend you don't. First parse the data with parse_str(). And then filter it by checking the array indices with a regex.

    This will reduce your code to a few simple lines.
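A minimal sketch of that parse-then-filter split, assuming PHP 7 or later. The sample `$body`, the example.com URLs, and the `$league` and `$games` names are illustrative; only the EUROSOC_s_* key layout is taken from the data shown in this thread.

```php
<?php
// Hypothetical sample body, shaped like the data shown in this thread.
$body = '&EUROSOC_s_left1=Fiorentina%20v%20Parma'
      . '&EUROSOC_s_right1_1=Italian%20Serie%20A'
      . '&EUROSOC_s_url1=http://example.com/preview?id=1'
      . '&EUROSOC_s_left2=Granada%20v%20Athletic%20Bilbao'
      . '&EUROSOC_s_right2_1=Spanish%20Primera%20Division'
      . '&EUROSOC_s_url2=http://example.com/preview?id=2'
      . '&EUROSOC_s_count=2';

// Parsing: one call, no hand-written regex for the query format.
parse_str($body, $params);

// Filtering: find the "_right<N>_1" keys whose value names the wanted league,
// then use <N> to look up the matching "_left<N>" and "_url<N>" entries.
$league = 'Italian Serie A';
$games = [];
foreach ($params as $name => $value) {
    if (preg_match('/^EUROSOC_s_right(\d+)_1$/', $name, $m) && $value === $league) {
        $n = $m[1];
        $games[] = [
            'score'  => $params["EUROSOC_s_left{$n}"] ?? '',
            'league' => $value,
            'url'    => $params["EUROSOC_s_url{$n}"] ?? '',
        ];
    }
}

print_r($games);
```

The regex here only inspects array keys, so the values themselves never have to be split or matched by hand.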

    The underlying problem, however, is that the data source is extremely poor, because it's an unstructured bunch of values. Is this actually the official API? Don't they have XML or JSON or something?
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    16
    Rep Power
    0
    Originally Posted by Jacques1
    First of all, manually removing that "ó" stuff is a terrible idea. Who says that's the only special character that will ever appear in the data for the lifetime of your application? It most likely won't be the only one.

    This is a general problem, so it needs a general solution. The problem is not that some evil Spaniard injected a "ó" into the string to annoy you. The problem is that any character outside of the ASCII range is represented as an HTML entity. So forget about the "ó" and use the steps described above.

    Next thing is that you confuse two different things: Parsing and filtering the data have nothing to do with each other. Sure, you could do both at the same time using those ugly regexes. But I recommend you don't. First parse the data with parse_str(). And then filter it by checking the array indices with a regex.

    This will reduce your code to a few simple lines.

    The underlying problem, however, is that the data source is extremely poor, because it's an unstructured bunch of values. Is this actually the official API? Don't they have XML or JSON or something?
    No, this is not the official API and no XML or JSON source is available to my knowledge. The URLs are from ESPN's bottom line desktop widget. There's a separate one for all the different professional sports leagues, except when it comes to soccer. Hence my current parsing endeavor. What I've done with the other URLs (e.g. MLB, NFL, NBA, etc) is strip them down so that each individual score is reported as an RSS item. I then use these in an RSS reader on my website and voilà... I have a homebrewed sports score ticker. But anyway...

    So I finally got a moment to run things through parse_str:
    PHP Code:
    $url = 'http://espnfc.com/bottomline/scores/scores?scoresSource=euro';
    $str = file_get_contents($url);

    parse_str($str, $myArray);
    print_r($myArray);
    and the result was quite interesting:
    Code:
    Array ( [EUROSOC_s_delay] => 120 [EUROSOC_s_stamp] => 20130927461652 [EUROSOC_s_left1] => Real Valladolid v Malaga (3:00 PM ET) [EUROSOC_s_right1_1] => Spanish Primera División [EUROSOC_s_url1] => http://soccernet.espn.go.com/preview?id=373153 [EUROSOC_s_count] => 1 [EUROSOC_s_loaded] => true )
    The "ó" problem is gone. Now, am I wrong for being surprised for that happening, or was that result to be expected?
  16. #9
  17. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,957
    Rep Power
    1046
    Like I said: The actual values of the URL are encoded as HTML entities. If you view the URL-decoded string with a browser, you'll see the original characters.

    Technically, this is bad, because you're supposed to get the raw content and not some preprocessed stuff. But if the only thing you'll ever do with the data is outputting it on the screen, then this bug shouldn't be a problem.

    Otherwise, decode the HTML entities as mentioned above (with html_entity_decode()).
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    16
    Rep Power
    0
    Originally Posted by Jacques1
    Like I said: The actual values of the URL are encoded as HTML entities. If you view the URL-decoded string with a browser, you'll see the original characters.

    Technically, this is bad, because you're supposed to get the raw content and not some preprocessed stuff. But if the only thing you'll ever do with the data is outputting it on the screen, then this bug shouldn't be a problem.

    Otherwise, decode the HTML entities as mentioned above (with html_entity_decode()).
    Correct, I am only going to output the data on the screen via rss.

    To be honest, I am really stuck at how to finish what I need by filtering using regex. Can someone please give me a hint about how to set this up?

    Running parse_str on the current data:
    PHP Code:
    <?php
    $url = 'http://espnfc.com/bottomline/scores/scores?scoresSource=euro';
    $str = file_get_contents($url);

    parse_str($str, $myArray);

    print_r($myArray);
    ?>
    I get this:
    Code:
    Array
    (
        [EUROSOC_s_delay] => 120
        [EUROSOC_s_stamp] => 20130930513807
        [EUROSOC_s_left1] => Fiorentina v Parma (2:45 PM ET)
        [EUROSOC_s_right1_1] => Italian Serie A
        [EUROSOC_s_url1] => http://soccernet.espn.go.com/preview?id=377257
        [EUROSOC_s_left2] => Granada v Athletic Bilbao (4:00 PM ET)
        [EUROSOC_s_right2_1] => Spanish Primera División
        [EUROSOC_s_url2] => http://soccernet.espn.go.com/preview?id=373147
        [EUROSOC_s_count] => 2
        [EUROSOC_s_loaded] => true
    )
    I want to pull out ONLY those 3-item combinations (left, right, url) that correspond to games for Italian Serie A. So, in the case of the current data, this is the only one I'd want:
    Code:
    [EUROSOC_s_left1] => Fiorentina v Parma (2:45 PM ET)
    [EUROSOC_s_right1_1] => Italian Serie A
    [EUROSOC_s_url1] => http://soccernet.espn.go.com/preview?id=377257
    I know it's like I am asking someone to do it for me, but I really just need a hint on where to start. I've used regex before, but never like this.
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    16
    Rep Power
    0
    So I sucked it up, did some research and figured this thing out. In the process, I learned a great deal too. Who knew?

    I did things a little differently from how Jacques recommended though. Instead of going with the parse_str method, I just used regex.

    PHP Code:
    $url = 'http://soccernet.espn.go.com/bottomline/scores/scores?scoresSource=euro';

    //Get page
    $content = file_get_contents($url);

    //grab all variables out of page
    preg_match_all("/&([^=]+)=([^&]+)/", urldecode($content), $foo);

    $results = array();

    //loop thru all variables on page
    foreach ($foo[1] as $key => $match) {
      //get score, league, link info
      if (preg_match("/s_left\d+|s_right\d+_\d+|s_url\d+/", $match)) {
        $results[] = $foo[2][$key];
      }
    }

    //group each game into score-league-link combination
    $games = array_chunk($results, 3);

    $pattern = "/Champ/";  //grab only UEFA Champions League games
    $matches = array();

    //loop through the data
    foreach ($games as $key => $value) {
      //loop through each key under data sub array
      foreach ($value as $key2 => $value2) {
        //check for match.
        if (preg_match($pattern, $value2)) {
          //add to matches array.
          $matches[$key] = $value;
          //match found, break from foreach
          break;
        }
      }
    }

    //report each game's score-league-link combination
    foreach ($matches as $score) {
      echo $score[0]."\n";
      echo $score[1]."\n";
      echo $score[2]."\n";
      echo "\n";
    }

    And the results:
    Code:
    Bayer Leverkusen v Real Sociedad (2:45 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380765
    
    Shakhtar Donetsk v Manchester United (2:45 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380763
    
    Juventus v Galatasaray (2:45 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380762
    
    Real Madrid v FC Copenhagen (2:45 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380761
    
    Anderlecht v Olympiakos (2:45 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380764
    
    Paris Saint-Germain  v Benfica (2:45 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380760
    
    CSKA Moscow v Viktoria Plzen (12:00 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380766
    
    Manchester City v Bayern Munich (2:45 PM ET)
    UEFA Champions League
    http://soccernet.espn.go.com/preview?id=380759
    I also ran things using some archived data (since the current data reflected in this post doesn't show games in multiple leagues) to ensure my $pattern regex was truly working as it should. After that, it was simply a matter of adding the appropriate code for RSS 2.0 and modifying the echos to produce each feed item.

    So the one remaining question I have is in regards to my decision to use regexes in lieu of parse_str and how I subsequently grouped each game's score-league-link combination into its own sub-array using the array_chunk function... Was this the best way to accomplish what I wanted? I just think back to what Jacques said about using parse_str and how it would reduce my code to a few simple lines... Did I make this harder than it really needed to be?
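One thing worth checking about the array_chunk grouping is that it relies on every game contributing exactly three values. The sketch below is purely illustrative: the extra "_right2_2" style value is an assumption (the thread never shows one, although the key pattern s_right\d+_\d+ suggests it could exist), and the sample values are made up.

```php
<?php
// Values in feed order for two games. The second game carries a hypothetical
// extra "_right2_2" value (an assumption, not something shown in the thread).
$results = [
    'Fiorentina v Parma', 'Italian Serie A', 'http://example.com/preview?id=1',
    'Granada v Athletic Bilbao', 'Spanish Primera Division', 'Extra info',
    'http://example.com/preview?id=2',
];

$games = array_chunk($results, 3);

// The second chunk no longer ends with its URL, and a stray third chunk appears.
print_r($games);
```

If the feed really does always send exactly one "_right" value per game, the chunking holds up; grouping by the number embedded in each key avoids depending on that.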
