#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2008
    Posts
    112
    Rep Power
    76

    Finding an RSS feed in HTML


    Hi there,

    I'm completely new to regular expressions (gotta pick up a book on it sometime soon!) and have been trying to figure out how to get an RSS feed out of HTML.

    I've successfully managed to get the title tag by using the following code:
    Code:
    // Get the title tag
    preg_match('@<title(.*)>(.*)</title>@i', $webContent, $titleTagArray);

    // The pattern has two capture groups, so the title text is in index 2
    if (!empty($titleTagArray[2]))
        $webTitle = $titleTagArray[2];
    ...this seems to be working well, but I can't get the RSS feed address. I've copied what I'm using below in case someone finds it helpful:

    Code:
    // Get the RSS or Atom feed address
    preg_match('@<link(.*)rel="alternate"(.*)href="(.*)"(.*)type="application/rss+xml"\s/>@i',$webContent,$feedAddrArray);

    // If the feed address has been found, assign it to a variable
    if($feedAddrArray && $feedAddrArray[2])
        $webFeedAddr = $feedAddrArray[2];
    Hoping someone can offer some help.
  2. #2
  3. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6351
    What does the HTML source look like? Your regular expression syntax seems sound, but maybe the tag attributes are in the wrong order.

    Also, we have a dedicated regular expression forum.

    -Dan
    HEY! YOU! Read the New User Guide and Forum Rules

    "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin

    "The greatest tragedy of this changing society is that people who never knew what it was like before will simply assume that this is the way things are supposed to be." -2600 Magazine, Fall 2002

    Think we're being rude? Maybe you asked a bad question or you're a Help Vampire. Trying to argue intelligently? Please read this.
  4. #3
  5. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6351
    I've merged your two threads into one. Please take a look at the forum rules; cross-posting is a violation.

    -Dan
  6. #4
  7. kill 9, $$;
    Devshed Supreme Being (6500+ posts)

    Join Date
    Sep 2001
    Location
    Shanghai, An tSín
    Posts
    6,897
    Rep Power
    3887
    Don't use regexps to parse HTML - there are HTML parsers available for all major programming languages.
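As a rough illustration of the parser route in PHP, here is a minimal sketch using the bundled DOM extension. The helper name `findFeedLinks()` and the list of accepted feed types are assumptions made for this example, not something from the thread, and it assumes the page source is already in a string such as `$webContent`.

```php
<?php
// Sketch: find feed URLs with PHP's DOM extension instead of a regex.
// findFeedLinks() is a hypothetical helper name used only for illustration.
function findFeedLinks(string $html): array
{
    // loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real pages are rarely valid markup; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumed set of feed MIME types; adjust as needed.
    $feedTypes = ['application/rss+xml', 'application/atom+xml'];
    $feeds = [];

    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens.
        $relTokens = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $type = strtolower(trim($link->getAttribute('type')));

        // Attribute order in the source no longer matters here.
        if (in_array('alternate', $relTokens, true) && in_array($type, $feedTypes, true)) {
            $feeds[] = $link->getAttribute('href');
        }
    }

    return $feeds;
}
```

Calling `findFeedLinks($webContent)` returns an array of `href` values (possibly empty), so a page that advertises both an RSS and an Atom feed yields both addresses.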

    Two reasons your regexp could fail come to mind immediately:
    a) As Dan says, the attributes of the <link> tag may not be in exactly the order your pattern expects ("rel" after "href", for instance).
    b) You're using greedy matching (dot-star), so each (.*) grabs as much text as it can and may run well past the attribute you want. Have a look at this article. It's written in a Perl context, but it applies to any regexp implementation.
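To make point (b) concrete, here is a small sketch comparing a greedy capture with a bounded one. The `$sample` markup is made up for illustration. The bounded pattern also escapes the `+` in `rss+xml`, which PCRE otherwise treats as a quantifier rather than a literal plus sign. It still expects rel, href, type in that order, so point (a) applies to it as well.

```php
<?php
// Made-up markup for illustration: a feed link followed by a stylesheet link.
$sample = '<link rel="alternate" href="/feed.xml" type="application/rss+xml" />'
        . '<link rel="stylesheet" href="/site.css" type="text/css" />';

// Greedy: (.*) runs on to the LAST double quote in the string,
// so the capture swallows the rest of the first tag and most of the second.
preg_match('@href="(.*)"@i', $sample, $greedy);

// Bounded: [^"]* cannot cross a quote and [^>]* cannot leave the tag.
// The + in rss+xml is escaped so it matches a literal plus sign.
$pattern = '@<link\b[^>]*\brel="alternate"[^>]*\bhref="([^"]*)"'
         . '[^>]*\btype="application/rss\+xml"[^>]*>@i';
preg_match($pattern, $sample, $bounded);

// The href is the first (and only) capture group in the bounded pattern.
$webFeedAddr = $bounded[1] ?? null;
```

With this sample, `$greedy[1]` spills over into the stylesheet tag, while `$webFeedAddr` holds only the feed address. Note that in the pattern from the first post, the href sits in the third capture group, not the second.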
  8. #5
  9. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2008
    Posts
    112
    Rep Power
    76
    Originally Posted by ManiacDan
    I've merged your two threads into one. Please take a look at the forum rules; cross-posting is a violation.

    -Dan
    Sorry about that. I didn't think I'd get a reply and it was quite urgent; won't do it again!

    Thanks for the help. I managed to get the regular expression working eventually.
