February 22nd, 2005, 09:43 PM
A simple Regular Expression for HTML
I am trying to get a regular expression that will parse an HTML file for <a href> links. I need to extract just the website that is linked to, and it needs to be able to get the link whether it has quotation marks around it or not.
For example, parsing <a href="www.yahoo.com"> should give me "www.yahoo.com", while parsing <a href=www.yahoo.com> should give me www.yahoo.com
For some reason, my regular expression attempts have not been working. If anyone could give me a hand, I'd appreciate it.
Here are things that I have tried (page is an HTML file string):
temp = re.compile(r"a href=(\S*)>")
links = temp.findall(page)
temp = re.match("a href=(.*?)>").group()
February 23rd, 2005, 03:50 AM
Hi, I'm a Python newbie, but maybe I can help...
This is a simple attempt at matching the a hrefs from an HTML string (it's not perfect though; it will obviously still match a hrefs that are inside pre tags, commented out, etc.):
pattern = re.compile(r"a href=[\"']?([^\"'\s>]+)")
hrefs = pattern.findall(page)
If you're still stuck, try looking at:
Beautiful Soup (HTML Parser)
Last edited by Markup; February 23rd, 2005 at 03:53 AM.
Reason: Added link
February 23rd, 2005, 04:00 AM
A couple of things could go wrong with your regex. Firstly, if there are spaces around the equals sign in the href, then it will fail to match, e.g.
<a href = "some-link" >
Secondly, "\S*" is greedy and will try to match as much as possible. So if there are no spaces but there is a second tag after the href, then it may match the whole thing, e.g. when the link text contains no whitespace and is followed directly by the closing tag.
In that case the regex will match the entire text up to and including the closing </a>.
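To make the greedy behaviour concrete, here is a small sketch that runs the pattern from the first post against a made-up one-line page. The sample string is an assumption chosen so that the link text contains no whitespace.

```python
import re

# Hypothetical sample page: no whitespace between the href and the closing tag.
page = '<a href="some-link">text</a>'

# Pattern from the original question; \S* is greedy.
greedy = re.compile(r"a href=(\S*)>")
links = greedy.findall(page)

# The capture runs past the first ">" and stops only at the last one.
print(links)  # ['"some-link">text</a']
```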
A better regex is:
The question mark in .*? makes the match non-greedy, so it will only match up to the next quote.
If you are matching pages that are not XHTML compliant, then you may want to make the regex case-insensitive so it will match tags in upper case.
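As an illustration of the two points above (non-greedy matching and case insensitivity), here is a minimal sketch. The pattern is an assumption written for this example and handles only double-quoted hrefs; the sample page is made up.

```python
import re

# Assumed pattern for illustration: capture whatever sits between the
# double quotes that follow href=. The .*? stops at the next quote.
quoted = re.compile(r'a href="(.*?)"', re.IGNORECASE)

# Hypothetical sample with one upper-case and one lower-case tag.
page = '<A HREF="www.yahoo.com">Yahoo</A> <a href="www.python.org">Python</a>'
links = quoted.findall(page)

print(links)  # ['www.yahoo.com', 'www.python.org']
```

Without re.IGNORECASE the first link would be skipped, because "A HREF" does not match the lower-case literal in the pattern.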
The regex will still fail if the "a" tag has other attributes before the href, and it is very hard to code round that using a single regex. If you really want to cover all possibilities then it will be better to use an XML or HTML parsing library.
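For comparison, here is a rough sketch of the parsing-library route using html.parser from the Python standard library. The class name LinkCollector and the sample page are made up for this example; it is one possible approach, not the only one.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case,
        # and attribute values with surrounding quotes removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample: extra attribute, spaces around "=", unquoted value.
page = '<A class="ext" HREF = "www.yahoo.com" >Yahoo</A> <a href=www.python.org>Python</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Because the parser deals with quoting, attribute order and letter case itself, none of those cases needs special handling in the calling code.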
Dave - The Developers' Coach
February 23rd, 2005, 04:56 AM
This will extract URLs even if there are additional attributes like "class=... title=..." etc. or if the anchor tag contains line breaks.
import re
# Raw string so that \s reaches the regex engine unchanged.
pat = r"""<a\s+[^>]*?href\s*=\s*['"]?\s*([^\s'">]+)\s*['"]?\s*.*?>"""
r = re.compile(pat, re.IGNORECASE|re.MULTILINE|re.DOTALL)
urls = r.findall(text)
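A quick way to try this out is sketched below. The sample text is made up; it includes an extra attribute before the href, a line break inside the tag, spaces around the equals sign, and an unquoted upper-case link.

```python
import re

pat = r"""<a\s+[^>]*?href\s*=\s*['"]?\s*([^\s'">]+)\s*['"]?\s*.*?>"""
r = re.compile(pat, re.IGNORECASE | re.MULTILINE | re.DOTALL)

# Hypothetical sample input for illustration only.
text = '''<a class="ext"
   href = "www.yahoo.com" title="Yahoo">Yahoo</a>
<A HREF=www.python.org>Python</A>'''

urls = r.findall(text)
print(urls)  # ['www.yahoo.com', 'www.python.org']
```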