#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2005
    Posts
    6
    Rep Power
    0

    A simple Regular Expression for HTML


    Hi,

    I am trying to write a regular expression that will parse an HTML file for <a href> links. I need to extract just the website that is linked to, and it needs to be able to get the link whether it has quotation marks around it or not.

    For example, parsing <a href="www.yahoo.com"> should give me "www.yahoo.com", while parsing <a href=www.yahoo.com> should give me www.yahoo.com

    For some reason, my regular expression attempts have not been working. If anyone could give me a hand, I'd appreciate it.

    Here are things that I have tried (page is an HTML file string):


    1.
    temp = re.compile(r"a href=(\S*)>")
    links = temp.findall(page)

    2.
    temp = re.match("a href=(.*?)>").group()
  2. #2
  3. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2005
    Location
    The Holographic Universe
    Posts
    75
    Rep Power
    10



    Hi, I'm a Python newbie, but maybe I can help...

    This is a simple attempt at matching the a hrefs in an HTML string (it's not perfect though: it will obviously still match a hrefs that are inside pre tags, commented out, etc.):

    Code:
    import re
    # Raw string so that \s reaches the regex engine unchanged
    pattern = re.compile(r"""a href=["']?([^"'\s>]+)""")
    hrefs = pattern.findall(page)
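    As a quick check, here is a small sketch that runs the pattern above against the two cases from the original question. The sample string `page` is made up for illustration; it is not from the original poster's file.

```python
import re

# Pattern from the post above, written as a raw string.
pattern = re.compile(r"""a href=["']?([^"'\s>]+)""")

# Hypothetical sample input: one quoted and one unquoted href.
page = '<a href="www.yahoo.com">Yahoo</a> <a href=www.google.com>Google</a>'

hrefs = pattern.findall(page)
print(hrefs)  # ['www.yahoo.com', 'www.google.com']
```

    Note that the quotes are stripped in both cases, because the optional quote sits outside the capturing group.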
    If you're still stuck, try looking at:

    Beautiful Soup (HTML Parser)
    Last edited by Markup; February 23rd, 2005 at 03:53 AM. Reason: Added link
  4. #3
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    London, England
    Posts
    1,585
    Rep Power
    1373
    A couple of things could go wrong with your regex. Firstly, if there are spaces around the equals sign in the href attribute then it will fail to match, e.g.
    <a href = "some-link" >

    Secondly, "\S*" is greedy and will try to match as much as possible. So if there are no spaces but there is a second tag after the href, it may match the whole thing, e.g.
    <a href="some-link"><b>Clickme</b></a>

    In this particular case the regex will match the entire text up to and including the closing </a>.
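    Both failure modes are easy to reproduce. The sketch below uses the first pattern from the original question; the variable names are only for illustration.

```python
import re

# First attempt from the original question.
greedy = re.compile(r"a href=(\S*)>")

# Failure 1: spaces around "=" mean the literal "a href=" never matches.
spaced = '<a href = "some-link" >'
spaced_result = greedy.findall(spaced)
print(spaced_result)  # []

# Failure 2: \S* is greedy, so it runs on to the last ">" it can reach.
nested = '<a href="some-link"><b>Clickme</b></a>'
nested_result = greedy.findall(nested)
print(nested_result)  # ['"some-link"><b>Clickme</b></a']
```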

    A better regex is:

    r'''<a\s+href\s*=\s*['"](.*?)['"]'''

    The question mark in .*? makes the match non-greedy, so it will only match up to the next quote.

    If you are matching pages that are not XHTML compliant, you may want to make the regex case-insensitive (for example with re.IGNORECASE) so that it also matches tags written in upper case.
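    Here is a minimal sketch of that regex compiled with re.IGNORECASE. The sample tags are invented for the demonstration. Note that, as written, the pattern requires an opening quote, so an unquoted href (one of the cases in the original question) is not matched.

```python
import re

# Regex from the post above, with case-insensitive matching enabled.
link_re = re.compile(r'''<a\s+href\s*=\s*['"](.*?)['"]''', re.IGNORECASE)

# Hypothetical samples: spaces around "=", and an upper-case tag.
samples = " ".join([
    '<a href = "some-link" >one</a>',
    "<A HREF='Other-Link'>two</A>",
])
quoted = link_re.findall(samples)
print(quoted)  # ['some-link', 'Other-Link']

# An unquoted href does not match, because a quote is mandatory here.
unquoted = link_re.findall('<a href=www.yahoo.com>')
print(unquoted)  # []
```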

    The regex will still fail if the "a" tag has other attributes before the href, and it is very hard to code around that using a single regex. If you really want to cover all possibilities, it is better to use an XML or HTML parsing library.
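    For comparison, here is a rough sketch of the parser approach using html.parser from the Python 3 standard library (an assumption on my part; the thread does not name a specific library here, and Beautiful Soup would work along similar lines). The class name HrefCollector and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips quotes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical input: extra attribute, upper case, and an unquoted href.
page = ('<A class="x" HREF="www.yahoo.com">Yahoo</A> '
        '<a href=www.google.com>Google</a>')

collector = HrefCollector()
collector.feed(page)
print(collector.hrefs)  # ['www.yahoo.com', 'www.google.com']
```

    The parser deals with attribute order, letter case and quoting for you, which is exactly the part that is awkward to express in a single regex.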

    Dave - The Developers' Coach
  6. #4
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2004
    Location
    Regensburg, Germany
    Posts
    147
    Rep Power
    17
    Try this:
    Code:
    import re

    # Raw string so the \s escapes are passed to the regex engine as written
    pat = r"""<a\s+[^>]*?href\s*=\s*['"]?\s*([^\s'">]+)\s*['"]?\s*.*?>"""
    r = re.compile(pat, re.IGNORECASE|re.MULTILINE|re.DOTALL)
    urls = r.findall(text)
    This will extract URLs even if there are additional attributes like "class=... title=..." etc. or if the anchor tag contains line breaks.
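    A small test of that claim, as a sketch: the sample `text` below is invented and includes extra attributes, a line break inside the tag, an upper-case tag and an unquoted href.

```python
import re

# Pattern from the post above, written as a raw string.
pat = r"""<a\s+[^>]*?href\s*=\s*['"]?\s*([^\s'">]+)\s*['"]?\s*.*?>"""
r = re.compile(pat, re.IGNORECASE | re.MULTILINE | re.DOTALL)

# Hypothetical input for illustration only.
text = (
    '<a class="nav"\n   title="Home" href="index.html">Home</a>\n'
    '<A HREF=www.yahoo.com>Yahoo</A>'
)

urls = r.findall(text)
print(urls)  # ['index.html', 'www.yahoo.com']
```

    As with the earlier regex suggestions, this will still pick up links that are commented out, so a real parser remains the safer choice for untrusted or messy pages.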
