#1
  1. Cast down

    HTML parsing template & regex problem


    Update: I fixed the problem; I had to escape the ? character.
    But I noticed this is the worst solution for what I want. I can't escape every special character in the HTML by hand, so how should I parse an HTML file? Is there a standard module?


    I want to parse some sites, but their layouts aren't fixed, so I want the user to be able to write a template describing how a site is parsed. Nothing too complicated, just very basic. For example, the user should be able to get the title of a page using a template and a regex.

    Here is an example of the problem:
    (This works)
    Code:
    	import re

    	temp = '<A x="HI!!!" file="abc.exe">(.*?)</A>'
    	html = '\n-_-<A x="HI!!!" file="abc.exe">Call to action</A></A>-_-\n\t'
    	m = re.search(temp, html)
    	if m: print(m.groups())
    (This does not work)
    Code:
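    	import re

    	# The unescaped ? makes the preceding "e" optional instead of matching a literal ?, so nothing matches
    	temp = '<A x="HI!!!" file?"abc.exe">(.*?)</A>'
    	html = '\n-_-<A x="HI!!!" file?"abc.exe">Call to action</A></A>-_-\n\t'
    	m = re.search(temp, html)
    	if m: print(m.groups())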
    Last edited by movEAX_444; September 23rd, 2004 at 06:28 PM.
  2. #2
    Regexes are not a good choice for parsing HTML, as you have discovered.
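If you do stay with a regex for the simple template case, the escaping problem from the question can be handled by `re.escape` rather than by hand. The following is only a sketch: the `{{capture}}` placeholder and the `template_to_regex` helper are made-up names for illustration, not part of any library, and the approach still breaks as soon as the real page differs from the template by even a space.

```python
import re

def template_to_regex(template, placeholder="{{capture}}"):
    # Hypothetical helper: escape every literal piece of the template
    # and put a lazy capture group wherever the placeholder appeared.
    parts = template.split(placeholder)
    return "(.*?)".join(re.escape(part) for part in parts)

# Same data as in the question, with the capture marked by the placeholder.
temp = '<A x="HI!!!" file?"abc.exe">{{capture}}</A>'
html = '\n-_-<A x="HI!!!" file?"abc.exe">Call to action</A></A>-_-\n\t'

m = re.search(template_to_regex(temp), html)
if m:
    print(m.groups())
```

Here the ? in the template is matched literally, so the user never has to know which characters are special in a regex.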

    Python has two libraries for parsing HTML. The first, imaginatively called htmllib, is rather old and poorly documented, with no examples. The second is the newer HTMLParser, which is better documented.
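As a rough idea of what the HTMLParser route looks like for the "get the title of a page" case, here is a minimal sketch. It assumes a current Python, where the module is imported as `html.parser` (in the Python 2 releases discussed in this thread it was the top-level `HTMLParser` module). `TitleParser` and `get_title` are names invented for this example.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled as well.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)

def get_title(html_text):
    # Hypothetical convenience wrapper around the parser class.
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).strip()

page = "<html><head><title>Call to action</title></head><body><p>hi</p></body></html>"
print(get_title(page))
```

Because the parser works on tags rather than on raw characters, nothing in the page needs escaping.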

    The excellent online book Dive Into Python has a chapter on HTML processing. The book uses sgmllib, which is the more generic parent of htmllib, but the principles are the same.

    If you know that the HTML will also be XML compliant (XHTML) then you can parse it with any of the numerous XML parsers that exist for Python.
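For the XHTML case, a short sketch using `xml.etree.ElementTree` from the current standard library; that parser is my pick for the example, and any other XML parser would do the same job. The sample document is made up, and it assumes the page declares the usual XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Elements in an XHTML document live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Made-up, well-formed sample page.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Call to action</title></head>'
    '<body><a href="abc.exe">Download</a></body>'
    '</html>'
)

root = ET.fromstring(xhtml)

# Look up the title element and collect every link target.
title = root.find(".//" + XHTML_NS + "title")
links = [a.get("href") for a in root.iter(XHTML_NS + "a")]

print(title.text, links)
```

This only works on well-formed markup: `ET.fromstring` raises `ParseError` on an unclosed tag, so it is not a fit for arbitrary pages found in the wild.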

    Dave - The Developers' Coach
    Last edited by DevCoach; September 24th, 2004 at 03:41 AM.
