#1
    Junior Member
    Devshed Newbie (0 - 499 posts)
    Join Date: Feb 2004 | Posts: 3 | Rep Power: 0

    Help with HTML Parsing


    Good morning all.

    I'm currently working on a Python script that parses HTML files made with Dreamweaver templates. I've followed the example in http://forums.devshed.com/t106786/s.html, using the method

    def handle_data(self, data):
        self.TAGDATA.append(data)

    This works fine, but special characters in the HTML, such as á, à and others, are not shown: if I print the data directly, I end up with things like ',' in place of ó, and if I do a print "".join(data) all those characters disappear.

    Any hint on this?

    Thanks in advance
#2
    Contributing User
    Devshed Novice (500 - 999 posts)
    Join Date: Feb 2003 | Location: Canada | Posts: 543 | Rep Power: 24
    Are you typing in the characters? If so, I suggest using the HTML entity for that character.

    Code:
    & #224; makes a small 'a' grave
    à
    (Ignore the space between the ampersand and the number sign. That is there so that it doesn't make the entity on the forums.)
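For reference, the numeric form of an entity is just the character's code point. The following is a minimal sketch, assuming Python 3's standard library; the helper names `numeric_entity` and `named_entity` are made up for illustration and are not part of any library.

```python
from html.entities import codepoint2name


def numeric_entity(ch):
    """Return the decimal numeric character reference for one character."""
    return "&#%d;" % ord(ch)


def named_entity(ch):
    """Return the named entity if HTML defines one, else the numeric form."""
    name = codepoint2name.get(ord(ch))
    return "&%s;" % name if name else numeric_entity(ch)


# U+00E0 is the small 'a' with grave accent discussed above.
print(numeric_entity("\u00e0"))
print(named_entity("\u00e0"))
```

Either form should display as the same character in a browser, so the choice is mostly about readability of the source.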

    I tested using HTML entities in my Python CGI script.

    You can see that here.

    If that doesn't solve the problem, could you provide us with a bit more information?

    Thanks!
#3
    Junior Member
    Devshed Newbie (0 - 499 posts)
    Join Date: Feb 2004 | Posts: 3 | Rep Power: 0
    Thanks for your reply,

    Yes, I'm using the HTML entities (in fact, I wrote & aacute; in the forum, but ended up with &aacute).

    Let me explain a bit more about what I'm doing: I have some HTML files made with Dreamweaver templates, and I'm retrieving one tag and showing it in a Python SSI on another page.

    I get the tag with HTMLParser successfully, with
    Code:
    	def __init__(self):
    		HTMLParser.__init__(self)
    		self.TAG=0
    		self.TAGDATA=[]
    	def handle_comment(self,tag):
    		if tag == " InstanceBeginEditable name=\"Titulo\" ":
    			self.TAG=1
    		if tag == " InstanceEndEditable ":
    			self.TAG=0
    	def handle_data(self,data):
    		if self.TAG==1:
    			self.TAGDATA.append(data)
    but the problem comes with the HTML entities: & oacute;, for example, shows as ', ':
    Code:
    #return the title:
    	def return_title(self):
    		return self.TAGDATA
    #and print it with join:
            titulo = parser.return_title()
            print ''.join(titulo)
    I use join to avoid the [] at the beginning and end of the list; when I do so, the HTML entities don't show at all. If I print self.TAGDATA directly, I end up with the things I mentioned: ', ' in place of & oacute;

    I hope I've explained it a little better now.
    Last edited by musashiBRS; February 12th, 2004 at 02:53 AM.
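One way to see where the entity goes is to log every parser callback rather than only `handle_data`. The sketch below is an illustration under stated assumptions, not the poster's code: it uses Python 3's `html.parser` (the thread uses the Python 2 `HTMLParser` module, which has the same handler names), it passes `convert_charrefs=False` so that entities arrive as separate events, and the class name `EventLogger` is made up for the example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which handler receives each piece of the input."""

    def __init__(self):
        # convert_charrefs=False keeps entities as separate events
        # instead of folding them into the text passed to handle_data.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for named references such as &oacute;
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Called for numeric references such as &#243;
        self.events.append(("charref", name))


logger = EventLogger()
logger.feed("remodelaci&oacute; del Mercat")
logger.close()
for event in logger.events:
    print(event)
```

If the log shows an entityref event sitting between two data chunks, that would explain both symptoms: the list prints as two strings separated by ', ', and join() glues them together with the entity missing.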
#4
    Contributing User
    Devshed Novice (500 - 999 posts)
    Join Date: Feb 2003 | Location: Canada | Posts: 543 | Rep Power: 24
    Run IDLE. It's either here:

    Code:
    start
     All Programs
      Python 2.3
       IDLE Python (GUI)
    or here:

    Code:
    Go to where Python is saved. The default is:
    
    start
     My Computer
      Local Disk (C:/)
       Python23
        Lib
         idlelib
          idle.py
    Then open:

    Code:
    File
     Open... (Ctrl+O)
      [browse your computer for the script]
       Click on 'Open' when you're done.
    Now, run the script.

    Code:
    Run
     Run Module (F5)
    Copy and paste all of the errors (if you are getting any).

    Try adding this to the very top of the main page (should be the first two lines):

    Code:
    #!/usr/bin/env python
    print 'Content-Type: text/html\n'
    Didn't work? Try adding it to the pages you are including as well. Still didn't work? Try adding it to just the pages you are including.

    Let us know how things go.
#5
    Hello World :)
    Devshed Frequenter (2500 - 2999 posts)
    Join Date: Mar 2003 | Location: Hull, UK | Posts: 2,537 | Rep Power: 69
    If you're viewing the results in your browser (from CGI for instance) then the HTML entities may be working fine, but the browser is showing them as the chars they represent. In which case you should check out the source for the page. Worth considering.

    If you want to show these entities literally, why not replace '&' with the HTML entity for '&'?

    Mark.
    programming language development: www.netytan.com Hula

#6
    Contributing User
    Devshed Novice (500 - 999 posts)
    Join Date: Feb 2003 | Location: Canada | Posts: 543 | Rep Power: 24
    Yeah, I suggest doing that.

    Here is the entity for the ampersand.

    & #38;
    &
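The ampersand replacement suggested above can be sketched in a couple of lines. This is only an illustration, assuming Python 3; the function name `show_entities_literally` is hypothetical, and `html.escape` is shown as a standard-library alternative that also escapes `<`, `>` and quotes.

```python
import html


def show_entities_literally(text):
    """Replace each '&' with its numeric entity so a browser displays
    the entity source text instead of the character it stands for."""
    return text.replace("&", "&#38;")


sample = "remodelaci&oacute; del Mercat Central"
print(show_entities_literally(sample))

# html.escape does the same job with the named form &amp;
print(html.escape(sample))
```

Note that this only changes how the text is displayed; it does not help if the entity never reaches the output in the first place.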
#7
    Junior Member
    Devshed Newbie (0 - 499 posts)
    Join Date: Feb 2004 | Posts: 3 | Rep Power: 0
    First of all thanks for your interest.

    I forgot to say that if I print "& aacute;" or any other entity directly, it shows correctly (both in IDLE and on the web server).
    I'm already adding the Content-Type header, and I'm using IDLE.

    Here I attach the code:

    Code:
    #!/usr/local/bin/python
    print "Content-type: text/html\n"
    print
    
    from HTMLParser import HTMLParser
    import htmllib
    import formatter
    import string
    
    class MyParser(HTMLParser):
    
    	def __init__(self):
    		HTMLParser.__init__(self)
    		self.TAG=0
    		self.TAGDATA=[]
    	def handle_comment(self,tag):
    		if tag == " InstanceBeginEditable name=\"Titulo\" ":
    			self.TAG=1
    		if tag == " InstanceEndEditable ":
    			self.TAG=0
    	def handle_data(self,data):
    		if self.TAG==1:
    			self.TAGDATA.append(data)
    	def return_title(self):
    		print self.TAGDATA
    		return self.TAGDATA
    
    
    def write_table_header():
    	print "<table border=\"0\" cellspacing=\"0\" cellpadding=\"0\">"
    	print "<tr>"
    	print "<td width=\"200\" valign=\"top\">"
    	print "<table width=\"200\" border=\"0\" cellpadding=\"5\" cellspacing=\"0\">"
    
    def write_table_footer():
    	print "</table>"
    	print "</td>"
    	print "</tr>"
    	print "</table>"
    
    def write_noticia_header():
    	print "<tr>"
    	print "<td align=\"right\" valign=\"top\">"
    	print "<img src=\"gif/p15.gif\" align=\"top\" border=\"0\" width=\"30\" height=\"10\" naturalsizeflag=\"0\"></td>"
    	print "<td valign=\"top\">"
    
    def write_noticia_body(line,parser):
            line=line.strip()
            print "<a href=\"#\" onClick=\"openWindow('noticies/%s')\">" % line
            titulo = parser.return_title()
            print ''.join(titulo)
            print "<img src=\"gif/punt.gif\" width=\"30\" height=\"10\" border=\"0\" align=\"absbottom\"></a>"
    
    def write_noticia_footer():
    	print "</td>"
    	print "</tr>"
    
    def write_noticia(line,parser):
    	write_noticia_header()
    	write_noticia_body(line,parser)
    	write_noticia_footer()
    
    def write_separator():
        print "</table>"
        print "</td>"
        print "<td width=\"10\" valign=\"middle\" height=\"100%\">"
        print "<table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" width=\"1\" bgcolor=\"#aaaaaa\" height=\"97%\">"
        print "<tr>"
        print "<td height=\"100%\"><img src=\"gif/blanc.gif\" width=\"1\" height=\"32\" border=\"0\"></td>"
        print "</tr>"
        print "</table>"
        print "</td>"
        print "<td width=\"193\" valign=\"top\">"
        print "<table border=\"0\" cellpadding=\"5\" cellspacing=\"0\" width=\"191\" height=\"63\">"
    
    def main():
        noticies=open('noticies.txt','r')
    
        line=noticies.readline()
        write_table_header()
        for line in noticies:
            if(line=="dreta:\n"):
                write_separator()
            else:
                line=string.rstrip(line)
                f=open(line,'r')
                parser = MyParser()
                parser.feed(f.read())
                write_noticia(line,parser)
                parser.close()
                f.close()
    
        write_table_footer()
        noticies.close()
    
    if __name__ == "__main__":
        main()
    This script reads from the following example file:

    Code:
    #HTML from which the script reads the data: 
    
    <table width="100%" cellpadding="2" cellspacing="0">
      <tr>
              <td><font size="4" face="Arial, Helvetica, sans-serif"><b><!-- InstanceBeginEditable name="Titulo" --> Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci& oacute; del Mercat Central<!-- InstanceEndEditable --></b></font></td>
      </tr>
    </table>
    And the output, even in IDLE (the same happens when it runs as a CGI on a web server), is as follows:
    Code:
    #output of parser.TAGDATA:
    [" Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci", ' del Mercat Central']
    #output of ''.join(parser.TAGDATA)
     Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci del Mercat Central
    #when the right output should be:
     Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci& oacute; del Mercat Central
    It seems I'm not reading the HTML correctly, but I don't know what I'm doing wrong.

    Thanks again
    Last edited by musashiBRS; February 13th, 2004 at 03:36 AM.
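The output above is consistent with the parser delivering entities through `handle_entityref` (and numeric ones through `handle_charref`) rather than `handle_data`, so a class that only overrides `handle_data` never sees them. What follows is a minimal sketch of one possible fix, not a definitive implementation. Assumptions: it is written for Python 3's `html.parser` with `convert_charrefs=False` (the Python 2 `HTMLParser` module used in the thread has the same handler names), and the names `TitleParser`, `in_title`, `parts` and `title` are invented for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the Dreamweaver 'Titulo' editable region,
    keeping entities in their original source form."""

    def __init__(self):
        # Keep entities as separate events so they can be re-emitted verbatim.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.in_title = False
        self.parts = []

    def handle_comment(self, comment):
        text = comment.strip()
        if text == 'InstanceBeginEditable name="Titulo"':
            self.in_title = True
        elif text == "InstanceEndEditable":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference, e.g. name == "oacute": rebuild "&oacute;"
        if self.in_title:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        # Numeric reference, e.g. name == "224": rebuild "&#224;"
        if self.in_title:
            self.parts.append("&#%s;" % name)

    def title(self):
        return "".join(self.parts)


sample = ('<b><!-- InstanceBeginEditable name="Titulo" --> '
          'remodelaci&oacute; del Mercat Central &#224; prop'
          '<!-- InstanceEndEditable --></b>')
parser = TitleParser()
parser.feed(sample)
parser.close()
print(parser.title())
```

Re-emitting the entity text, rather than decoding it to a character, keeps the output safe to print into another HTML page regardless of the page's character encoding.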
