Python Programming
 
#1 musashiBRS, February 11th, 2004, 02:09 AM
Help with HTML Parsing

Good morning all.

I'm currently working on a Python script that parses HTML files made with Dreamweaver templates. I've followed the example in URL using the method

def handle_data(self, data):
    self.TAGDATA.append(data)

This works fine, but when I use special characters in the HTML, like á, à and others, they are not shown; if I print the data directly, I end up with things like ',' for ó, and if I do a print "".join(data) all those characters disappear.

Any hint on this?

Thanks in advance

#2 MasterChief, February 11th, 2004, 07:09 AM
Are you typing in the characters? If so, I suggest using the HTML entity for that character.

Code:
& #224; makes a small 'a' grave
à


(Ignore the space between the ampersand and the number sign. That is there so that it doesn't make the entity on the forums.)

I tested using HTML entities in my Python CGI script.

You can see that, here.

If that doesn't solve the problem, could you provide us with a bit more information?

Thanks!
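
For reference, the numeric entity mentioned above can be checked from Python itself. This is only a small sketch and it assumes Python 3, whose standard html module is newer than the Python 2.3 used in this thread. It illustrates that the numeric form (written here without the extra space) and the named form refer to the same character.

```python
import html

# Decode the numeric character reference for a small 'a' with grave accent.
decoded = html.unescape("&#224;")

# The named reference should decode to the same character.
named = html.unescape("&agrave;")

print(decoded)        # the decoded character
print(ord(decoded))   # its code point, which matches the number in the entity
```

The number in a numeric entity is simply the character's code point, which is why 224 maps to à.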

#3 musashiBRS, February 12th, 2004, 01:50 AM
Thanks for your reply,

Yes, I'm using the HTML entities (in fact, I wrote & aacute; in the forum, but it ended up as á).

Let me explain a bit more about what I'm doing: I have some HTML files made with Dreamweaver templates, and I'm retrieving one tag and showing it in a Python SSI on another page.

I get the tag with HTMLParser successfully, with
Code:
	def __init__(self):
		HTMLParser.__init__(self)
		self.TAG=0
		self.TAGDATA=[]
	def handle_comment(self,tag):
		if tag == " InstanceBeginEditable name=\"Titulo\" ":
			self.TAG=1
		if tag == " InstanceEndEditable ":
			self.TAG=0
	def handle_data(self,data):
		if self.TAG==1:
			self.TAGDATA.append(data)


but the problem comes with the HTML entities... & oacute;, for example, shows up as ', ':
Code:
#return the title:
	def return_title(self):
		return self.TAGDATA
#and print it with join:
        titulo = parser.return_title()
        print ''.join(titulo)

I use join to avoid the [] at the beginning and end of the list; when I do so, the HTML entities don't show at all. If I print self.TAGDATA I end up with what I described: ', ' in place of & oacute;

I hope I've explained it a little better now.

Last edited by musashiBRS : February 12th, 2004 at 01:53 AM.
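
One possible reading of the ', ' described above: it may not be a replacement for the entity at all, just the separator Python shows between two items when a list is printed. The sketch below uses Python 3 syntax and made-up sample strings standing in for self.TAGDATA, so treat it as an illustration and not as the actual data.

```python
# Hypothetical sample: two text chunks, as if the entity between them
# had never been delivered to handle_data.
chunks = ["remodelaci", " del Mercat Central"]

as_list = repr(chunks)      # what printing the list itself shows
as_text = "".join(chunks)   # what printing the joined string shows

print(as_list)
print(as_text)
```

If that is what is happening, the entity is missing from both outputs; printing the list only makes the gap visible as ', ' between the two pieces.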

#4 MasterChief, February 12th, 2004, 06:49 AM
Run IDLE. It's either here:

Code:
start
 All Programs
  Python 2.3
   IDLE Python (GUI)


or here:

Code:
Go to where Python is saved. The default is:

start
 My Computer
  Local Disk (C:/)
   Python23
    Lib
     idlelib
      idle.py


Then open:

Code:
File
 Open... (Ctrl+O)
  [browse your computer for the script]
   Click on 'Open' when you're done.


Now, run the script.

Code:
Run
 Run Module (F5)


Copy and paste all of the errors (if you are getting any).

Try adding this to the very top of the main page (should be the first two lines):

Code:
#!/usr/bin/env python
print 'Content-Type: text/html\n'


Didn't work? Try adding it to the pages you are including as well. Still didn't work? Try adding it to just the pages you are including.

Let us know how things go.

#5 netytan, February 12th, 2004, 08:06 AM
If you're viewing the results in your browser (from CGI, for instance), then the HTML entities may be working fine, but the browser is showing them as the characters they represent. In that case you should check the source of the page. Worth considering.

If you want to show these entities literally, why not replace '&' with the HTML entity for the ampersand itself?

Mark.


#6 MasterChief, February 12th, 2004, 01:03 PM
Yeah, I suggest doing that.

Here is the entity for the ampersand.

& #38;
&
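
A sketch of that suggestion, assuming Python 3: the standard-library html.escape used here is not available in the Python 2.3 discussed in this thread, where a plain string replace of '&' would do the same job for the ampersand. The sample title is shortened from the one in the thread.

```python
import html

title = "remodelaci&oacute; del Mercat Central"

# Escaping turns the '&' into '&amp;', so a browser displays the entity
# text literally instead of rendering it as a character.
escaped = html.escape(title)

# Equivalent for this string, since '&' is the only character that needs escaping.
manual = title.replace("&", "&amp;")

print(escaped)
```

This only changes how an entity is displayed once it is in the output; it does not help if the entity never reaches the output in the first place.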

#7 musashiBRS, February 13th, 2004, 02:32 AM
First of all, thanks for your interest.

I forgot to say that if I do a print "& aacute;" (or any other entity), it shows correctly, both in IDLE and on the webserver.
I'm already adding the Content-Type, and using IDLE.

Here I attach the code:

Code:
#!/usr/local/bin/python
print "Content-type: text/html\n"
print

from HTMLParser import HTMLParser
import htmllib
import formatter
import string

class MyParser(HTMLParser):

	def __init__(self):
		HTMLParser.__init__(self)
		self.TAG=0
		self.TAGDATA=[]
	def handle_comment(self,tag):
		if tag == " InstanceBeginEditable name=\"Titulo\" ":
			self.TAG=1
		if tag == " InstanceEndEditable ":
			self.TAG=0
	def handle_data(self,data):
		if self.TAG==1:
			self.TAGDATA.append(data)
	def return_title(self):
		print self.TAGDATA
		return self.TAGDATA


def write_table_header():
	print "<table border=\"0\" cellspacing=\"0\" cellpadding=\"0\">"
	print "<tr>"
	print "<td width=\"200\" valign=\"top\">"
	print "<table width=\"200\" border=\"0\" cellpadding=\"5\" cellspacing=\"0\">"

def write_table_footer():
	print "</table>"
	print "</td>"
	print "</tr>"
	print "</table>"

def write_noticia_header():
	print "<tr>"
	print "<td align=\"right\" valign=\"top\">"
	print "<img src=\"gif/p15.gif\" align=\"top\" border=\"0\" width=\"30\" height=\"10\" naturalsizeflag=\"0\"></td>"
	print "<td valign=\"top\">"

def write_noticia_body(line,parser):
        line=line.strip()
        print "<a href=\"#\" onClick=\"openWindow('noticies/%s')\">" % line
        titulo = parser.return_title()
        print ''.join(titulo)
        print "<img src=\"gif/punt.gif\" width=\"30\" height=\"10\" border=\"0\" align=\"absbottom\"></a>"

def write_noticia_footer():
	print "</td>"
	print "</tr>"

def write_noticia(line,parser):
	write_noticia_header()
	write_noticia_body(line,parser)
	write_noticia_footer()

def write_separator():
    print "</table>"
    print "</td>"
    print "<td width=\"10\" valign=\"middle\" height=\"100%\">"
    print "<table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" width=\"1\" bgcolor=\"#aaaaaa\" height=\"97%\">"
    print "<tr>"
    print "<td height=\"100%\"><img src=\"gif/blanc.gif\" width=\"1\" height=\"32\" border=\"0\"></td>"
    print "</tr>"
    print "</table>"
    print "</td>"
    print "<td width=\"193\" valign=\"top\">"
    print "<table border=\"0\" cellpadding=\"5\" cellspacing=\"0\" width=\"191\" height=\"63\">"

def main():
    noticies=open('noticies.txt','r')

    line=noticies.readline()
    write_table_header()
    for line in noticies:
        if(line=="dreta:\n"):
            write_separator()
        else:
            line=string.rstrip(line)
            f=open(line,'r')
            parser = MyParser()
            parser.feed(f.read())
            write_noticia(line,parser)
            parser.close()
            f.close()

    write_table_footer()
    noticies.close()

if __name__ == "__main__":
    main()


This script reads from the following example file:

Code:
#HTML from which the script reads the data: 

<table width="100%" cellpadding="2" cellspacing="0">
  <tr>
          <td><font size="4" face="Arial, Helvetica, sans-serif"><b><!-- InstanceBeginEditable name="Titulo" --> Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci& oacute; del Mercat Central<!-- InstanceEndEditable --></b></font></td>
  </tr>
</table>


And the output, even in IDLE (the same happens as a CGI on a webserver), is as follows:
Code:
#output of parser.TAGDATA:
[" Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci", ' del Mercat Central']
#output of ''.join(parser.TAGDATA)
 Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci del Mercat Central
#when the right output should be:
 Conveni entre l'Ajuntament de Tarragona i MERCASA per a la remodelaci& oacute; del Mercat Central


It seems I'm not reading the HTML correctly... but I don't know what I'm doing wrong.

Thanks again

Last edited by musashiBRS : February 13th, 2004 at 02:36 AM.
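
A likely explanation, offered as a sketch and not a confirmed diagnosis: HTMLParser does not pass entity references to handle_data. It reports them through handle_entityref (for named references such as & oacute;) and handle_charref (for numeric ones such as & #243;). A parser that only overrides handle_data would therefore drop them, which matches the output above. The example below assumes Python 3's html.parser with convert_charrefs=False, so that it behaves like the older parser; the class name TitleParser and the shortened sample string are illustrative, not taken from the thread.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the Dreamweaver "Titulo" editable region."""

    def __init__(self):
        # convert_charrefs=False makes Python 3 report entities separately,
        # as the Python 2 HTMLParser does.
        super().__init__(convert_charrefs=False)
        self.in_title = False
        self.parts = []

    def handle_comment(self, data):
        # Dreamweaver marks editable regions with HTML comments.
        if data == ' InstanceBeginEditable name="Titulo" ':
            self.in_title = True
        elif data == ' InstanceEndEditable ':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference: name is "oacute" for the oacute entity.
        if self.in_title:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        # Numeric reference: name is "243" for the numeric form.
        if self.in_title:
            self.parts.append("&#%s;" % name)

    def title(self):
        return "".join(self.parts)


sample = ('<b><!-- InstanceBeginEditable name="Titulo" --> la remodelaci&oacute; '
          'del Mercat Central<!-- InstanceEndEditable --></b>')
parser = TitleParser()
parser.feed(sample)
parser.close()
print(parser.title())
```

In Python 2.3 the same two methods could be added to MyParser, re-emitting the entity text so that the browser renders it. With the default setting in recent Python 3 versions (convert_charrefs=True), handle_data receives the already decoded character instead, so the two extra methods are not needed there.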

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 4 hosted by Hostway