#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2005
    Posts
    4
    Rep Power
    0

    Python & The Web: Scenario


    Basically my script's goal is to:
    1. access a local.yahoo.com results page that I have downloaded onto my hard drive.
    2. visit a particular link on that results page.
    3. parse info from that particular link.
    4. go back to results page and visit the "Next" link.
    5. Repeat steps 2-4.

    At the end, one file will have all the parsed info from all the visited webpages.

    I am new to Python's internet capabilities, and after reading all the documentation on them I am a little confused. What I want is to fetch and read webpages via the "Next" link without a browser popping up as a result. I looked at webbrowser, urllib, SGMLParser, and HTMLParser in the Python documentation, but it didn't clearly describe what I was looking for.

    Is there a way to access and work with HTML pages without opening any browser windows?

    Any suggestions would be greatly appreciated.

    THANKS!
#2
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2004
    Posts
    461
    Rep Power
    25
    Yes, urllib2 will be your best bet for this, I would say. It lets you handle a connection to a web site as if it were a file. There is also an HTML parser module, but I don't know if it would be good for you. Your best bet is to google some information about urllib2 for Python and get the basics; then you can play with the HTML parser in Python to see if you like it or if it would help.

    For an example of urllib2:

    Code:
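    import urllib2

    # urlopen returns a file-like object, so the page can be read line by line
    website = urllib2.urlopen("http://local.yahoo.com")
    for line in website:
        print line.rstrip()  # each line already ends with its own newline
    website.close()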
    This will print every line of that page's source.
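For the "visit a particular link" and "Next" steps from the first post, something along these lines might work. It is only a rough sketch, written for Python 3, where the standard parser lives in `html.parser` and URL helpers in `urllib.parse`. The names `LinkCollector` and `find_link` are made up for illustration, and the assumption that the link text is exactly "Next" would need checking against the real results page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html, base_url, label):
    """Return the URL of the first link whose text equals label, else None."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    for url, text in collector.links:
        if text == label:
            return url
    return None
```

The idea would be to fetch a page, pull out what you need, then call `find_link(html, current_url, "Next")` and repeat until it returns `None`. No browser window is involved at any point.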
#3
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2004
    Location
    There where the rabbits jump
    Posts
    556
    Rep Power
    11
    I don't fully understand what you mean by saying you have downloaded the page onto your computer. It would be much easier to access it directly from your program with the urllib module.
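If the first results page really does have to come from the hard drive, the same call can probably cover both cases, since `urlopen` also understands `file://` URLs. A minimal sketch for Python 3 (where `urlopen` lives in `urllib.request`); `load_page` is a hypothetical helper name, and treating anything without `://` as a local path is my own simplification:

```python
from pathlib import Path
from urllib.request import urlopen


def load_page(source):
    """Return the raw bytes of a page, given either a URL or a local path."""
    if "://" not in source:
        # Treat it as a file on disk and turn it into a file:// URL
        source = Path(source).resolve().as_uri()
    with urlopen(source) as response:
        return response.read()
```

That way the script can start from the saved copy and carry on with live pages without a separate code path.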
    Those people who think they know everything are a great annoyance to those of us who do.
#4
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2005
    Location
    Saint Kiev Russia
    Posts
    13
    Rep Power
    0
    To parse HTML I use this function:
    Code:
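    import formatter
    import htmllib
    import StringIO

    def html2txt(data):
        # DumbWriter needs a file-like object to write to, not a string
        out = StringIO.StringIO()
        w = formatter.DumbWriter(out)
        f = formatter.AbstractFormatter(w)
        p = htmllib.HTMLParser(f)
        p.feed(data)  # feed() returns None; the text ends up in out
        p.close()
        return out.getvalue()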
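For anyone on Python 3: `htmllib` is gone there and `formatter` was removed in later releases, so the function above will not import. A rough equivalent using only `html.parser` could look like the sketch below. `TextExtractor` and `html_to_text` are illustrative names, and skipping `<script>` and `<style>` content is my own choice rather than something the original function did.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather the visible text of a page, ignoring script and style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text pieces of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Unlike `DumbWriter`, this does no line wrapping or paragraph formatting; it just joins the text pieces, which is usually enough when the goal is to pick values out of a page.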
