#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    88
    Rep Power
    11

    Need some quick help with urllib


    Can someone please tell me how to use urllib2 to search for text on a website and report back whether or not it found the requested text? I used to have a guide that covered this, but I cannot find it.
  2. #2
  3. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Here's a small example of what you need to do. You should get the idea.

    Code:
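    >>> import urllib
    >>> 
    >>> page = urllib.urlopen('http://www.python.org/')
    >>> 'python' in page.read()
    True
    >>> 'perl' in page.read()  # the first read() used up the response, so this tests ''
    False
    >>>
    As you can see, we retrieve the page using the urlopen() function, then use the in operator to check whether the string 'python' is present. Note that read() consumes the response: a second call returns an empty string, so the 'perl' test above is False no matter what the page contains. To run more than one test, store the result of page.read() in a variable first.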
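
    A minimal sketch of testing several strings against a single download. It is written for current Python 3 rather than the Python 2 used in this thread, and io.StringIO with made-up HTML stands in for the object urlopen() returns so that it runs without a network connection; contains_all is a hypothetical helper name.

```python
import io

def contains_all(page, words):
    """Read a file-like page once and report which words it contains."""
    text = page.read()              # read() consumes the stream, so call it once
    return {word: word in text for word in words}

# Stand-in for the object urlopen() returns; the HTML is made up.
page = io.StringIO("<html><body>Welcome to python.org</body></html>")
result = contains_all(page, ["python", "perl"])
leftover = page.read()              # nothing is left after the first read()
```

    Keeping the text in a variable means every test sees the same content.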

    Mark.
    programming language development: www.netytan.com Hula

  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    88
    Rep Power
    11
    Thank you for the help.
    I was able to get this:
    Code:
    print "Checking for a /. update..."
    import urllib
    
    last_news = file('C:\Thing.txt.', 'r')
    slashdot = urllib.urlopen('http://slashdot.org')
    if last_news in slashdot.read():
        print "There are no recent updates."
    else:
        print "There is an update to go see."
        line = '<a HREF="//slashdot.org/search.pl?topic=1'
        new_news = line in slashdot.read()
        last_news = new_news
        last_news.file('C:\Thing.txt','w')
    However, when I run it, I get this...
    Code:
    >>> 
    Checking for a /. update...
    
    Traceback (most recent call last):
      File "E:/slashdotcheck.py", line 6, in -toplevel-
        if last_news in slashdot.read():
    TypeError: 'in <string>' requires string as left operand
    >>>
    Sorry to bother you again, but what does that mean and how can I fix it?
  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    The problem is that last_news is a file object, not a string. You need to change your if statement to something like this:

    Code:
    if last_news.read() in slashdot.read():
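
    To illustrate the difference between the file object and its contents, here is a small sketch in Python 3 syntax. io.StringIO stands in for the opened file, and the headline text is invented.

```python
import io

saved = io.StringIO("Old headline")      # stand-in for the opened Thing.txt
page_text = "<p>Old headline</p><p>Older headline</p>"

try:
    saved in page_text                   # file object on the left: TypeError
    raised = False
except TypeError:
    raised = True

found = saved.read() in page_text        # string on the left: works
```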

  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    88
    Rep Power
    11
    Code:
    print "Checking for a /. update..."
    import urllib
    
    last_news = file('C:\Thing.txt', 'r')
    news = last_news.readlines()
    check_news = str(news)
    slashdot = urllib.urlopen('http://slashdot.org')
    if check_news in slashdot.read():
        print "There are no recent updates."
        last.close()
    else:
        print "There is an update to go see."
        line = '<a HREF="//slashdot.org/search.pl?topic=1'
        new_news = line in slashdot.read()
        new1_news = str(new_news)
        last = file('C:\Thing.txt','w')
        last.write(new1_news)
        last.close()
    OK, so I was able to get this. However, whenever I ask it to check whether a variable that has been assigned a string (such as check_news) is in the page, it can never find it. I'm not sure why this is. Everything besides that works now, so thanks for the help so far.
  10. #6
  11. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    You're using the readlines() method, then converting the list to a string. This is what's actually happening:

    Code:
    >>> someLines #Returned by readlines()
    ['line1', 'line2', 'line3']
    >>> str(someLines)
    "['line1', 'line2', 'line3']"
    >>>
    As you can see, converting a list to a string using str() doesn't really look "right" (you wouldn't find that text in most web pages). Just use the file object's read() method to get the whole file as a string.
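
    A short sketch comparing the two approaches on a real file, in Python 3 syntax (where file() is spelled open()). The temporary file stands in for C:\Thing.txt, and the file contents and page text are made up.

```python
import os
import tempfile

# Write a throwaway file standing in for C:\Thing.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("Valve Wins\nSummary Judgment\n")

with open(path) as handle:
    as_list_text = str(handle.readlines())   # text of a Python list literal
with open(path) as handle:
    as_text = handle.read()                  # the file contents, unchanged
os.remove(path)

page_text = "<p>Valve Wins\nSummary Judgment\n</p>"
list_found = as_list_text in page_text       # brackets and quotes never match
text_found = as_text in page_text
```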

  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2003
    Posts
    624
    Rep Power
    35
    Originally Posted by pylon
    However, whenever I ask it to check to see if a variable, which has been assigned a string, is there such as check_news it can never find it. I'm not sure why this is.
    There's no way to say this without sounding like a smart-alec [Edit: OK, maybe there is - see above], but it can't find it because it isn't there.

    Code:
    news = last_news.readlines()
    check_news = str(news)
    file.readlines() returns a list, and str() of a list is literally a string representation of that list, with all the Python list delimiting characters in it. That literal text won't appear in a Slashdot page.

    Slashdot has an RSS feed, which is a kind of distilled website - the content without the presentation and graphics; it would be much much easier to use an RSS reading program as they do just this - check for updates every so often and keep you informed.

    Code:
    if check_news in slashdot.read():
        print "There are no recent updates."
        last.close()
    This will only tell you about recent updates once the content of check_news has fallen right off the site. To actually spot new news items, you would need to parse the HTML behind the site (what you see with View -> Source in your browser) to extract where the news items should be and look for new items. This is a technique known as screen-scraping, and it is notoriously troublesome and prone to breaking, as any change to the site's layout can break your script. It's one reason why news sites use things like RSS to feed the latest news items to RSS client software.

    If you really want to do it yourself, looking at the RSS file (linked at the end of the site - http://slashdot.org/index.rss ) would probably be ten times easier than looking at the main page. But reading that 'properly' would require some use of Python with XML, which I have never tried.
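
    For what it's worth, reading a feed with the standard library's xml.etree.ElementTree module might look roughly like the sketch below (Python 3 syntax). The sample feed is made up and uses a plain RSS 2.0 layout; the real Slashdot feed may be laid out differently (for example with XML namespaces), so the "channel/item" path is an assumption. item_titles and has_update are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up feed in plain RSS 2.0 layout, newest item first.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Newest story</title><link>http://example.com/2</link></item>
    <item><title>Older story</title><link>http://example.com/1</link></item>
  </channel>
</rss>"""

def item_titles(feed_xml):
    """Return the title of every <item> in the feed, in document order."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title") for item in root.findall("./channel/item")]

def has_update(feed_xml, last_seen_title):
    """True when the first item's title differs from the one seen last time."""
    titles = item_titles(feed_xml)
    return bool(titles) and titles[0] != last_seen_title
```

    Comparing only the first title assumes the feed lists the newest item first.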

    To make a horrible hack-job that might work, you could search the main content with a regular expression for the term "Posted by [any content] on [Day] [Month] [Year] @[any time]" and the first time you found that, store it. That would tell you if there were new updates, but not what they were. But it would still be processing HTML with a regular expression (ick) and be very prone to breaking, and be re-inventing the wheel.
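
    A rough sketch of that hack, in Python 3 syntax. The sample HTML and the exact wording of the "Posted by" line are invented to match the description above, not copied from the live site, so the pattern would need adjusting against the real page. POSTED_RE and first_stamp are hypothetical names.

```python
import re

# Loose pattern: "Posted by <who> on <date words> @<time>".
POSTED_RE = re.compile(r"Posted by\s+(.+?)\s+on\s+(.+?)\s+@(\S+)")

def first_stamp(html):
    """Return the first 'Posted by ...' stamp in the page, or None."""
    match = POSTED_RE.search(html)
    return match.group(0) if match else None

# Made-up page fragment with two stories, newest first.
SAMPLE_HTML = (
    "<b>Story A</b> Posted by alice on Wednesday January 26 @05:32PM "
    "<b>Story B</b> Posted by bob on Tuesday January 25 @09:10AM"
)
stamp = first_stamp(SAMPLE_HTML)
```

    Storing the stamp and comparing it on the next run would show that something changed, but not what.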

    Code:
    else:
        print "There is an update to go see."
        line = '<a HREF="//slashdot.org/search.pl?topic=1'
        new_news = line in slashdot.read()
        new1_news = str(new_news)
    The construct "A in B" is only a test, it returns True (A is in B) or False (A is not in B) - it never extracts any of the content, so you would be writing "True" or "False" to the file.
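
    To actually pull text out of the page, the string methods find() and slicing can be used instead of in. A minimal sketch with a made-up page fragment:

```python
page_text = '<p><a href="news.php?id=357">Headline</a></p>'
marker = '<a href="'

is_present = marker in page_text          # True or False, nothing more

start = page_text.find(marker)            # index of the marker, or -1
if start != -1:
    start += len(marker)                  # skip past the marker itself
    end = page_text.find('"', start)      # closing quote of the attribute
    link = page_text[start:end]           # the text between the quotes
else:
    link = None
```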

    :|
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2004
    Posts
    88
    Rep Power
    11
    Based on what you've told me, I decided to start off with a simpler site, www.half-life2.com/news.php
    It hasn't done anything wrong yet. However, I was wondering how to get it to run on start-up on winxp
    Here's the code just in case you wanted to see:
    Code:
    print "Checking for an update on Half-life2.com..."
    import urllib
    
    last_news = file('C:\Thing.txt', 'r')
    news = last_news.read()
    slashdot = urllib.urlopen('http://half-life2.com/news.php')
    if news in slashdot.read():
        print "There are no recent updates."
        last_news.close()
    else:
        print "There is an update to go see."
        line = "Arial,Helvetica,Geneva,Swiss,SunSans-Regular"
        news = line in slashdot.read()
        last = file('C:\Thing.txt','w')
        last.write(news)
        last.close()
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2003
    Posts
    624
    Rep Power
    35
    {deleted half post}
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Nov 2003
    Posts
    624
    Rep Power
    35
    However, I was wondering how to get it to run on start-up on winxp
    Create a new shortcut, and point it to "c:\python23\python.exe" (adjust this if you installed Python to somewhere else).
    Edit the shortcut, and change it to:

    "c:\python23\python.exe" "c:\path to myscript\myscript.py"

    Then drop the shortcut into the start menu under programs -> startup.

    You will probably want to add

    Code:
    raw_input("Press Enter to close...")
    to the end of your program if you do this, so the console window stays open long enough to read the output.


    Originally Posted by pylon
    It hasn't done anything wrong yet.
    Pretend there has been an update, and put some made up old news in c:\thing.txt.

    As soon as it gets to the line

    Code:
        last.write(news)
    It crashes with:

    Code:
    C:\>script1.py
    Checking for an update on Half-life2.com...
    There is an update to go see.
    Traceback (most recent call last):
      File "C:\script1.py", line 15, in ?
        last.write(news)
    TypeError: argument 1 must be string or read-only character buffer, not bool
    
    C:\>
    This means it will never write anything to the file c:\thing.txt, which means it can never tell you if there are any updates.


    I decided to start off with a simpler site, www.half-life2.com/news.php
    There's nothing simpler about any other news site - the first problem is that the news is buried in HTML and other markup describing how to display the text, where to put it, what font to use and so on, and this is universal to reading from any website. This is a problem because you have to manually sort through the code behind the site to find out where the adverts stop and the news begins.

    The second problem is that news sites keep old news visible. If you get a story from Jan 20th, and then you get a new story, when you search for the story from Jan 20th it will still be there.
    You have to actually look in the place the new news will be.

    Demonstration:

    Code:
    >>> import urllib2
    >>> site = urllib2.urlopen('http://half-life2.com/news.php')
    >>> site.read()
    '\n<html>\n\n\t<head>\n\t\t<meta http-equiv="content-type" content="text/html;charset=iso-8859-1">\n\t\t<meta http-equiv="Page-Enter" content="blendTrans (Duration=0.25)">\n\t\t<meta name="AUTHOR" content="Valve Corporation">\n\t\t<LINK rel="stylesheet" type="text/css" href="vguide.css">\n\t\t<title>H A L F - L I F E  2</title>\n\t\t<csscriptdict>\n\t\t\t<script><!--\nCSInit = new Array;\nfunction CSScriptInit() {\nif(typeof(skipPage) != "undefined") { if(skipPage) return; }\nidxArray = new Array;\nfor(var i=0;i<CSInit.length;i++)\n\tidxArray[i] = i;\nCSAction2(CSInit, idxArray);}\nCSAg = window.navigator.userAgent; CSBVers = parseInt(CSAg.charAt(CSAg.indexOf("/")+1),10);\nCSIsW3CDOM 
    
    <snip>
    >>>
    That's what your program has to navigate...

    Code:
        line = "Arial,Helvetica,Geneva,Swiss,SunSans-Regular"
        news = line in slashdot.read()
    Since they put that at the start of every news item, it will always find it, which is not useful if you need the result to change when there is new news.

    I would normally write some more code to show what I mean, but this is a hard problem and it would take ages. Poking at the half-life news site though, we can see this:

    Code:
    <!-- news content here! -->
    										
    											<p><font color="white" face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular"><a href="news.php?id=357" style="color: White;"><span class="class.01.newshed"><span class="class.newshed"><b>Valve Wins Summary Judgment Motions in Copyright Infringement Case</b></span></span></a></font><font color="white" face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular" size="2"><br>
    Valve today announced the U.S. Federal District Court in Seattle, WA granted its motion for summary judgment on the matters of Cyber Café Rights and Contractual Limitation of Liability in its copyright infringement suit with Sierra/Vivendi Universal Games. Click <a href="http://www.valvesoftware.com/C02-1683Z.htm">here</a> to read the judge's order.<div>&nbsp;</div></font><font size="2" color="white" face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular"><br>
    
    </font></p>
    "<!-- news content here! -->" seems to mark the start of the news items, so you could search for that, then extract probably everything between the matching paragraph (</p>) tags, and store that in the text file...

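    The marker-based idea above could be sketched like this, in Python 3 syntax. It assumes the comment marker and the first closing </p> really do bracket the newest item, which may stop being true whenever the site changes. latest_item and check_for_update are hypothetical helper names, and state_path is whatever file holds the last item seen.

```python
import os

START_MARKER = "<!-- news content here! -->"
END_MARKER = "</p>"

def latest_item(html):
    """Return the markup between the start marker and the next </p>, or None."""
    start = html.find(START_MARKER)
    if start == -1:
        return None
    start += len(START_MARKER)
    end = html.find(END_MARKER, start)
    if end == -1:
        return None
    return html[start:end].strip()

def check_for_update(html, state_path):
    """Compare the newest item with the saved one; save it when it differs."""
    newest = latest_item(html)
    if newest is None:
        return False                      # marker missing; nothing to compare
    previous = ""
    if os.path.exists(state_path):
        with open(state_path) as handle:
            previous = handle.read()
    if newest == previous:
        return False
    with open(state_path, "w") as handle:
        handle.write(newest)              # a string, not the bool from "in"
    return True
```

    On the first run the state file does not exist yet, so the newest item is reported as an update and saved; later runs report an update only when that first item changes.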