#1
Need some quick help with urllib
Can someone please tell me how to use urllib2 so that it searches for text on a website and reports back whether or not it found the requested text? I used to have a guide that covered this, but I cannot find it.
#2
Here's a small example of what you need to do. You should get the idea.
Code:
>>> import urllib
>>>
>>> page = urllib.urlopen('http://www.python.org/')
>>> html = page.read()  # read once; a second read() returns ''
>>> 'python' in html
True
>>> 'perl' in html
False
>>>
As you can see, we retrieve the page using the urlopen() function, read its contents once, then use the in operator to check whether the string 'python' is present. Mark.
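One thing worth watching with this approach: the object urlopen() returns behaves like a file, so its contents can only be read once. Here is a minimal sketch of that behaviour, written for Python 3 (where urlopen lives in urllib.request and read() returns bytes). To keep it runnable offline, io.BytesIO stands in for the response, and the page content and the contains() helper are made up for illustration.

```python
import io

# BytesIO stands in for the object urlopen() returns; both are
# file-like streams that are consumed as they are read.
page = io.BytesIO(b"<html><title>Welcome to python.org</title></html>")

first = page.read()    # the whole body
second = page.read()   # the stream is already at the end


def contains(body, needle):
    """Return True if needle occurs in body (both bytes)."""
    return needle in body


print(contains(first, b"python"))   # searches the real content
print(contains(second, b"python"))  # searches an empty bytes object
```

So if you want to run several tests against the same page, it is probably safest to keep the result of the first read() in a variable and test that.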
#3
Thank you for the help.
I was able to get this:
Code:
print "Checking for a /. update..."
import urllib
last_news = file('C:\Thing.txt.', 'r')
slashdot = urllib.urlopen('http://slashdot.org')
if last_news in slashdot.read():
    print "There are no recent updates."
else:
    print "There is an update to go see."
    line = '<a HREF="//slashdot.org/search.pl?topic=1'
    new_news = line in slashdot.read()
    last_news = new_news
    last_news.file('C:\Thing.txt','w')
However, when I run it, I get this:
Code:
>>>
Checking for a /. update...
Traceback (most recent call last):
  File "E:/slashdotcheck.py", line 6, in -toplevel-
    if last_news in slashdot.read():
TypeError: 'in <string>' requires string as left operand
>>>
Sorry to bother you again, but what does that mean and how can I fix it?
#4
The problem is that last_news is a file object and not a string. You need to change your if statement to something like this:
Code:
if last_news.read() in slashdot.read():
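To see the difference between the file object and its contents in isolation, here is a small Python 3 sketch (open() replaces the old file() built-in). The temporary file and the page_text string are hypothetical stand-ins for C:\Thing.txt and the downloaded page, so nothing here touches the network.

```python
import os
import tempfile

# Hypothetical stand-ins: a temp file plays the role of the saved
# news file and a short literal plays the role of the downloaded page.
page_text = "<html><body>Old headline, then newer stories</body></html>"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Old headline")
    path = tmp.name

with open(path) as last_news:
    try:
        found = last_news in page_text   # file object on the left: TypeError
    except TypeError as exc:
        found = None
        error = str(exc)
    saved = last_news.read()             # the file's contents as a string

os.remove(path)

print(error)
print(saved in page_text)
```

The first test fails with the same kind of TypeError as in the traceback, while the second compares text with text, which is what the in operator needs.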
#5
Code:
print "Checking for a /. update..."
import urllib
last_news = file('C:\Thing.txt', 'r')
news = last_news.readlines()
check_news = str(news)
slashdot = urllib.urlopen('http://slashdot.org')
if check_news in slashdot.read():
    print "There are no recent updates."
    last.close()
else:
    print "There is an update to go see."
    line = '<a HREF="//slashdot.org/search.pl?topic=1'
    new_news = line in slashdot.read()
    new1_news = str(new_news)
    last = file('C:\Thing.txt','w')
    last.write(new1_news)
    last.close()
OK, so I was able to get this. However, whenever I check whether a variable that has been assigned a string, such as check_news, appears in the page, it can never find it. I'm not sure why this is. Everything else works now, though. Thanks for the help so far.
#6
You're using the readlines() method, then converting the list to a string. This is what's actually happening:
Code:
>>> someLines  # Returned by readlines()
['line1', 'line2', 'line3']
>>> str(someLines)
"['line1', 'line2', 'line3']"
>>>
As you can see, converting a list to a string using str() doesn't really look "right" (you wouldn't find that text in most web pages). Just use the file object's read() method to get the whole file as a string.
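The same comparison can be tried without a real file or website. In this Python 3 sketch, io.StringIO stands in for the saved text file and page_text is a made-up page body; open_saved() is a hypothetical helper that just hands back a fresh copy of the "file" each time.

```python
import io


# StringIO stands in for the saved text file (contents are made up).
def open_saved():
    return io.StringIO("First headline\nSecond headline\n")


page_text = "<p>First headline\nSecond headline\n</p>"  # made-up page body

as_list_text = str(open_saved().readlines())  # text of a Python list literal
as_plain_text = open_saved().read()           # the file's actual contents

print(repr(as_list_text))
print(as_list_text in page_text)
print(as_plain_text in page_text)
```

The str() of the list carries brackets, quotes and escaped newlines that never appear in the page, so only the read() version can match.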
#7
[QUOTE=pylon]
Code:
print "Checking for a /. update..."
import urllib
last_news = file('C:\Thing.txt', 'r')
news = last_news.readlines()
check_news = str(news)
slashdot = urllib.urlopen('http://slashdot.org')
if check_news in slashdot.read():
    print "There are no recent updates."
    last.close()
else:
    print "There is an update to go see."
    line = '<a HREF="//slashdot.org/search.pl?topic=1'
    new_news = line in slashdot.read()
    new1_news = str(new_news)
    last = file('C:\Thing.txt','w')
    last.write(new1_news)
    last.close()
There's no way to say this without sounding like a smart-alec [Edit: OK maybe there is - see above ], but it can't find it because it isn't there.
Code:
news = last_news.readlines()
check_news = str(news)
file.readlines() returns a list, and str() of a list is literally a string representation of the list, with all the Python list delimiting characters in it. That literal text won't appear in a Slashdot page.
Slashdot has an RSS feed, which is a kind of distilled website: the content without the presentation and graphics. It would be much easier to use an RSS reading program, as they do just this - check for updates every so often and keep you informed.
Code:
if check_news in slashdot.read():
    print "There are no recent updates."
    last.close()
This will only tell you about recent updates once the content of check_news has fallen right off the site. To actually spot new news items, you would need to parse the HTML behind the site (View -> Source) to extract where the news items should be and look for new items. This is a technique known as screen-scraping, and it is notoriously troublesome and prone to breaking, as any change on the site can break your script. It's one reason why news sites use things like RSS to feed the latest news items to RSS client software.
If you really want to do it yourself, looking at the RSS file (linked at the end of the site - http://slashdot.org/index.rss ) would probably be ten times easier than looking at the main page. But reading that 'properly' would require some use of Python with XML, which I have never tried.
To make a horrible hack-job that might work, you could search the main content with a regular expression for the term "Posted by [any content] on [Day] [Month] [Year] @[any time]" and store the first match you find. That would tell you if there were new updates, but not what they were. But it would still be processing HTML with a regular expression (ick), be very prone to breaking, and be re-inventing the wheel.
Code:
else:
    print "There is an update to go see."
    line = '<a HREF="//slashdot.org/search.pl?topic=1'
    new_news = line in slashdot.read()
    new1_news = str(new_news)
The construct "A in B" is only a test: it returns True (A is in B) or False (A is not in B). It never extracts any of the content, so you would be writing "True" or "False" to the file. :|
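For anyone curious what the RSS route might look like, here is a rough Python 3 sketch using the standard library's xml.etree.ElementTree. Everything in SAMPLE_RSS is made up, and it assumes a plain RSS 2.0 layout with the newest item first; a real feed (Slashdot's included) may be structured differently, for example with XML namespaces, so treat item_titles() and has_update() as hypothetical helpers rather than something tested against the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed in RSS 2.0 style; a real feed may be laid
# out differently (for example RSS 1.0/RDF with XML namespaces).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Newest story</title><link>http://example.com/2</link></item>
    <item><title>Older story</title><link>http://example.com/1</link></item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the <title> text of every <item>, in feed order."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title") for item in root.iter("item")]


def has_update(rss_text, last_seen_title):
    """True if the first item differs from the title seen last time."""
    titles = item_titles(rss_text)
    return bool(titles) and titles[0] != last_seen_title


print(item_titles(SAMPLE_RSS))
print(has_update(SAMPLE_RSS, "Older story"))
```

The idea is to save the first title after each run and pass it back in as last_seen_title next time, instead of searching the whole page for old text.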
#8
Based on what you've told me, I decided to start off with a simpler site, www.half-life2.com/news.php
It hasn't done anything wrong yet. However, I was wondering how to get it to run on start-up on WinXP. Here's the code just in case you wanted to see it:
Code:
print "Checking for an update on Half-life2.com..."
import urllib
last_news = file('C:\Thing.txt', 'r')
news = last_news.read()
slashdot = urllib.urlopen('http://half-life2.com/news.php')
if news in slashdot.read():
    print "There are no recent updates."
    last_news.close()
else:
    print "There is an update to go see."
    line = "Arial,Helvetica,Geneva,Swiss,SunSans-Regular"
    news = line in slashdot.read()
    last = file('C:\Thing.txt','w')
    last.write(news)
    last.close()
#9
{deleted half post}
#10
Create a new shortcut and point it to "c:\python23\python.exe" (adjust this if you installed Python somewhere else). Edit the shortcut and change the target to:
"c:\python23\python.exe" "c:\path to myscript\myscript.py"
Then drop the shortcut into the Start menu under Programs -> Startup. You will probably want to add
Code:
raw_input("Press Enter to close...")
to the end of your program if you do this, so the window stays open long enough to read the output.
Pretend there has been an update, and put some made-up old news in c:\thing.txt. As soon as the script gets to the line
Code:
last.write(news)
it crashes with:
Code:
C:\>script1.py
Checking for an update on Half-life2.com...
There is an update to go see.
Traceback (most recent call last):
  File "C:\script1.py", line 15, in ?
    last.write(news)
TypeError: argument 1 must be string or read-only character buffer, not bool
C:\>
By that point news holds the result of the in test (True or False), not a string, so write() rejects it. This means it will never write anything to the file c:\thing.txt, which means it can never tell you if there are any updates.
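The crash can be reproduced without the website. In this Python 3 sketch, page_text is a made-up page body and io.StringIO stands in for the opened text file; the exact error wording differs between Python versions, but the cause is the same.

```python
import io

# Made-up page body that happens to contain the font list.
page_text = "<b>Headline</b> Arial,Helvetica,Geneva,Swiss,SunSans-Regular"
line = "Arial,Helvetica,Geneva,Swiss,SunSans-Regular"

news = line in page_text        # a bool, not a piece of the page
out = io.StringIO()             # stands in for the opened text file

try:
    out.write(news)             # write() only accepts text
    write_error = None
except TypeError as exc:
    write_error = str(exc)

out.write(str(news))            # runs, but only stores the word "True"

print(type(news).__name__, write_error)
print(out.getvalue())
```

Wrapping the value in str() makes the error go away, but the file then only ever contains "True" or "False", which is no use for comparing against the page next time.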
There's nothing simpler about any other news site. The first problem is that the news is buried in HTML and other markup describing how to display the text, where to put it, what font to use and so on, and this is universal to reading from any website. This is a problem because you have to manually sort through the code behind the site to find out where the adverts stop and the news begins.
The second problem is that news sites keep old news visible. If you get a story from Jan 20th, and then you get a new story, when you search for the story from Jan 20th it will still be there. You have to actually look in the place the new news will be.
Demonstration:
Code:
>>> import urllib2
>>> site = urllib2.urlopen('http://half-life2.com/news.php')
>>> site.read()
'\n<html>\n\n\t<head>\n\t\t<meta http-equiv="content-type" content="text/html;charset=iso-8859-1">\n\t\t<meta http-equiv="Page-Enter" content="blendTrans (Duration=0.25)">\n\t\t<meta name="AUTHOR" content="Valve Corporation">\n\t\t<LINK rel="stylesheet" type="text/css" href="vguide.css">\n\t\t<title>H A L F - L I F E 2</title>\n\t\t<csscriptdict>\n\t\t\t<script><!--\nCSInit = new Array;\nfunction CSScriptInit() {\nif(typeof(skipPage) != "undefined") { if(skipPage) return; }\nidxArray = new Array;\nfor(var i=0;i<CSInit.length;i++)\n\tidxArray[i] = i;\nCSAction2(CSInit, idxArray);}\nCSAg = window.navigator.userAgent; CSBVers = parseInt(CSAg.charAt(CSAg.indexOf("/")+1),10);\nCSIsW3CDOM
<snip>
>>>
That's what your program has to navigate...
Code:
line = "Arial,Helvetica,Geneva,Swiss,SunSans-Regular"
news = line in slashdot.read()
Since they put that at the start of every news item, it will always find it, which is not useful if you need the result to change when there is new news.
I would normally write some more code to show what I mean, but this is a hard problem and it would take ages. Poking at the half-life news site though, we can see this:
Code:
<!-- news content here! --> <p><font color="white" face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular"><a href="news.php?id=357" style="color: White;"><span class="class.01.newshed"><span class="class.newshed"><b>Valve Wins Summary Judgment Motions in Copyright Infringement Case</b></span></span></a></font><font color="white" face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular" size="2"><br> Valve today announced the U.S. Federal District Court in Seattle, WA granted its motion for summary judgment on the matters of Cyber Café Rights and Contractual Limitation of Liability in its copyright infringement suit with Sierra/Vivendi Universal Games. Click <a href="http://www.valvesoftware.com/C02-1683Z.htm">here</a> to read the judge's order.<div> </div></font><font size="2" color="white" face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular"><br> </font></p>
"<!-- news content here! -->" seems to mark the start of the news items, so you could search for that, then extract probably everything up to the matching closing paragraph tag (</p>), and store that in the text file...
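As a rough illustration of that idea, here is a Python 3 sketch that pulls out the text between the marker and the next </p> and compares it with what was saved last time. The marker string is the one quoted from the page; first_news_item(), check_for_update() and the sample text are hypothetical, and it assumes the page has already been downloaded and decoded to a string (in Python 3, urlopen().read() returns bytes).

```python
MARKER = "<!-- news content here! -->"   # marker quoted from the page above


def first_news_item(html):
    """Return the text between MARKER and the next </p>, or None."""
    start = html.find(MARKER)
    if start == -1:
        return None                      # marker missing: layout changed?
    start += len(MARKER)
    end = html.find("</p>", start)
    if end == -1:
        return None
    return html[start:end].strip()


def check_for_update(html, saved_item):
    """Return (changed, newest_item) for the downloaded page text."""
    newest = first_news_item(html)
    changed = newest is not None and newest != saved_item
    return changed, newest


# Made-up page text for illustration only.
sample = "ads..." + MARKER + " <p><b>Story two</b><br> Details.</p><p>Story one</p>"
changed, newest = check_for_update(sample, "<p><b>Story one</b>")
print(changed)
print(newest)
```

The script would then write newest (a string) to the text file and read it back as saved_item on the next run. It is still screen-scraping, so it will break as soon as the site changes its marker or layout.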
| Viewing: Dev Shed Forums > Programming Languages > Python Programming > Need some quick help with urllib |