While + BeautifulSoup + urllib2

March 3rd, 2013, 09:20 PM
Posted by hu0r
Hello guys, I have a question about the "while" loop and the Beautiful Soup library. Suppose I have the page site.com/?cat=1&page=1 and the category has, for example, 10 pages. I want to parse all of them (with Beautiful Soup), maybe using while, but I don't know how.
Currently, I use this code:
Code:
import urllib2
from BeautifulSoup import BeautifulSoup

# Page 1 of the category
url = 'somesite/?cat=188&paged=1'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
xa = soup.findAll('a', href=True)

# Page 2 of the category
url = 'somesite/?cat=188&paged=2'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
xb = soup.findAll('a', href=True)
# ... MORE URLS ...

# Print only the links that start with 'example'
for web in xa:
    if web['href'].startswith('example'):
        print web['href']
for web in xb:
    if web['href'].startswith('example'):
        print web['href']
# ... MORE FOR's ...
But this is very crude code for parsing many pages. If anyone can help me, I'll be grateful.
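The repeated "find the links, keep those with a given prefix" step in the code above could be factored into one function, so that each page needs a single call. This is only a sketch: it swaps in the standard library's HTMLParser instead of Beautiful Soup so that it runs without extra packages, and the names LinkCollector, matching_links and sample are hypothetical, not from the thread.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in this thread


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag whose href starts with a prefix."""

    def __init__(self, prefix):
        HTMLParser.__init__(self)
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value and value.startswith(self.prefix):
                self.links.append(value)


def matching_links(html, prefix):
    """Return the matching hrefs found in one page of HTML."""
    collector = LinkCollector(prefix)
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up HTML standing in for a downloaded page
sample = '<a href="example/1">one</a> <a href="other/2">two</a> <a>no href</a>'
links = matching_links(sample, 'example')
```

With a helper like this, handling another page means one more call to matching_links rather than another pair of variables and another for loop.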

March 3rd, 2013, 10:28 PM
Posted by Nolander21
Code:
pagenumber = 0
while True:
    pagenumber += 1
    url = 'somesite/?cat=188&paged=%d' % (pagenumber)
    try:
        with urllib2.urlopen(url) as page:
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)
            for web in xa:
                if web['href'].startswith('example'):
                    print web['href']
    except IOError as e:
        # Stop at the first page that cannot be fetched
        break

March 3rd, 2013, 10:53 PM
Posted by hu0r
Hello, thanks for replying. But I get this error:
Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 12, in <module>
    with urllib2.urlopen(url) as page:
AttributeError: addinfourl instance has no attribute '__exit__'
P.S.: I have Python 2.7
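The AttributeError in the traceback above says that the object returned by urlopen has no `__exit__` method, which the with statement requires. For an object that only offers close(), contextlib.closing can supply the missing methods. A minimal sketch, assuming a hypothetical FakeResponse class that stands in for the urlopen result so that no network is needed:

```python
from contextlib import closing


class FakeResponse(object):
    """Hypothetical stand-in for a urlopen() result: close() but no __exit__."""

    def __init__(self, body):
        self.body = body
        self.closed = False

    def read(self):
        return self.body

    def close(self):
        self.closed = True


response = FakeResponse("<a href='example/1'>one</a>")

# closing() provides __enter__/__exit__ and calls close() when the block ends
with closing(response) as page:
    data = page.read()
```

After the with block, response.closed is True even though FakeResponse itself knows nothing about the with statement.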

March 3rd, 2013, 10:58 PM
Posted by Nolander21
I've never actually used that library. Try changing IOError to URLError

March 3rd, 2013, 11:03 PM
Posted by hu0r
Quote: Originally Posted by Nolander21
I've never actually used that library. Try changing IOError to URLError
Now I get this:
Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 20, in <module>
    except URLError as e:
NameError: name 'URLError' is not defined


March 3rd, 2013, 11:19 PM
Posted by hu0r
Sorry for the double post, but I "fixed" the error with this code:
Code:
from BeautifulSoup import BeautifulSoup
from contextlib import closing
import urllib

pagenumber = 0
while True:
    pagenumber += 1
    url = 'somesite/?cat=188&paged=%d' % (pagenumber)
    try:
        # closing() lets the urlopen result be used in a with statement
        with closing(urllib.urlopen(url)) as page:
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)
            for web in xa:
                if web['href'].startswith('example'):
                    print web['href']
    except IOError as e:
        break
But the script never ends, and if I interrupt it (Ctrl+C) I get:
Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 14, in <module>
    with closing(urllib.urlopen(url)) as page:
  File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 345, in open_http
    errcode, errmsg, headers = h.getreply()
  File "/usr/lib/python2.7/httplib.py", line 1102, in getreply
    response = self._conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
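One likely reason the loop above never stops: in Python 2, urllib.urlopen (unlike urllib2.urlopen) does not raise an exception for HTTP error responses such as 404, so the except IOError branch may never run and the loop keeps requesting pages. Whatever the cause, an upper bound on the page number keeps such a loop from running forever. A defensive sketch that uses no network; collect_pages, fetch_page, fake_fetch, endless_fetch and FAKE_SITE are hypothetical names, not from the thread:

```python
def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until it fails or the cap is hit.

    fetch_page is a hypothetical callable that returns the page body and
    raises IOError when the page cannot be fetched.
    """
    bodies = []
    for pagenumber in range(1, max_pages + 1):
        try:
            body = fetch_page(pagenumber)
        except IOError:
            break  # the fetch failed: stop, as in the thread's loop
        if not body:
            break  # an empty body: assume we ran past the last page
        bodies.append(body)
    return bodies


# Stand-in for the network: three pages exist, everything else is missing
FAKE_SITE = {1: "page one", 2: "page two", 3: "page three"}


def fake_fetch(pagenumber):
    if pagenumber not in FAKE_SITE:
        raise IOError("no such page: %d" % pagenumber)
    return FAKE_SITE[pagenumber]


def endless_fetch(pagenumber):
    # Mimics a server that answers every request, even past the last page
    return "page %d" % pagenumber


bodies = collect_pages(fake_fetch)
capped = collect_pages(endless_fetch, max_pages=5)
```

With fake_fetch the loop stops at the first missing page; with endless_fetch only the max_pages cap ends it.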

March 3rd, 2013, 11:22 PM
Posted by Nolander21
Quote: Originally Posted by hu0r
Now I get this:
Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 20, in <module>
    except URLError as e:
NameError: name 'URLError' is not defined
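That's simply because URLError has to be imported first (it lives in urllib2). But since it is a subclass of IOError, catching IOError should have worked. I think the real problem is the with statement: in Python 2.7 the object returned by urllib2.urlopen has no __exit__ method, so it cannot be used with "with". It's an unfortunate limitation of that particular library: http://bugs.python.org/issue12955.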
Anyway, try this:
Code:
pagenumber = 0
while True:
    pagenumber += 1
    url = 'somesite/?cat=188&paged=%d' % (pagenumber)
    try:
        page = urllib2.urlopen(url)
        data = page.read()
        soup = BeautifulSoup(data)
        xa = soup.findAll('a', href=True)
        for web in xa:
            if web['href'].startswith('example'):
                print web['href']
        # Close the response by hand, since "with" is not available here
        page.close()
    except IOError as e:
        # Stop at the first page that cannot be fetched
        break
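The subclass relationship mentioned in this post can be checked directly, without any network access. A small sketch; the import fallback is there only because urllib2 was split into urllib.request and urllib.error in Python 3, and the variable names are made up for illustration:

```python
try:
    # Python 2, as used in this thread
    from urllib2 import URLError, HTTPError
except ImportError:
    # Python 3 moved these classes to urllib.error
    from urllib.error import URLError, HTTPError

# HTTPError derives from URLError, which derives from IOError,
# so "except IOError" catches both of them
url_error_is_io_error = issubclass(URLError, IOError)
http_error_is_url_error = issubclass(HTTPError, URLError)

try:
    raise URLError("simulated failure, no network involved")
except IOError as e:
    caught = type(e).__name__
```

Catching URLError by name still needs the import shown at the top; catching IOError needs no import because it is a built-in.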

March 3rd, 2013, 11:34 PM
Posted by hu0r
Hey! That works! Why did you remove the "with" and add "page.close()"? Can you explain it to me?
Thanks for your time!

March 3rd, 2013, 11:44 PM
Posted by Nolander21
Quote: Originally Posted by hu0r
Hey! That works! Why did you remove the "with" and add "page.close()"? Can you explain it to me?
Thanks for your time!
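No problem. 'with' is the preferred way to handle files because it safely closes the file after use. Since the object returned by urllib2.urlopen doesn't support 'with' in Python 2.7, I replaced it with an explicit page.close(). And now that I think about it, your code should preferably look like this: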
Code:
pagenumber = 0
while True:
    pagenumber += 1
    url = 'somesite/?cat=188&paged=%d' % (pagenumber)
    page = None  # reset so finally never sees an undefined or stale page
    try:
        page = urllib2.urlopen(url)
        data = page.read()
        soup = BeautifulSoup(data)
        xa = soup.findAll('a', href=True)
        for web in xa:
            if web['href'].startswith('example'):
                print web['href']
    except IOError as e:
        break
    finally:
        # Runs on every iteration, even when the except block breaks
        if page is not None:
            page.close()
This way, even if something goes wrong in your try block, page will always be closed safely.
EDIT: On third thought, if this last version doesn't work, forget about it. (The finally block will execute despite the break statement in the except block.)
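The behaviour described in the EDIT can be confirmed with a small experiment: a finally block runs on every pass through the loop, including the pass where the except block breaks out. The sketch below simulates the failure instead of touching the network; events and attempt are made-up names.

```python
events = []

for attempt in range(3):
    try:
        if attempt == 1:
            raise IOError("simulated fetch failure")
        events.append("fetched %d" % attempt)
    except IOError:
        events.append("break on %d" % attempt)
        break
    finally:
        # Runs on every iteration, including the one that breaks
        events.append("cleanup %d" % attempt)

# events is now:
# ['fetched 0', 'cleanup 0', 'break on 1', 'cleanup 1']
```

Because finally also runs when the try body fails on its very first statement, any name used inside finally has to be defined before the try block starts.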

March 3rd, 2013, 11:58 PM
Posted by hu0r
Ok. Thanks again, definitely solved!
Greetings!
