Python Programming
 
  #1  
March 3rd, 2013, 09:20 PM
hu0r
Registered User
Join Date: Mar 2013
Posts: 16
While + BeautifulSoup + urllib2

Hello guys, I have a question about the "while" loop and the Beautiful Soup library. Say I have the page site.com/?cat=1&page=1 and the category has, for example, 10 pages. I want to parse all of them (with Beautiful Soup), maybe using while, but I don't know how.

Currently, I use this code:

Code:
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'somesite/?cat=188&paged=1'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
xa = soup.findAll('a', href=True)

url = 'somesite/cat?=188&paged=2'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)
xb = soup.findAll('a', href=True)

# ... MORE URLS ...

for web in xa:
    if web['href'].startswith('example'):
        print web['href']

for web in xb:
    if web['href'].startswith('example'):
        print web['href']

# ... MORE FOR's ...


But this is very crude code for parsing so many pages. If anyone can help me, I'll be grateful.

  #2  
March 3rd, 2013, 10:28 PM
Nolander21
Registered User
Join Date: Mar 2013
Posts: 12
Code:
pagenumber = 0
while(1):
    pagenumber = pagenumber + 1
    url = 'somesite/cat?=188&paged=%d' % (pagenumber)
    try:
        with urllib2.urlopen(url) as page:
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)

            for web in xa:
                if web['href'].startswith('example'):
                    print web['href']
    except IOError as e:
        break

  #3  
March 3rd, 2013, 10:53 PM
hu0r
Hello, thanks for replying. But I get this error:

Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 12, in <module>
    with urllib2.urlopen(url) as page:
AttributeError: addinfourl instance has no attribute '__exit__'


PS: I have Python 2.7
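
A note on the AttributeError above: the with statement requires its object to provide __enter__ and __exit__ methods, and the traceback indicates that the addinfourl object returned by urllib2.urlopen in Python 2.7 has no __exit__. The sketch below reproduces the situation with a hypothetical PlainResource class (not part of urllib2) that only offers close(), and shows contextlib.closing from the standard library as one way to use such an object in a with block. The exact exception type differs between Python versions, so both AttributeError and TypeError are caught.

```python
from contextlib import closing


class PlainResource(object):
    """Hypothetical stand-in for a response object that only has close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# Using the object directly in a with statement fails, because it defines
# neither __enter__ nor __exit__. The exception type depends on the Python
# version (AttributeError on older versions, TypeError on newer ones).
direct_with_failed = False
try:
    with PlainResource() as res:
        pass
except (AttributeError, TypeError):
    direct_with_failed = True

# closing() supplies the missing protocol and calls close() on exit.
resource = PlainResource()
with closing(resource) as res:
    pass

print(direct_with_failed)
print(resource.closed)
```

closing() only wraps the object; it does not change how the page is fetched or when the fetch fails.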

  #4  
March 3rd, 2013, 10:58 PM
Nolander21
I've never actually used that library. Try changing IOError to URLError

  #5  
March 3rd, 2013, 11:03 PM
hu0r
Quote:
Originally Posted by Nolander21
I've never actually used that library. Try changing IOError to URLError


Now I get this:

Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 20, in <module>
    except URLError as e:
NameError: name 'URLError' is not defined
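
The NameError above indicates that URLError was referenced without being imported. In Python 2 it is defined in the urllib2 module. Below is a minimal sketch of the import, together with a check of how the exception classes relate to IOError. The ImportError fallback to urllib.error is an addition for Python 3 readers and does not appear in the thread.

```python
try:
    # Python 2, as used in this thread
    from urllib2 import URLError, HTTPError
except ImportError:
    # Python 3 location of the same exception classes (assumption: not
    # needed on the poster's Python 2.7 setup)
    from urllib.error import URLError, HTTPError

# HTTPError derives from URLError, which in turn derives from IOError,
# so "except IOError" also catches both of them.
http_is_url_error = issubclass(HTTPError, URLError)
url_error_is_io_error = issubclass(URLError, IOError)

print(http_is_url_error)
print(url_error_is_io_error)
```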



  #6  
March 3rd, 2013, 11:19 PM
hu0r
Sorry for the double post, but I "fixed" the error with this code:


Code:
from BeautifulSoup import BeautifulSoup
from contextlib import closing
import urllib

pagenumber = 0
while(1):
    pagenumber = pagenumber + 1
    url = 'somesite/?cat=188&paged=%d' % (pagenumber)
    try:
        with closing(urllib.urlopen(url)) as page:
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)

            for web in xa:
                if web['href'].startswith('example'):
                    print web['href']
    except IOError as e:
        break


But the script never ends, and if I interrupt it (Ctrl+C) I get:

Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 14, in <module>
    with closing(urllib.urlopen(url)) as page:  
  File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 345, in open_http
    errcode, errmsg, headers = h.getreply()
  File "/usr/lib/python2.7/httplib.py", line 1102, in getreply
    response = self._conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
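
One possible explanation for the endless loop, offered as an assumption rather than something confirmed in the thread: in Python 2, urllib.urlopen does not raise an exception for HTTP error statuses such as 404 and returns the error page instead, so the except IOError branch may never run, whereas urllib2.urlopen raises HTTPError in that case. Independently of which library does the fetching, a paging loop can carry its own stop conditions. The sketch below uses hypothetical names (crawl_pages, fake_fetch_links, FAKE_SITE, max_pages) and a fake in-memory site in place of urlopen and BeautifulSoup. It stops on the first page with no links or after max_pages pages.

```python
def crawl_pages(fetch_links, max_pages=50):
    """Collect links page by page until a page has none or max_pages is hit."""
    collected = []
    for pagenumber in range(1, max_pages + 1):
        links = fetch_links(pagenumber)
        if not links:
            # Empty page: assume we ran past the last real page.
            break
        collected.extend(links)
    return collected


# Hypothetical stand-in for "urlopen + BeautifulSoup": three pages of links.
FAKE_SITE = {
    1: ['example/a', 'other/x'],
    2: ['example/b'],
    3: ['example/c'],
}


def fake_fetch_links(pagenumber):
    # Pages that do not exist yield an empty list instead of raising.
    return FAKE_SITE.get(pagenumber, [])


result = crawl_pages(fake_fetch_links)
matching = [href for href in result if href.startswith('example')]

print(result)
print(matching)
```

The max_pages cap is a safety net: even if the empty-page check never triggers, the loop cannot run forever.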

  #7  
March 3rd, 2013, 11:22 PM
Nolander21
Quote:
Originally Posted by hu0r
Now I get this:

Code:
Traceback (most recent call last):
  File "/home/XXX/Escritorio/while3.py", line 20, in <module>
    except URLError as e:
NameError: name 'URLError' is not defined




That's simply because you needed to import URLError. But since it is a subclass of IOError, IOError should have worked. I think the problem is with the with statement. Some unfortunate bug with that particular lib: http://bugs.python.org/issue12955.
Anyway, try this:

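Code:
import urllib2
from BeautifulSoup import BeautifulSoup

pagenumber = 0
while(1):
    pagenumber = pagenumber + 1
    url = 'somesite/cat?=188&paged=%d' % (pagenumber)
    try:
        # open the page without "with" and close it by hand further down
        page = urllib2.urlopen(url)
        data = page.read()
        soup = BeautifulSoup(data)
        xa = soup.findAll('a', href=True)

        for web in xa:
            if web['href'].startswith('example'):
                print web['href']
        page.close()
    except IOError as e:
        # urlopen failed: stop paging
        break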

  #8  
March 3rd, 2013, 11:34 PM
hu0r
Hey! That works! Why did you remove the "with" and add "page.close()"? Can you explain it to me?

Thanks for your time!

  #9  
March 3rd, 2013, 11:44 PM
Nolander21
Quote:
Originally Posted by hu0r
Hey! That works! Why did you remove the "with" and add "page.close()"? Can you explain it to me?

Thanks for your time!


No problem. 'with' is the preferred method for file handling because it safely closes the file after its use. And now that I think about it, your code should preferably be like this:
Code:
pagenumber = 0
while(1):
    pagenumber = pagenumber + 1
    url = 'somesite/cat?=188&paged=%d' % (pagenumber)
    try:
        page = urllib2.urlopen(url)
        data = page.read()
        soup = BeautifulSoup(data)
        xa = soup.findAll('a', href=True)

        for web in xa:
            if web['href'].startswith('example'):
                print web['href']
    except IOError as e:
        break
    finally:
        page.close()


This way, even if something goes wrong in your try block, page will always be closed safely.

EDIT: On third thought, if this last version doesn't work, forget about it. (finally will probably execute despite the break statement in the except block.)
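
A note on the finally version: finally does run even when the except block executes break. One caveat is that if urllib2.urlopen itself raises, page was never assigned on that pass, so page.close() in finally would fail with a NameError on the first page, or close the previous page again on later ones. Below is a minimal sketch of one way around this, assuming hypothetical stand-ins FakePage and fake_urlopen (a two-page fake site) instead of real network calls: reset page to None before each attempt and close it only if it was opened.

```python
class FakePage(object):
    """Hypothetical stand-in for the object returned by urllib2.urlopen."""

    def __init__(self, number):
        self.number = number
        self.closed = False

    def read(self):
        return '<a href="example/%d">link</a>' % self.number

    def close(self):
        self.closed = True


def fake_urlopen(number):
    # Pretend the site has two pages; any other page fails like a bad request.
    if number > 2:
        raise IOError('no such page: %d' % number)
    return FakePage(number)


log = []
pages = []
pagenumber = 0
while True:
    pagenumber += 1
    page = None  # reset so finally never sees a missing or stale page
    try:
        page = fake_urlopen(pagenumber)
        pages.append(page)
        data = page.read()
    except IOError:
        log.append('except on page %d' % pagenumber)
        break
    finally:
        # Runs on every pass, including the one that ends with break.
        log.append('finally on page %d' % pagenumber)
        if page is not None:
            page.close()

print(log)
```

The log records a finally entry for the failing page as well, which shows that break does not skip the finally block.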

  #10  
March 3rd, 2013, 11:58 PM
hu0r
Ok. Thanks again, definitely solved!

Greetings!

