#1 · Registered User, Devshed Newbie (0 - 499 posts) · Join Date: Mar 2013 · Posts: 16 · Rep Power: 0

    While + BeautifulSoup + urllib2


    Hello guys, I have a question about the "while" loop and the Beautiful Soup library. If I have the page site.com/?cat=1&page=1 and the category has, say, 10 pages, I want to parse all of them (with Beautiful Soup), maybe using while, but I don't know how.

    Currently, I use this code:

    PHP Code:
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = 'somesite/?cat=188&paged=1'
    data = urllib2.urlopen(url).read()
    soup = BeautifulSoup(data)
    xa = soup.findAll('a', href=True)

    url = 'somesite/cat?=188&paged=2'
    data = urllib2.urlopen(url).read()
    soup = BeautifulSoup(data)
    xb = soup.findAll('a', href=True)

    # ... MORE URLS ...

    for web in xa:
        if web['href'].startswith('example'):
            print web['href']

    for web in xb:
        if web['href'].startswith('example'):
            print web['href']

    # ... MORE FOR's ...
    But this is very crude code for parsing so many pages. If anyone can help me, I'll be grateful.
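    The repeated fetch-and-parse blocks can be collapsed by generating each URL from its page number. A minimal sketch of just the URL-building step, using the placeholder site and category id from the post (`page_url` is a hypothetical helper, not from any library):

```python
def page_url(base, cat, page):
    # Build the URL of one listing page from its page number,
    # instead of hard-coding a separate url/data/soup block per page.
    return '%s/?cat=%d&paged=%d' % (base, cat, page)

# One URL per page for a 10-page category.
urls = [page_url('somesite', 188, n) for n in range(1, 11)]
```

    Each URL can then be fetched and parsed inside a single loop body rather than with one copy-pasted block per page.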
#2 · Registered User · Join Date: Mar 2013 · Posts: 12
    PHP Code:
    pagenumber = 0
    while(1):
        pagenumber = pagenumber + 1
        url = 'somesite/cat?=188&paged=%d' % (pagenumber)
        try:
            with urllib2.urlopen(url) as page:
                data = page.read()
                soup = BeautifulSoup(data)
                xa = soup.findAll('a', href=True)

                for web in xa:
                    if web['href'].startswith('example'):
                        print web['href']
        except IOError as e:
            break
#3 · Registered User · Join Date: Mar 2013 · Posts: 16
    Originally Posted by Nolander21
    PHP Code:
    pagenumber = 0
    while(1):
        pagenumber = pagenumber + 1
        url = 'somesite/cat?=188&paged=%d' % (pagenumber)
        try:
            with urllib2.urlopen(url) as page:
                data = page.read()
                soup = BeautifulSoup(data)
                xa = soup.findAll('a', href=True)

                for web in xa:
                    if web['href'].startswith('example'):
                        print web['href']
        except IOError as e:
            break
    Hello, thanks for replying. But I get this error:

    Code:
    Traceback (most recent call last):
      File "/home/XXX/Escritorio/while3.py", line 12, in <module>
        with urllib2.urlopen(url) as page:
    AttributeError: addinfourl instance has no attribute '__exit__'
    PS: I have Python 2.7
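    The AttributeError here is because, on Python 2.7, the object `urllib2.urlopen()` returns defines `close()` but not the `__enter__`/`__exit__` methods a with-statement requires; `contextlib.closing` supplies them. A minimal sketch of the same failure mode and fix with a stand-in object (`FakeResponse` is hypothetical, not part of urllib2):

```python
from contextlib import closing

class FakeResponse(object):
    # Stand-in for urllib2's response object: it has close() but no
    # __enter__/__exit__, so using it directly in a with-statement
    # fails with the same kind of AttributeError.
    def __init__(self):
        self.closed = False
    def read(self):
        return '<html></html>'
    def close(self):
        self.closed = True

resp = FakeResponse()
with closing(resp) as page:  # closing() wraps any close()-able object
    data = page.read()
# On exiting the block, closing() has called resp.close() for us.
```

    The same `closing(urllib2.urlopen(url))` wrapping should make the with-statement work on 2.7.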
#4 · Registered User · Join Date: Mar 2013 · Posts: 12
    I've never actually used that library. Try changing IOError to URLError.
#5 · Registered User · Join Date: Mar 2013 · Posts: 16
    Originally Posted by Nolander21
    I've never actually used that library. Try changing IOError to URLError
    Now I get this:

    Code:
    Traceback (most recent call last):
      File "/home/XXX/Escritorio/while3.py", line 20, in <module>
        except URLError as e:
    NameError: name 'URLError' is not defined
#6 · Registered User · Join Date: Mar 2013 · Posts: 16
    Sorry for the double post, but I "fixed" the error with this code:


    PHP Code:
    from BeautifulSoup import BeautifulSoup
    from contextlib import closing
    import urllib

    pagenumber = 0
    while(1):
        pagenumber = pagenumber + 1
        url = 'somesite/?cat=188&paged=%d' % (pagenumber)
        try:
            with closing(urllib.urlopen(url)) as page:
                data = page.read()
                soup = BeautifulSoup(data)
                xa = soup.findAll('a', href=True)

                for web in xa:
                    if web['href'].startswith('example'):
                        print web['href']
        except IOError as e:
            break
    But the script never ends, and if I kill it (Ctrl+C) I get:

    Code:
    Traceback (most recent call last):
      File "/home/XXX/Escritorio/while3.py", line 14, in <module>
        with closing(urllib.urlopen(url)) as page:  
      File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
        return opener.open(url)
      File "/usr/lib/python2.7/urllib.py", line 207, in open
        return getattr(self, name)(url)
      File "/usr/lib/python2.7/urllib.py", line 345, in open_http
        errcode, errmsg, headers = h.getreply()
      File "/usr/lib/python2.7/httplib.py", line 1102, in getreply
        response = self._conn.getresponse()
      File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
        response.begin()
      File "/usr/lib/python2.7/httplib.py", line 407, in begin
        version, status, reason = self._read_status()
      File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
        line = self.fp.readline()
      File "/usr/lib/python2.7/socket.py", line 430, in readline
        data = recv(1)
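    A likely reason the loop never ends: many sites answer out-of-range page numbers with a normal (possibly empty) page instead of an HTTP error, so the `except IOError` branch never fires. Stopping when a page yields no links is more robust. A sketch under that assumption, with `fetch_hrefs` standing in for the urlopen-plus-BeautifulSoup step (both the helper name and the fake data below are hypothetical):

```python
def collect_links(fetch_hrefs, prefix, max_pages=1000):
    # fetch_hrefs(n) returns the hrefs found on page n, or [] when the
    # page is empty/missing. Stop on the first empty page instead of
    # waiting for the server to raise an error that may never come.
    found = []
    for n in range(1, max_pages + 1):
        hrefs = fetch_hrefs(n)
        if not hrefs:
            break
        found.extend(h for h in hrefs if h.startswith(prefix))
    return found

# Usage with a fake fetcher: pages 1-2 have links, page 3 is empty.
fake = {1: ['example/a', 'other/b'], 2: ['example/c']}
links = collect_links(lambda n: fake.get(n, []), 'example')
```

    The `max_pages` cap is a safety net so a site that always returns content cannot loop forever.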
#7 · Registered User · Join Date: Mar 2013 · Posts: 12
    Originally Posted by hu0r
    Now I get this:

    Code:
    Traceback (most recent call last):
      File "/home/XXX/Escritorio/while3.py", line 20, in <module>
        except URLError as e:
    NameError: name 'URLError' is not defined
    That's simply because you needed to import URLError. But since it is a subclass of IOError, IOError should have worked. I think the problem is with the with statement. Some unfortunate bug with that particular lib: http://bugs.python.org/issue12955.
    Anyway, try this:

    PHP Code:
    pagenumber = 0
    while(1):
        pagenumber = pagenumber + 1
        url = 'somesite/cat?=188&paged=%d' % (pagenumber)
        try:
            page = urllib2.urlopen(url)
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)

            for web in xa:
                if web['href'].startswith('example'):
                    print web['href']
            page.close()
        except IOError as e:
            break
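    The subclass relationship mentioned above can be checked directly. On Python 2 `URLError` lives in `urllib2`; Python 3 moved it to `urllib.error`, where it subclasses `OSError` (of which `IOError` is an alias), so an `except IOError:` handler catches it either way:

```python
try:
    from urllib2 import URLError        # Python 2
except ImportError:
    from urllib.error import URLError   # Python 3 location

# Because URLError subclasses IOError, `except IOError:` also catches
# failed URL opens; importing URLError is only needed if you want to
# name it explicitly in the except clause.
is_subclass = issubclass(URLError, IOError)
```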
#8 · Registered User · Join Date: Mar 2013 · Posts: 16
    Originally Posted by Nolander21
    That's simply because you needed to import URLError. But since it is a subclass of IOError, IOError should have worked. I think the problem is with the with statement. Some unfortunate bug with that particular lib: xxx
    Anyway, try this:

    PHP Code:
    pagenumber = 0
    while(1):
        pagenumber = pagenumber + 1
        url = 'somesite/cat?=188&paged=%d' % (pagenumber)
        try:
            page = urllib2.urlopen(url)
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)

            for web in xa:
                if web['href'].startswith('example'):
                    print web['href']
            page.close()
        except IOError as e:
            break
    Hey! That works! Why did you remove the "with" and add the "page.close()"? Can you explain?

    Thanks for your time!
#9 · Registered User · Join Date: Mar 2013 · Posts: 12
    Originally Posted by hu0r
    Hey! That works! Why did you remove the "with" and add the "page.close()"? Can you explain?

    Thanks for your time!
    No problem. 'with' is the preferred method for file handling because it safely closes the file after use. And now that I think about it, your code should preferably be like this:
    PHP Code:
    pagenumber = 0
    while(1):
        pagenumber = pagenumber + 1
        url = 'somesite/cat?=188&paged=%d' % (pagenumber)
        try:
            page = urllib2.urlopen(url)
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)

            for web in xa:
                if web['href'].startswith('example'):
                    print web['href']
        except IOError as e:
            break
        finally:
            page.close()
    This way, even if something goes wrong in your try block, page will always be closed safely.

    EDIT: On third thought, if this last version doesn't work, forget about it. (finally will still execute despite the break statement in the except block.)
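    One caveat with the finally version above: if `urllib2.urlopen` itself raises, `page` is never bound, and the `page.close()` in finally would then raise a NameError of its own. Pre-binding and guarding avoids this. A sketch with stand-in objects (`fetch`, `Page`, and `bad_opener` are all hypothetical):

```python
def fetch(opener):
    # finally runs whether the try body succeeds, raises, or is left
    # via break/return; guard close() in case opener() failed before
    # `page` was ever bound.
    page = None
    try:
        page = opener()
        return page.read()
    except IOError:
        return None
    finally:
        if page is not None:
            page.close()

class Page(object):
    def __init__(self):
        self.closed = False
    def read(self):
        return 'body'
    def close(self):
        self.closed = True

def bad_opener():
    raise IOError('connection refused')

ok = fetch(Page)            # read succeeds; page closed in finally
failed = fetch(bad_opener)  # open fails; no NameError in finally
```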
#10 · Registered User · Join Date: Mar 2013 · Posts: 16
    Ok. Thanks again, definitely solved!

    Greetings!

