#1
    Registered User, Devshed Newbie (joined Mar 2013, 16 posts, Rep Power 0)

    Duplicated links when parsing


    Hello guys, I have a new question... When I parse a web page, duplicate links get printed, which is a bit annoying.

    The tags:

    Code:
    <div style="padding:10px 0px 0px 0px;">
    <div class="pagerighttitle">
    <a href="http://somesite/f99/2144415-imagenes-wtf/">
    <img src="http://somesite/wp-content/uploads/2013/02/SHb0t-120x73.jpg" alt="Imágenes WTF" width="120" height="73" align="left" /></a>
    <h5> &nbsp;
    <img src="http://image/azdh92.png" border="0" /> <a href="http://somesite/f99/2144415-imagenes-wtf/"> Imágenes WTF </a>
    </h5>
    <center>
    <p>WTF! WTF!!</p>
    </center>
    <div style="clear:both;">
    </div>
    </div>
    </div>
    And the parsing code:

    PHP Code:
    pagenumber = 0
    while(1):
        pagenumber = pagenumber + 1
        url = 'somesite/cat?=188&paged=%d' % (pagenumber)
        try:
            page = urllib2.urlopen(url)
            data = page.read()
            soup = BeautifulSoup(data)
            xa = soup.findAll('a', href=True)

            for web in xa:
                if web['href'].startswith('http://somesite/f99/'):
                    print web['href']

        except IOError as e:
            break
        finally:
            page.close()
    The problem is that there are two identical tags: <a href="somesite/f99/2144415-imagenes-wtf/"> Imágenes WTF </a>, and the parser prints both. I do not know if there is any way to tell them apart, or perhaps print only the first (or second), or just drop the duplicate data when printing.

    Code:
    http://somesite/f99/2074481-funny-pictures-259-a/
    http://somesite/f99/2074481-funny-pictures-259-a/
    http://somesite/f99/2073758-la-vida-de-samara-despues-del-exito-de-el-aro/
    http://somesite/f99/2073758-la-vida-de-samara-despues-del-exito-de-el-aro/
    http://somesite/f99/2073493-mas-mas-imagenes-60-a/
    http://somesite/f99/2073493-mas-mas-imagenes-60-a/
    ...
    etc
    If anyone knows of a way to do this, I will be very grateful! Thanks and regards!!
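    For what it's worth, one way to get "print only the first" is to remember the hrefs already printed in a set and skip anything seen before. A minimal sketch, with a hard-coded list standing in for the hrefs BeautifulSoup would return:

    Code:
    # Sketch: print each matching link only once, keeping the original order.
    # The 'hrefs' list stands in for [web['href'] for web in xa] from the code above.
    hrefs = [
        'http://somesite/f99/2074481-funny-pictures-259-a/',
        'http://somesite/f99/2074481-funny-pictures-259-a/',
        'http://somesite/f99/2073758-la-vida-de-samara-despues-del-exito-de-el-aro/',
        'http://somesite/f99/2073758-la-vida-de-samara-despues-del-exito-de-el-aro/',
    ]

    seen = set()                      # hrefs printed so far
    for href in hrefs:
        if href.startswith('http://somesite/f99/') and href not in seen:
            seen.add(href)
            print(href)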
#2
    Contributing User b49P23TIvg, Devshed Demi-God (joined Aug 2011, 4,997 posts, Rep Power 481)
    Something like this might reduce the problems from which you suffer.
    Code:
    class myprint:
    
        def __init__(self):
            self.previous = ''
    
        def __call__(self,message,Tuple=()):
            s = message%Tuple
            if s != self.previous:
                self.previous = s
                print(s)
    
    pr = myprint()
    
    #...
    
                if web['href'].startswith('http://somesite/f99/'):
                    pr(web['href'])
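    As a side note on the sketch above: myprint only suppresses a message identical to the one printed immediately before it, which happens to be enough here because the duplicate links come in adjacent pairs. A quick check, assuming the class above is defined:

    Code:
    pr = myprint()
    pr('http://somesite/f99/a/')   # printed
    pr('http://somesite/f99/a/')   # suppressed (same as the previous message)
    pr('http://somesite/f99/b/')   # printed
    pr('http://somesite/f99/a/')   # printed again; only consecutive repeats are dropped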
    [code]Code tags[/code] are essential for python code and Makefiles!
#3
    Registered User, Devshed Newbie (joined Mar 2013, 16 posts, Rep Power 0)
    Originally Posted by b49P23TIvg
    Something like this might reduce the problems from which you suffer.
    Code:
    class myprint:
    
        def __init__(self):
            self.previous = ''
    
        def __call__(self,message,Tuple=()):
            s = message%Tuple
            if s != self.previous:
                self.previous = s
                print(s)
    
    pr = myprint()
    
    #...
    
                if web['href'].startswith('http://somesite/f99/'):
                    pr(web['href'])
    OMG! That works!! I'll have to study a little about "classes"!
    Thank you very much!
#4
    Registered User, Devshed Newbie (joined Mar 2013, 16 posts, Rep Power 0)
    (Sorry for double post)

    I do not want to open another thread, so here we go... When I want to export the list of results to a file, I have a problem.

    If I use this:

    PHP Code:
    with open('chicas2.txt', 'a') as f:
        f.write(str(web['href']))
    I get the duplicated links. (Read above)

    And I can't use the class defined above, i.e.:

    PHP Code:
    with open('chicas2.txt', 'a') as f:
        f.write(str(pr(web['href'])))
    Furthermore, in the first case, the URLs are written all run together, i.e. without line breaks. Can anyone help me?
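    One possible way around both problems at once is a variant of the same idea that writes to the file itself and adds the newline. This is only a sketch; the filewriter name is made up, not something from this thread:

    Code:
    # Hypothetical variant of myprint: appends each *new* line to a file, with '\n'.
    class filewriter:

        def __init__(self, filename):
            self.filename = filename
            self.previous = ''

        def __call__(self, message):
            if message != self.previous:
                self.previous = message
                with open(self.filename, 'a') as f:
                    f.write(message + '\n')

    fw = filewriter('chicas2.txt')
    fw('http://somesite/foro/f240/2145088')   # written
    fw('http://somesite/foro/f240/2145088')   # skipped (same as previous)
    fw('http://somesite/foro/f240/2144235')   # written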
#5
    Contributing User b49P23TIvg, Devshed Demi-God (joined Aug 2011, 4,997 posts, Rep Power 481)
    so change the __call__ method to __str__ then ...
    Oh no, that won't work.

    Please provide the sample input and the corresponding output. And, if it's not obvious from notes in this thread, tell the reason for the output.

    In other words, provide some test cases.
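    For the record, the reason f.write(str(pr(web['href']))) cannot work as written: __call__ prints as a side effect and has no return statement, so the call evaluates to None and str(None) is the text that lands in the file. A small demonstration, assuming the myprint class from post #2 is defined:

    Code:
    pr = myprint()
    result = pr('http://somesite/f99/example/')   # prints the URL as a side effect
    print(repr(result))                           # None -- __call__ returns nothing
    # so f.write(str(pr(...))) would append the literal string 'None' to the file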
    [code]Code tags[/code] are essential for python code and Makefiles!
#6
    Registered User, Devshed Newbie (joined Mar 2013, 16 posts, Rep Power 0)
    Originally Posted by b49P23TIvg
    so change the __call__ method to __str__ then ...
    Oh no, that won't work.

    Please provide the sample input and the corresponding output. And, if it's not obvious from notes in this thread, tell the reason for the output.

    In other words, provide some test cases.
    Thanks for replying. Let me explain.

    The code I'm writing is meant to list all the entries a web page has had, in order to identify the creator of each entry (for statistics). To do this, I had thought (since I do not have much knowledge of Python) of exporting the links to a text file and then writing another script to read it back, parse it, and extract the information I need. Alternatively (and more difficult, for the reason noted above), I would avoid printing the duplicate links and parse them directly to extract the information.

    Well, here is what you asked for. I've streamlined the code to show only what's important.

    Input:

    PHP Code:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    import urllib2
    from BeautifulSoup import BeautifulSoup

    class myprint:

        def __init__(self):
            self.previous = ''

        def __call__(self,message,Tuple=()):
            s = message%Tuple
            if s != self.previous:
                self.previous = s
                print(s)

    pr = myprint()


    # --------------------

    # -------- CODE ----------------

    # -------------------

                            if section == "Humor":
                                pagenumber = 0
                                while(1):
                                    pagenumber = pagenumber + 1
                                    url = 'http://somesite/?cat=188&paged=%d' % (pagenumber)
                                    try:
                                        page = urllib2.urlopen(url)
                                        data = page.read()
                                        soup = BeautifulSoup(data)
                                        xa = soup.findAll('a', href=True)

                                        for web in xa:
                                            if web['href'].startswith('http://somesite.cl/foro/f99'):
                                                pr(web['href'])
                                    except IOError as e:
                                        break
                                    finally:
                                        page.close()
                                break

    # ------------------------------------

    # MORE CODE

    # ------------------------------------

    print "Finish..."
    raw_input()
    Output (console):

    Code:
    http://somesite/foro/f240/2088338
    http://somesite/foro/f240/2082419
    http://somesite/foro/f240/2080238
    http://somesite/foro/f240/2074266
    http://somesite/foro/f240/2078141
    http://somesite/foro/f240/2076096
    ...
    Etc..
    ...
    Enter to Finish...
    Exporting to a text file using:

    PHP Code:
    with open('chicas2.txt', 'a') as f:
        f.write(str(web['href']))
    PHP Code:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    import urllib2
    from BeautifulSoup import BeautifulSoup

    class myprint:

        def __init__(self):
            self.previous = ''

        def __call__(self,message,Tuple=()):
            s = message%Tuple
            if s != self.previous:
                self.previous = s
                print(s)

    pr = myprint()


    # --------------------

    # -------- CODE ----------------

    # -------------------

                            if section == "Humor":
                                pagenumber = 0
                                while(1):
                                    pagenumber = pagenumber + 1
                                    url = 'http://somesite/?cat=188&paged=%d' % (pagenumber)
                                    try:
                                        page = urllib2.urlopen(url)
                                        data = page.read()
                                        soup = BeautifulSoup(data)
                                        xa = soup.findAll('a', href=True)

                                        for web in xa:
                                            if web['href'].startswith('http://somesite.cl/foro/f99'):
                                                pr(web['href'])
                                                with open('chicas2.txt', 'a') as f:
                                                    f.write(str(web['href']))
                                    except IOError as e:
                                        break
                                    finally:
                                        page.close()
                                break

    # ------------------------------------

    # MORE CODE

    # ------------------------------------

    print "Finish..."
    raw_input()
    Output, chicas2.txt (duplicated links and no line breaks):

    Code:
    http://somesite/foro/f240/2145088http://somesite/foro/f240/2http://somesite/foro/f240/2144235http://somesite/foro/f240/2144235http://somesite/foro/f240/2143301http://somesite/foro/f240/2143301
    Thanks!!
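    For illustration, both symptoms in chicas2.txt (duplicate links and no line breaks) can be handled inside the inner loop by remembering which links were already written and appending '\n' explicitly. A sketch of just that loop, with a hard-coded list standing in for the hrefs BeautifulSoup returns:

    Code:
    # Sketch of the inner loop only: 'hrefs' stands in for [web['href'] for web in xa].
    hrefs = [
        'http://somesite.cl/foro/f99/2145088/',
        'http://somesite.cl/foro/f99/2145088/',
        'http://somesite.cl/foro/f99/2144235/',
    ]

    written = set()                              # links already saved to the file
    with open('chicas2.txt', 'a') as f:
        for href in hrefs:
            if href.startswith('http://somesite.cl/foro/f99') and href not in written:
                written.add(href)
                f.write(href + '\n')             # one link per line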
#7
    Registered User Nolander21, Devshed Newbie (joined Mar 2013, 12 posts, Rep Power 0)
    This might be a dirty solution, depending on whether the duplicates always appear, but you can try


    PHP Code:
    with open('chicas2.txt', 'a') as f:
        alt = 0
        for web in xa:
            if web['href'].startswith('http://somesite.cl/foro/f99'):
                if alt % 2 == 0:                  # write only the first of each duplicate pair
                    f.write(str(web['href']))
                alt = alt + 1

    I also rearranged your block a little bit by putting everything under the open file statement because presumably you only need to open it once. (Maybe you could even put it outside the 'while' loop?)
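    On the parenthetical suggestion, opening the file once outside the while loop might look roughly like the sketch below. It mirrors the crawl loop from earlier in the thread (so the somesite placeholder still applies) and uses a set instead of the counter:

    Code:
    # Sketch: open the output file once, around the whole crawl loop, and track
    # which links were already written.
    import urllib2
    from BeautifulSoup import BeautifulSoup

    seen = set()
    with open('chicas2.txt', 'a') as f:
        pagenumber = 0
        while True:
            pagenumber = pagenumber + 1
            url = 'http://somesite/?cat=188&paged=%d' % (pagenumber)
            try:
                page = urllib2.urlopen(url)
            except IOError:
                break                            # no more pages
            soup = BeautifulSoup(page.read())
            page.close()
            for web in soup.findAll('a', href=True):
                href = web['href']
                if href.startswith('http://somesite.cl/foro/f99') and href not in seen:
                    seen.add(href)
                    f.write(href + '\n')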
#8
    Contributing User b49P23TIvg, Devshed Demi-God (joined Aug 2011, 4,997 posts, Rep Power 481)
    Just collect a list of all the href's, make a set, then display the set.


    >>> set('aaaaaaaaaabb')
    set(['a', 'b'])

    >>> set(LIST_OF_ALL_MY_URLS)
    set(['unique', 'urls', 'only'])
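    Concretely, that could mean appending every matching href to a list inside the loop and writing the unique set out once at the end. A rough sketch with a stand-in list:

    Code:
    # Stand-in for the list collected while crawling (the real loop would do
    # all_urls.append(web['href']) instead of hard-coding these).
    all_urls = [
        'http://somesite/f99/2074481-funny-pictures-259-a/',
        'http://somesite/f99/2074481-funny-pictures-259-a/',
        'http://somesite/f99/2073493-mas-mas-imagenes-60-a/',
    ]

    unique_urls = set(all_urls)           # duplicates collapse; note: order is not preserved
    with open('chicas2.txt', 'w') as f:   # 'w', since everything is written once at the end
        f.write('\n'.join(sorted(unique_urls)) + '\n')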
    [code]Code tags[/code] are essential for python code and Makefiles!
#9
    Registered User, Devshed Newbie (joined Mar 2013, 16 posts, Rep Power 0)
    Originally Posted by Nolander21
    This might be a dirty solution, depending on whether the duplicates always appear, but you can try


    PHP Code:
    with open('chicas2.txt', 'a') as f:
        alt = 0
        for web in xa:
            if web['href'].startswith('http://somesite.cl/foro/f99'):
                if alt % 2 == 0:                  # write only the first of each duplicate pair
                    f.write(str(web['href']))
                alt = alt + 1

    I also rearranged your block a little bit by putting everything under the open file statement because presumably you only need to open it once. (Maybe you could even put it outside the 'while' loop?)
    That works! But the links are written without newlines. Any idea?

    Originally Posted by b49P23TIvg
    Just collect a list of all the href's, make a set, then display the set.


    >>> set('aaaaaaaaaabb')
    set(['a', 'b'])

    >>> set(LIST_OF_ALL_MY_URLS)
    set(['unique', 'urls', 'only'])
    How can I use this? Can you explain it to me??
#10
    Registered User Nolander21, Devshed Newbie (joined Mar 2013, 12 posts, Rep Power 0)
    PHP Code:
    with open('chicas2.txt', 'a') as f:
        alt = 0
        for web in xa:
            if web['href'].startswith('http://somesite.cl/foro/f99'):
                if alt % 2 == 0:                  # write only the first of each duplicate pair
                    f.write(str(web['href']) + '\n')
                alt = alt + 1
#11
    Registered User, Devshed Newbie (joined Mar 2013, 16 posts, Rep Power 0)
    Originally Posted by Nolander21
    PHP Code:
    with open('chicas2.txt', 'a') as f:
        alt = 0
        for web in xa:
            if web['href'].startswith('http://somesite.cl/foro/f99'):
                if alt % 2 == 0:                  # write only the first of each duplicate pair
                    f.write(str(web['href']) + '\n')
                alt = alt + 1
    I'm so stupid, I had already tried that and it had not worked for me. Anyway, thanks!
