#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    5
    Rep Power
    0

    Webcrawler Problem


    I am currently learning Python and I want to use everything I've learned about GUI programming to create an interface for a webcrawler.
    It was written in Python 3.2 as a tutorial on how to create a webcrawler. Unfortunately it doesn't work correctly (anymore?).

    No matter what url and keyword I use, it always returns:
    **Success!**
    Word never found

    It doesn't continue to scrape pages, and even if I am certain the word is on the page, entering that word still gives the same message.

    I don't know how to resolve this; I think the problem has something to do with this loop, but I'm not sure.
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:

    Does anybody see the problem?


    source code:
    Code:
    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib import parse
    
    class LinkParser(HTMLParser):
    
        def handle_starttag(self, tag, attrs):
            if tag == "a" :
                for (key, value) in attrs:
                    if key == "h":
                        newUrl = parse.urljoin(self.baseUrl, value)
                        self.links = self.links + [newUrl]
    
        def getLinks(self, url):
            self.links = []
            self.baseUrl = url
            response = urlopen(url)
            if response.getheader("Content-type") == "text\html":
                htmlBytes = response.read()
                htmlString = htmlBytes.decode("utf-8")
                self.feed(htmlString)
                return htmlString, self.links
            else:
                return "", []
    
    def spider(url, word, maxPages):
        pagesToVisit = [url]
        numberVisited = 0
        foundWord = False
        while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
            numberVisited = numberVisited + 1
            url = pagesToVisit[0]
            pagesToVisit = pagesToVisit[1:]
            try:
                print(numberVisited, "Visiting:", url)
                parser = LinkParser()
                data, links = parser.getLinks(url)
                if data.find(word) >-1:
                    foundWord = True
                pagesToVisit = pagesToVisit + links
                print(" ** Success! ** ")
            except:
                print(" ** Failed! ** ")
        if foundWord:
            print("The word", word, "was found at", url)
        else:
            print("Word never found")
    Tutorial: http://www.netinstructions.com/2011/09/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/#comment-263
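    A minimal sketch of why the Content-type check in the posted code can never succeed (this is an editorial illustration, not part of the thread): the string literal `"text\html"` contains a backslash, since `\h` is not a recognized escape sequence, so it never equals the real header value, and real servers typically append a charset parameter as well.

    ```python
    # "\h" is not a valid escape, so the literal keeps a real backslash:
    # the string is text\html, which can never equal "text/html".
    broken = "text\html"
    assert broken != "text/html"
    assert "\\" in broken

    # Even with the slash fixed, servers usually send parameters, e.g.
    # "text/html; charset=utf-8", so an exact == comparison still fails;
    # a substring test is the robust check.
    header = "text/html; charset=utf-8"
    assert header != "text/html"
    assert "text/html" in header
    ```

    With the check failing, `getLinks` always returns `("", [])`, `"".find(word)` is `-1`, no links are queued, and yet `** Success! **` prints because that line runs unconditionally inside the `try` block.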
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,855
    Rep Power
    481
    What was the source code that worked? Don't know? Use a revision control system!
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    5
    Rep Power
    0
    Originally Posted by b49P23TIvg
    What was the source code that worked? Don't know? Use a revision control system!
    The source code I posted above was the exact code from the tutorial.
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,855
    Rep Power
    481
    Sorry, I responded to your parenthetical "used to work?".

    Maybe the tutorial is antique.
    Code:
        def getLinks(self, url):
            self.links = []
            self.baseUrl = url
            response = urlopen(url)
            if "text/html" in response.getheader("Content-type"):##################### CHANGED THIS LINE
                htmlBytes = response.read()
                htmlString = htmlBytes.decode("utf-8")
                self.feed(htmlString)
                return htmlString, self.links
            else:
                return "", []

    spider('http://www.dreamhost.com','bloggers',200)
    1 Visiting: http://www.dreamhost.com
    ** Success! **
    The word bloggers was found at http://www.dreamhost.com
    [code]Code tags[/code] are essential for python code and Makefiles!
  8. #5
  9. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,855
    Rep Power
    481
    Look at the freakin' data!
    [code]Code tags[/code] are essential for python code and Makefiles!
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    5
    Rep Power
    0
    It now checks whether the word is on the page of the URL entered, but it doesn't check whether the word is on linked pages. Could you fix this?
  12. #7
  13. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,855
    Rep Power
    481
    Probably.
    [code]Code tags[/code] are essential for python code and Makefiles!
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    5
    Rep Power
    0
    Could you please do it, or point me in the right direction?
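    A likely explanation for the crawler not following links (an editorial sketch, assuming the posted code matches what is being run): `handle_starttag` tests `key == "h"`, but anchor attributes are named `href`, so `self.links` stays empty and there is never anything to visit after the first page. A corrected parser, trimmed to just the link collection:

    ```python
    from html.parser import HTMLParser
    from urllib import parse

    class LinkParser(HTMLParser):
        """Collects absolute URLs from <a href="..."> tags."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for key, value in attrs:
                    # The posted code tests key == "h", which never matches;
                    # the attribute name is "href".
                    if key == "href":
                        self.links.append(parse.urljoin(self.base_url, value))

    parser = LinkParser("http://example.com/")
    parser.feed('<p><a href="/about">About</a> <a href="news.html">News</a></p>')
    # parser.links -> ['http://example.com/about', 'http://example.com/news.html']
    ```

    With `href` matched, `getLinks` returns a non-empty link list and the `pagesToVisit` queue in `spider` actually grows, so the crawl continues past the first page.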
