#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Location
    United Kingdom
    Posts
    5
    Rep Power
    0

    Easier way to accomplish this?


    I'm a Python newbie (learning Python after having learnt C#). What I'm trying to do is create two scripts. One downloads the webpage information, and the other downloads the links and outputs a list of them with a summary of the total number of links found.

    First script (Download webpage)
    Code:
    import sys, urllib
    def getWebpage(url):
        print '[*] getWebpage()'
        url_file = urllib.urlopen(url)
        page = url_file.read()
        return page
    def main():
        sys.argv.append('http://www.funeralformyfat.tumblr.com')
        if len(sys.argv) != 2:
            print '[-] Usage: webpage_get URL'
            return
    
    print getWebpage(sys.argv[1])  # <---- IndexError: list index out of range
    
    if __name__ == '__main__':
        main()
    Second script (downloads links and outputs a summary of the total number of links downloaded into a list.)
    Code:
    def print_links(page):
        print '[*] print_links()'
        links = re.findall(r'\<a.*href\=.*http\:.+', page)
        links.sort()
        print '[+]', str(len(links)), 'HyperLinks Found:'
    
    for link in links:  # THIS LINE THROWS UP AN ERROR (NameError: name 'links' is not defined)
        print link
    
    def main():
        sys.argv.append('http://www.funeralformyfat.tumblr.com')
        if len(sys.argv) != 2:
            print '[-] Usage: webpage_getlinks URL'
            return
            page = webpage_get.wget(sys.argv[1])
            print_links(page)
    
    from os.path import join
    
    directory = join('/home/', y, '/newdir/')
    file_name = url.split('/')[-1]
    file_name = join(directory, file_name)
    
    
    if __name__ == '__main__':
        main()
    Is there an easier way to do this, or can anyone pinpoint how to fix my errors? I'm using the IDLE GUI with Python 2.7. I'm still new and looking to learn all I can, so if anyone can help I'm forever in your debt.

    Natalie
  2. #2
  3. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2013
    Location
    Saint-Petersburg, Russia
    Posts
    236
    Rep Power
    28
    It's nice you (unlike many) surrounded your code with code tags.

    print getWebpage(sys.argv[1]) <---- IndexError: list index out of range
    But you did not say how you call this script. It looks like something should be passed on the command line after the script name!
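
    A minimal sketch of what "passed on the command line" means here. The helper name parse_url is an assumption for illustration and is not part of the original scripts; print is written with parentheses and a single argument so it behaves the same on Python 2.7 and 3.

```python
import sys


def parse_url(argv):
    """Return the URL from an argv-style list, or None if the count is wrong.

    parse_url is a hypothetical helper, not part of the original scripts.
    """
    # argv[0] is the script name, so exactly one argument means len(argv) == 2.
    if len(argv) != 2:
        return None
    return argv[1]


if __name__ == '__main__':
    url = parse_url(sys.argv)
    if url is None:
        print('[-] Usage: webpage_get URL')
    else:
        print('[*] URL given: %s' % url)
```

    Started from a shell with a URL after the script name, sys.argv has two entries. Started with nothing after the script name, only sys.argv[0] exists, and sys.argv[1] raises the IndexError.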

    for link in links: THIS LINE THROWS UP AN ERROR ( NameError: name 'links'
    It looks like you unindented here, so Python thinks the function body has ended and the "for" is in global scope, not inside the function (where links was defined).
    CodeAbbey - programming problems for novice coders
  4. #3
  5. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,894
    Rep Power
    481
    Category: simplification.
    In the first script, the return statement in the main function doesn't do anything that the function wouldn't do on its own. Remove it; main will still return None.


    Category: fix an error.
    In the first script, I guess that you ran the program with no arguments. sys.argv[0] contains the invocation, and in that case index 0 is the only valid index of the sys.argv list. From your C# experience you expected
    Code:
    if __name__ == '__main__':
        main()
    to run first, and for main to display the usage message and return you to the operating system shell. I was similarly disoriented as a Python beginner, and for an embarrassingly long time as well. Python executes the module statements in order, doing exactly
    • import sys and urllib
    • define getWebpage
    • define main
    • execute "print getWebpage(sys.argv[1])"
    which raises the exception. Python never reached "if __name__ ..."
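
    The top-to-bottom order can be seen in a small sketch. The list name order is an assumption used only to record which statement ran when:

```python
# Each top-level statement runs as soon as Python reaches it.
order = []

order.append('module top')        # runs immediately


def main():
    # The body only runs when main() is called.
    order.append('inside main')


order.append('after def main')    # runs right after main is defined, before any call

main()
order.append('module end')

print(order)
```

    The "if __name__ == '__main__':" test is just another top-level statement, so anything written above it at module level has already run by the time Python gets there.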


    Category: fix an error.
    In the second script, and you might now understand the problem, Python executes the script from the top down:
    • defines but does not call print_links. That is, it inserts the name print_links into the module's global namespace and associates it with the compiled byte code of the definition. After the definition, only the "print_links" identifier exists in the module's global namespace; the names used inside the function body, such as links, do not.
    • executes the statement "for link in links:", which should now be an obvious problem.


    Category: fix an error.
    You haven't arrived here yet. The name "links" defined in function print_links is local. Even after print_links has executed you'll still have the same error. The global statement changes a variable's scope.

    Avoid global variables when practical. Passing data as function arguments and return statements or exploiting classes are alternatives to global variables.
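
    A sketch of the return-value alternative, assuming a hypothetical find_links helper and a simplified pattern that only matches double-quoted href attributes. Neither is from the original scripts, and a regular expression is only a rough tool for HTML:

```python
import re


def find_links(page):
    """Return a sorted list of http(s) URLs found in href attributes.

    find_links and the pattern below are illustrative assumptions.
    """
    links = re.findall(r'href="(https?://[^"]+)"', page)
    links.sort()
    return links  # hand the local list back to the caller


def print_links(page):
    links = find_links(page)  # the caller receives the list as a return value
    print('[+] %d HyperLinks Found:' % len(links))
    for link in links:  # indented, so the loop is part of print_links
        print(link)


sample = '<a href="http://b.example/">B</a> <a href="http://a.example/">A</a>'
print_links(sample)
```

    No global statement is needed: each function keeps its own local "links" and the data moves between them through the return value.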
    [code]Code tags[/code] are essential for python code and Makefiles!
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Location
    United Kingdom
    Posts
    5
    Rep Power
    0
    Hi, thanks for the reply. I've updated my code and my getWebpage script is working, but my links script doesn't appear to be working. Could you take a look and see if you can tell why? It runs, but it doesn't print any links, and I can't figure out why.

    Webpage code
    Code:
    import sys, urllib
    def getWebpage(url):
        print '[*] getWebpage()'
        url_file = urllib.urlopen(url)
        page = url_file.read()
        return page
    def main():
        sys.argv.append('http://www.funeralformyfat.tumblr.com')
        if len(sys.argv) != 2:
            print '[-] Usage: webpage_get URL'
            return
        else:
            print getWebpage(sys.argv[1])
    
    if __name__ == '__main__':
        main()
    Links code
    Code:
    import sys, urllib
    def print_links(page):
        print '[*] print_links()'
        links = re.findall(r'\<a.*href\=.*http\:.+', page)
        links.sort()
        print '[+]', str(len(links)), 'HyperLinks Found:'
    
        for link in links:
            print link
        
    def main():
        sys.argv.append('http://www.funeralformyfat.tumblr.com')
        if len(sys.argv) != 2:
            print '[-] Usage: webpage_links URL'
            return
            page = webpage_get.getWebpage(sys.argv[1])
            print_links(page)
    
            
    if __name__ == '__main__':
        main()
    Thanks
  8. #5
  9. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,894
    Rep Power
    481
    Maybe you intended
    Code:
    def main():
        sys.argv.append('http://www.funeralformyfat.tumblr.com')
        if len(sys.argv) != 2:
            print '[-] Usage: webpage_links URL'
            return
        page = webpage_get.getWebpage(sys.argv[1])
        print_links(page)
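
    The only change from the earlier version is the indentation of the last two lines. A sketch of why that matters, using hypothetical function names and a list in place of the real download so it runs without a network:

```python
def check_indented(argv):
    calls = []
    if len(argv) != 2:
        calls.append('usage')
        return calls
        calls.append('fetch')  # same indent as return: never reached
    return calls


def check_dedented(argv):
    calls = []
    if len(argv) != 2:
        calls.append('usage')
        return calls
    calls.append('fetch')      # one level out: runs when the count is right
    return calls


good = ['webpage_links', 'http://example.com']
print(check_indented(good))  # prints []
print(check_dedented(good))  # prints ['fetch']
```

    With the right number of arguments the indented version skips the whole if block, including the lines after return, so it finishes without doing anything and without raising an error.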
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Location
    United Kingdom
    Posts
    5
    Rep Power
    0
    Hi, thanks for the reply, but now I'm getting a completely different error. Sorry if I sound stupid; it's new to me and I'm still trying to get to grips with it all.

    The error I now get is NameError: global name 'webpage_get' is not defined. I think the issue is that it's not picking up my first script, but I'm unsure how to resolve this.
  12. #7
  13. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,894
    Rep Power
    481
    I've vacillated between advising you to work through the Python tutorial and encouraging you to learn by trial of your own program. We'll give this latter approach another try.

    You need to import the module, just like you did for sys and urllib.
    Code:
    import sys, urllib
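
    A self-contained sketch of importing your own file as a module. It assumes the first script is saved as webpage_get.py; the temporary directory and the stub body written into it are stand-ins so the example runs without a network and without your files:

```python
import os
import sys
import tempfile

# Write a tiny stand-in for the first script so this example is self-contained.
# The stub body is an assumption; the real getWebpage downloads the page.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'webpage_get.py'), 'w') as handle:
    handle.write("def getWebpage(url):\n"
                 "    return '<a href=\"%s\">stub</a>' % url\n")

# Python finds modules in the directories listed in sys.path; the directory
# of the script being run is normally there already.
sys.path.insert(0, module_dir)

import webpage_get  # the file name without the .py extension

page = webpage_get.getWebpage('http://example.com')
print(page)
```

    If both scripts sit in the same folder, the sys.path line should not be needed; adding webpage_get to the existing import line is likely enough. The links script also calls re.findall, so it needs re imported as well.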
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Location
    United Kingdom
    Posts
    5
    Rep Power
    0
    Fixed the issue. Thanks for your help
