Python Programming
 
#1
July 20th, 2012, 06:15 AM
ahnakel
Screenscraping an authenticated site.

We use an implementation of Gomez here at work, and for some of the configurations you have to go in manually and copy and paste them into a backup document. We can't always remember to grab the configs after every change, so we want to automate this. I put this Python script together to pull the page and save it to a file. I have HTTP debugging turned on, and when I run the script the output shows that it first makes a request that gets an HTTP 401 Unauthorized error, then makes a second request with the headers I have supplied. Why does it run once before including my headers?

Code:
import mechanize
import cookielib

# Browser
br = mechanize.Browser()

# Filename declaration
filename = "Page.html"

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(False)
br.set_handle_gzip(True)
br.set_handle_redirect(False)
br.set_handle_referer(False)
br.set_handle_robots(False)

# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Want debugging messages?
br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)

# Set headers in a single list; assigning br.addheaders repeatedly
# keeps only the last assignment
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1'),
    ('Transfer-Encoding', 'chunked'),
    ('Accept-Encoding', 'gzip, deflate'),
    ('Accept-Language', 'en-us,en;q=0.5'),
    ('Authorization', 'xxxx'),
    ('Cookie', 'xxxx'),
    ('Host', 'xxxx.com'),
    ('Proxy-Connection', 'keep-alive'),
    ('Referer', 'http://xxxx.com/atscon'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]

# Add credentials for HTTP authentication, then open the page
br.add_password('http://xxxx.com', 'xxxx', 'xxxx')
br.open('http://xxxxxx/LocImpExp?op=export')

pagedata = br.response().read()

# Write to file; the with block closes the file even if the write fails
with open(filename, "w") as f:
    f.write(pagedata)
print "File has been output to %s" % filename


Let me know if I need to supply anything else.
Thanks
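
A possible explanation, sketched without mechanize so it runs on its own: `addheaders` is an ordinary list attribute, so assigning it several times in a row keeps only the last assignment, and an `Authorization` header set in an earlier assignment would never be sent. As far as I understand, `add_password` only supplies credentials in response to a 401 challenge, so an unauthenticated first request followed by an authenticated retry is what you would expect unless a valid `Authorization` header goes out with the first request. The class `FakeBrowser`, the helper `basic_auth_value`, and the credentials below are hypothetical stand-ins, and Basic authentication is an assumption about the site.

```python
import base64


class FakeBrowser(object):
    """Hypothetical stand-in for an object with an `addheaders` list."""

    def __init__(self):
        self.addheaders = []


def basic_auth_value(user, password):
    """Build the value of an HTTP Basic Authorization header (assumed scheme)."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


# Repeated assignment: each line replaces the list, so only the last survives.
overwritten = FakeBrowser()
overwritten.addheaders = [("User-agent", "Mozilla/5.0")]
overwritten.addheaders = [("Accept-Language", "en-us,en;q=0.5")]

# Single assignment: every header is kept, including Authorization.
combined = FakeBrowser()
combined.addheaders = [
    ("User-agent", "Mozilla/5.0"),
    ("Accept-Language", "en-us,en;q=0.5"),
    ("Authorization", basic_auth_value("user", "secret")),  # placeholder credentials
]

print(len(overwritten.addheaders))
print(len(combined.addheaders))
```

If the site does use Basic authentication, sending a header built this way with the first request might avoid the initial 401, but that depends on what the server actually expects.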
