
July 20th, 2012, 06:15 AM
Screenscraping an authenticated site.
We use an implementation of Gomez at work, and some of its configurations can only be backed up by manually copying and pasting them into a backup document. We can't always remember to grab the configs after every change, so we want to automate this. I put together this Python script to pull the page and save it to a file. With HTTP debugging turned on, I can see that the first request gets an HTTP 401 Unauthorized response, and then a second request is sent with the headers I supplied. Why does it send one request before including my headers?
Code:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
#Filename Declaration
filename = "Page.html"
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(False)
br.set_handle_gzip(True)
br.set_handle_redirect(False)
br.set_handle_referer(False)
br.set_handle_robots(False)
# Follow refresh 0, but don't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)
# Set headers. addheaders is a single list, so it is assigned once here;
# assigning it repeatedly would replace the previous value each time.
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1'),
    ('Transfer-Encoding', 'chunked'),
    ('Accept-Encoding', 'gzip, deflate'),
    ('Accept-Language', 'en-us,en;q=0.5'),
    ('Authorization', 'xxxx'),
    ('Cookie', 'xxxx'),
    ('Host', 'xxxx.com'),
    ('Proxy-Connection', 'keep-alive'),
    ('Referer', 'http://xxxx.com/atscon'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]
# Add credentials for HTTP authentication
br.add_password('http://xxxx.com', 'xxxx', 'xxxx')
br.open('http://xxxxxx/LocImpExp?op=export')
pagedata = br.response().read()
# Write to file (the with block closes the file even if the write fails)
with open(filename, "w") as f:
    f.write(pagedata)
print "File has been output to %s" % filename
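One possible reason for the two requests, offered as a sketch rather than a confirmed diagnosis: HTTP Basic authentication is normally challenge-driven, so a client that only has credentials registered through add_password sends the first request without them, receives the 401, and then retries with an Authorization header. Assuming the site really uses Basic auth (it may use Digest or something else), the header value can be computed up front. The helper name basic_auth_header and the "user"/"pass" values below are hypothetical placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Return an ('Authorization', 'Basic ...') tuple for the given credentials.

    Hypothetical helper: assumes the server accepts HTTP Basic auth.
    """
    # Basic auth is base64("username:password") prefixed with "Basic ".
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return ("Authorization", "Basic " + token)


# Placeholder credentials, for illustration only.
header = basic_auth_header("user", "pass")
print(header[1])
```

A tuple like this could be included in the br.addheaders list so the credentials go out with the first request instead of only after the 401 challenge.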
Let me know if I need to supply anything else.
Thanks