July 20th, 2012, 07:15 AM
Screenscraping an authenticated site.
We are using an implementation of Gomez here at work, and for some of the configurations you have to manually copy and paste them into a backup document. Since we can't always remember to grab the configs after every change, we want to automate this. I put together the Python script below to pull the page and save it to a file. With HTTP debugging turned on, I can see that the request runs once and gets an HTTP 401 Unauthorized response, then runs a second time with the headers I supplied. Why does it run once before including my headers?
Let me know if I need to supply anything else.
import mechanize
import cookielib

url = "http://xxxx.com/atscon"
filename = "Page.html"

br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)

# Set headers -- each assignment to br.addheaders replaces the whole list,
# so build it once. (Host and connection headers are handled automatically.)
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1'),
    ('Accept-Encoding', 'gzip, deflate'),
    ('Accept-Language', 'en-us,en;q=0.5'),
    ('Authorization', 'xxxx'),
    ('Cookie', 'xxxx'),
    ('Referer', 'http://xxxx.com/atscon'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]

# Add credentials for the site
br.add_password('http://xxxx.com', 'xxxx', 'xxxx')

# Fetch the page
br.open(url)
pagedata = br.response().read()

# Write to file
FILE = open(filename, "w")
FILE.write(pagedata)
FILE.close()
print "File has been output to %s" % filename
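For reference, the 'xxxx' value in the Authorization header is a placeholder. Assuming the site uses HTTP Basic auth (which is what add_password implies), the real header value would be built from the username and password like this (again with placeholder credentials, not the real ones):

```python
import base64

# Placeholder credentials, matching the xxxx placeholders above
user, password = 'xxxx', 'xxxx'

# Basic auth sends "Basic " followed by base64("user:password")
token = base64.b64encode(('%s:%s' % (user, password)).encode('ascii')).decode('ascii')
auth_header = ('Authorization', 'Basic ' + token)
```
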