#1
    MichSong
    Registered User
    Join Date: Jan 2014
    Posts: 12

    Question: Scrapy - HTTP authorization


    Hello there,

    I'm trying to scrape websites that use HTTP basic authorization. I tried the middleware approach and overriding start_requests, but no luck.

    Has anyone with experience of this got any ideas? Thank you in advance.
    P.S. I'm pretty sure there's no user-agent problem.

    Code:
    import scrapy
    from w3lib.http import basic_auth_header
    import sys                      # So Chinese characters can be exported (Python 2)
    reload(sys)                     # So Chinese characters can be exported (Python 2)
    sys.setdefaultencoding('utf8')  # So Chinese characters can be exported (Python 2)

    class MyxmlSpider(scrapy.Spider):
        name = "PXML"

        def start_requests(self):
            # Read the start URLs from a local file, one URL per line
            with open("Batch.txt") as f:
                start_urls = [url.strip() for url in f.readlines()]
            # Build a Basic auth header and send it with every request
            auth = basic_auth_header("TSn@xxx.com", "cks!")
            for url in start_urls:
                yield scrapy.Request(url=url, callback=self.parse,
                                     headers={'Authorization': auth})

        def parse(self, response):
            myItems = ['B', 'BD', 'ND', 'NE', 'SD', 'SI', 'SJ', 'SK', 'SR', 'TF', 'TP', '0', '1', '12']
            # Rest of the code...
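    For reference, here is a minimal sketch of the middleware route mentioned above, assuming the same Batch.txt file and the placeholder credentials from the post: Scrapy's built-in HttpAuthMiddleware reads http_user and http_pass attributes on the spider (newer Scrapy versions also look at http_auth_domain) and attaches the Authorization header itself, so the requests don't need to carry it explicitly. The spider name here is hypothetical.
    Code:
    import scrapy

    class MyxmlAuthSpider(scrapy.Spider):
        # Hypothetical variant of the spider above that relies on Scrapy's
        # built-in HttpAuthMiddleware instead of building the header by hand.
        name = "PXML_auth"
        http_user = "TSn@xxx.com"  # same placeholder credentials as in the post
        http_pass = "cks!"

        def start_requests(self):
            # One URL per line, same input file as the original spider
            with open("Batch.txt") as f:
                for url in (line.strip() for line in f if line.strip()):
                    # No Authorization header set here; the middleware adds it.
                    yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            pass  # Rest of the code...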
    Last edited by MichSong; March 28th, 2017 at 04:23 AM. Reason: Add post icon
#2
    MichSong
    Registered User
    Join Date: Jan 2014
    Posts: 12
    Originally Posted by MichSong
    [post #1 quoted above]
    Solved. I was wrong: it was about the user-agent and redirection after all.

    Thanks,
    Mich
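
    For anyone hitting the same symptom, here is a minimal sketch of the two things the fix points at, assuming a spider shaped like the one above: sending an explicit browser-like User-Agent via custom_settings, and inspecting the redirect chain that Scrapy's RedirectMiddleware records in response.meta['redirect_urls']. The spider name and user-agent string are only examples.
    Code:
    import scrapy
    from w3lib.http import basic_auth_header

    class MyxmlFixedSpider(scrapy.Spider):
        # Hypothetical sketch: Basic auth header plus an explicit User-Agent,
        # with logging of any redirects the site performs.
        name = "PXML_fixed"
        custom_settings = {
            # Example browser-like user agent; adjust to whatever the site accepts.
            "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        }

        def start_requests(self):
            auth = basic_auth_header("TSn@xxx.com", "cks!")  # placeholder credentials
            with open("Batch.txt") as f:
                for url in (line.strip() for line in f if line.strip()):
                    yield scrapy.Request(url=url, callback=self.parse,
                                         headers={'Authorization': auth})

        def parse(self, response):
            # RedirectMiddleware stores the URLs it followed to reach this response.
            redirects = response.meta.get('redirect_urls', [])
            if redirects:
                self.logger.info("Redirected via %s to %s", redirects, response.url)
            # Rest of the code...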
