#1
    Registered User
    Devshed Newbie (0 - 499 posts)
    Join Date: Jul 2015
    Posts: 2
    Rep Power: 0

    Web scraping code query


    Hi there,

    I'd be grateful for some advice. I've come across some Python code that scrapes football player statistics from the Premier League website, tidies them up and dumps them into a CSV file. The code appears to run smoothly, but the data isn't written to the CSV file; instead an error file is created that lists every player number I'm trying to obtain data for. The code is below. Can anyone offer some advice as to where it's going wrong? Many thanks, Rob.

    Code:
    import json
    import urllib
    import re
    import io
    from bs4 import BeautifulSoup
    import os
    
    #####################################################################
    #	Extract all of the player based information from the website	#
    #####################################################################
    
    i = 1
    #Create the player history file, emptying any previous contents
    myfile = open("player_history.txt", "w")
    myfile.close()
    
    #Create a file to record the numbers for which there were errors
    errfile = open("errfile.txt", "w")
    errfile.close()
    
    #Website from which to scrape
    while i < 700:
    	htmltext = urllib.urlopen("http://fantasy.premierleague.com/web/api/elements/" + str(i) + "/")
    
    	#Use a try-except block to ignore htmls that do not relate to players
    	try:
    		#Use the json command to read in the json file
    		data = json.load(htmltext)
    		#Extract the score history from the json file
    		scoredata = data["fixture_history"]["all"]
    		#Extract the player names
    		playerdata = data["first_name"] + " " + data["second_name"]
    		#Extract player team
    		teamname = data["team_name"]
    		#Extract player position
    		position = data["type_name"]
    		#Extract the players price
    		price = data["event_cost"]
    		#Percentage selected
    		selected = data["selected_by"]
    		#Open the file with io.open and encoding='utf8' to handle irregular characters
    		myfile = io.open("player_history.txt", "a", encoding='utf8')
    		
    		#Append the data to the file
    		for datapoint in scoredata:
    			mystring = str(datapoint)
    			#Clean the data strings
    			mystring1 = mystring.replace("[", "")
    			mystring2 = mystring1.replace("u'", "")
    			mystring3 = mystring2.replace("]", "")
    			mystring4 = mystring3.replace("'", "")
    			#Write the data to the file
    			myfile.write(mystring4 + "," + playerdata + "," + teamname + "," + position + "," + selected + "," + str(price) + ',' + str(i)  + "\n")
    	except:
    		#Write each number for which there were errors to the error file
    		errfile = open("errfile.txt", "a")
    		errfile.write(str(i) + "\n")
    		pass
    		
    	print i
    	i += 1
    
    print "Player data scraped"
#2
    Contributing User
    Devshed Newbie (0 - 499 posts)
    Join Date: Jul 2007
    Location: Joensuu, Finland
    Posts: 471
    Rep Power: 71
    You have a funny way of using files: first you open them, then immediately close them, then open and close for each read or write.

I would move all open() calls to before the “while” loop and close() each file after the loop; at the moment the files are needlessly re-opened inside the loop.
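
Roughly like this, for instance (untested, and keeping your Python 2 calls). Printing the exception inside the except block will also tell you why a number ends up in errfile.txt, instead of the real error being silently swallowed:

Code:
import json
import urllib
import io

#Open both output files once, before the loop
myfile = io.open("player_history.txt", "w", encoding='utf8')
errfile = open("errfile.txt", "w")

i = 1
while i < 700:
	#The URL changes with i, so the request itself stays inside the loop
	htmltext = urllib.urlopen("http://fantasy.premierleague.com/web/api/elements/" + str(i) + "/")
	try:
		data = json.load(htmltext)
		scoredata = data["fixture_history"]["all"]
		playerdata = data["first_name"] + " " + data["second_name"]
		teamname = data["team_name"]
		position = data["type_name"]
		price = data["event_cost"]
		selected = data["selected_by"]
		for datapoint in scoredata:
			#Strip the list formatting as before
			mystring = str(datapoint).replace("[", "").replace("]", "").replace("u'", "").replace("'", "")
			myfile.write(mystring + "," + playerdata + "," + teamname + "," + position + "," + selected + "," + str(price) + "," + str(i) + "\n")
	except Exception as e:
		#Print the exception so you can see why this number failed
		print i, e
		errfile.write(str(i) + "\n")
	i += 1

myfile.close()
errfile.close()
print "Player data scraped"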

    My armada: Debian GNU/Linux 8 (desktop, home laptop, work laptop), Raspbian GNU/Linux 8 (nameserver), Ubuntu 14.04.3 LTS (HTPC), PC-BSD 10.2 (testbed), Android 4.2.1 (tablet)
#3
    Registered User
    Devshed Newbie (0 - 499 posts)
    Join Date: Jul 2015
    Posts: 2
    Rep Power: 0
    Originally Posted by SuperOscar
    You have a funny way of using files: first you open them, then immediately close them, then open and close for each read or write.

    I would move all open() calls to before the “while” loop and close() each file after the loop; at the moment the files are needlessly re-opened inside the loop.
    Thanks very much for your reply. I'm a complete novice at this but will certainly give your advice a try. The issue seems to me to be related to the way the data is either appended or written to the csv file.
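
Out of interest, would something like the csv module be a better way of writing the rows than building the comma-separated string by hand? This is just a rough sketch of what I mean (not tested):

Code:
import csv

#Rough sketch only: let the csv module handle the commas and quoting
#("wb" because the csv module on Python 2 wants the file opened in binary mode)
outfile = open("player_history.csv", "wb")
writer = csv.writer(outfile)

def to_utf8(value):
	#csv.writer on Python 2 expects byte strings, so encode any unicode fields
	if isinstance(value, unicode):
		return value.encode('utf8')
	return value

#Inside the scraping loop it would then be one row per fixture, something like:
#	writer.writerow([to_utf8(v) for v in datapoint] +
#		[to_utf8(playerdata), to_utf8(teamname), to_utf8(position),
#		to_utf8(selected), price, i])

outfile.close()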

    Thanks again,
    Rob
