#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    3
    Rep Power
    0

    Data conversion issue


    I'm in need of help for the following code. I have been debugging my script and I have narrowed down my issue to a few lines of code that I believe to be causing my problem. I'm reading in data from 3 csv files, extracting data from a sproc in SQL Server, and exporting the data from the source files and database to draw comparison and validation line by line. The script matches the database on th district_key and if there is any missing data (district_name) it provides that data. The problem I'm getting is my source file is producing duplicates (one row from each source file). I'm having string conversion issues wth the district_codes in two of the files (teacher_frame, financial_frame). In order to match the district_key in the db I concatenate district_code and distrct_type_code. This should have a lengt of 6 but produces 3 different values (012396, 1.02396, 12396) for each frame. Please help...

    Code:
    parser=argparse.ArgumentParser(description='Process a data audit about a state',prefix_chars='-+')
    #add state, year, enrollment file, teacher file, and financial file arguments
    parser.add_argument('-s',help='Specify the state for the data audit')
    parser.add_argument('-y',help='Specify the year for the data audit')
    #parser.add_argument('-d',help='Specify length of the district code')
    parser.add_argument('-e',help='Specify the enrollment source file')
    parser.add_argument('-t',help='Specify the teacher source file')
    parser.add_argument('-f',help='Specify the financial source file')
    args=parser.parse_args()
    #receive the results of the command line arguments: state, year, enrollment file, teacher file and financial file arguments 
    state=args.s
    year=args.y
    #districtformat=args.d
    enrollment=args.e
    teacher=args.t
    financial=args.f
    #strip all extra spaces of entered data
    state=state.strip()
    year=year.strip()
    #districtformat=districtformat.strip()
    enrollment=enrollment.strip()
    teacher=teacher.strip()
    financial=financial.strip()
    
    #districtformat=int(districtformat)
    
    print enrollment, teacher, financial
    
    #read in enrollment data into pandas dataframe
    df2=pd.read_csv(enrollment,dtype={'district_code':object})
    #read in teacher data into pandas dataframe
    teacherframe=pd.read_csv(teacher,dtype={'district_code':object})
    #read in financial data into pandas dataframe
    financialframe=pd.read_csv(financial,dtype={'district_code':object})
    
    df2orig=df2
    
    #filter files by year
    df2=df2[df2['year'].isin([int(year)])]
    teacherframe=teacherframe[teacherframe['year'].isin([int(year)])]
    financialframe=financialframe[financialframe['year'].isin([int(year)])]
    
    dfyearfound=len(df2)!=0
    teacheryearfound=len(teacherframe)!=0
    financialyearfound=len(financialframe)!=0
    
    ##########
    #Make sure files contain both district names and district codes
    #If not, match with values in other files 
    
    #strip extra spaces from the district names
    if dfyearfound:
    	df2['district_name']=df2['district_name'].str.strip()
    	df2['district_code']=df2['district_code'].str.strip()
    	#df2['district_type_code']=df2['district_type_code'].str.strip()
    	print df2['district_code'][df2.index[0]]
    	#print df2['district_type_code'][df2.index[0]]
    if financialyearfound:
    	financialframe['district_code']=financialframe['district_code'].str.strip()
    	financialframe['district_name']=financialframe['district_name'].str.strip()
    	#financialframe['district_type_code']=financialframe['district_type_code'].str.strip()
    	print financialframe['district_code'][financialframe.index[0]]
    	#print financialframe['district_type_code'][financialframe.index[0]]
    if teacheryearfound:
    	teacherframe['district_name']=teacherframe['district_name'].str.strip()
    	teacherframe['district_code']=teacherframe['district_code'].str.strip()
    	#teacherframe['district_type_code']=teacherframe['district_type_code'].str.strip()
    	print teacherframe['district_code'][teacherframe.index[0]]
    	#print teacherframe['district_type_code'][teacherframe.index[0]]
    
    #convert district codes to strings
    if dfyearfound:
    	df2['district_code']=df2['district_code'].apply(lambda x: str(x))
    	#print df2['district_code'][df2.index[0]]
    	df2['district_type_code']=df2['district_type_code'].apply(lambda x: str(x))
    	#print df2['district_type_code'][df2.index[0]]
    if teacheryearfound:
    	teacherframe['district_code']=teacherframe['district_code'].apply(lambda x: str(x))
    	#print teacherframe['district_code'][teacherframe.index[0]]
    	teacherframe['district_type_code']=teacherframe['district_type_code'].apply(lambda x: str(x))
    	#print teacherframe['district_type_code'][teacherframe.index[0]]
    if financialyearfound:
    	financialframe['district_code']=financialframe['district_code'].apply(lambda x: str(x))
    	#print financialframe['district_code'][financialframe.index[0]]
    	financialframe['district_type_code']=financialframe['district_type_code'].apply(lambda x: str(x))
    	#print financialframe['district_type_code'][financialframe.index[0]]
    
    #Pad extra zeros on district codes
    """
    #function for formating district codes
    def formatDistrictCodes(code):
    	Function that formats district codes.  Pass in a district code which will convert to a string with n digits
    	
    	dist=code
    	.zfill(4)
    	
    	return dist
    	
    	formatDistrictCodes(districtformat) 	
    """	
    #pad extra zeros
    
    if dfyearfound:
    	df2['district_code']=df2['district_code'].apply(lambda x: str(x,))
    	#print df2['district_code'][df2.index[0]]
    if teacheryearfound:
    	teacherframe['district_code']=teacherframe['district_code'].apply(lambda x: str(x,))
    	#print teacherframe['district_code'][teacherframe.index[0]]
    if financialyearfound:
    	financialframe['district_code']=financialframe['district_code'].apply(lambda x: str(x,))
    	#print financialframe['district_code'][financialframe.index[0]]
    
    
    #function for formating district codes
    
    
    	
    #find district name from district code
    districtnamedata={}
    
    #find district code from district name
    districtcodedata={}
    
    ###Get stored procedure data
    #perform connection to database
    cnxn2=connect('DRIVER={SQL Server}; SERVER=; DATABASE=; UID=; PWD='%(state))
    cursor2=cnxn2.cursor()
    cursor2.execute('exec dbo.data_audit_distbydist @year=%d'%(int(year)))
    rows2=cursor2.fetchall()
    
    #Dictionary that maps the stored procedure result to the type
    mapcolnum={'district_code':0, 'dist_key':1,'district_name':2,'school_count':3,'total_ftes':4,'computed_salaries':5,'enrollment_count':6,'revenue_sums':7,'budget_revenue':8,'expense_sums':9,'budget_expense':10,'asset_sums':11,'liab_sums':12,'eq_sums':13, 'other_sums':14}
    
    #construct a dictionaries which will be converted to series of stored procedure results
    sc={}
    tf={}
    cs={}
    ec={}
    rs={}
    br={}
    es={}
    be={}
    au={}
    ls={}
    eqs={}
    other={}
    #TODO ##################### Convert district_code to district_type_code ##########################
    for i in range(len(rows2)):
    	sc[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['school_count']]
    	tf[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['total_ftes']]
    	cs[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['computed_salaries']]
    	ec[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['enrollment_count']]
    	rs[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['revenue_sums']]
    	br[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['budget_revenue']]
    	es[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['expense_sums']]
    	be[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['budget_expense']]
    	au[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['asset_sums']]
    	ls[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['liab_sums']]
    	eqs[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['eq_sums']]
    	other[(rows2[i][mapcolnum['district_name']],rows2[i][mapcolnum['dist_key']])]=rows2[i][mapcolnum['other_sums']]
    
    #index of all district code and district name combinations
    distList=sc.keys()
    newindex=pd.MultiIndex.from_tuples(distList,names=['districtNamesR','districtTypeCodesR'])
    	
    #convert into series to be appended onto finaldataframe
    actualSchoolCount=pd.Series(sc,index=newindex)
    actualFTEs=pd.Series(tf,index=newindex)
    actualComputedSalaries=pd.Series(cs,index=newindex)
    actualEnrollment=pd.Series(ec,index=newindex)
    actualRevenue=pd.Series(rs,index=newindex)
    actualBudgetRevenue=pd.Series(br,index=newindex)
    actualExpense=pd.Series(es,index=newindex)
    actualBudgetExpense=pd.Series(be,index=newindex)
    actualAssetSums=pd.Series(au,index=newindex)
    actualLiabSums=pd.Series(ls,index=newindex)
    actualEquitySums=pd.Series(eqs,index=newindex)
    actualOtherSums=pd.Series(other,index=newindex)
    
    #make standard list of district names to district code as vice versa and store them in Python dictionaries
    for i in range(len(rows2)):
    	districtnamedata[rows2[i][0]]=rows2[i][1]
    	districtcodedata[rows2[i][1]]=rows2[i][0]
    
    	
    def addNamesCodes(testframe,districtnamedata,districtcodedata):
    	""" Function that will correct any missing data such as district names or district codes.  Parameter is a pandas dataframe, dictionaries which map the district names and district codes """
    
    	
    	#contain list of correct district codes and district names
    	districtnames=[]
    	districtkeys=[]
    	#Non matches
    	fdistrictnames=[]
    	fdistrictkeys=[]
    	#fill empty values in names and codes
    	
    	testframe['district_name']=testframe['district_name'].apply(lambda x: str(x))
    	testframe['district_name']=testframe['district_name'].fillna('')
    	testframe['district_code']=testframe['district_code'].fillna('')
    	testframe['district_type_code']=testframe['district_type_code'].fillna('')
    	testframe['dist_key']=testframe['district_type_code']+testframe['district_code']
    	#print testframe['dist_key'][testframe.index[0]]
    	#Create two new columns containing the district names and district codes in same format as enrollment and teacher data 	
    	for i in range(len(testframe.index)):
    		#both district code and district name are present
    		if districtnamedata.has_key(testframe['dist_key'][testframe.index[i]]) and districtcodedata.has_key(testframe['district_name'][testframe.index[i]]): 
    			#district code and district name are a match
    			if ((districtnamedata[testframe['dist_key'][testframe.index[i]]]==testframe['district_name'][testframe.index[i]]) and (districtcodedata[testframe['district_name'][testframe.index[i]]]==testframe['dist_key'][testframe.index[i]])):
    				districtnames.append(districtnamedata[testframe['dist_key'][testframe.index[i]]])
    				districtkeys.append(districtcodedata[testframe['district_name'][testframe.index[i]]])
    			#potential wrong mappings
    			else:
    				districtkeys.append(testframe['dist_key'][testframe.index[i]])
    				districtnames.append(districtnamedata[testframe['dist_key'][testframe.index[i]]])
    		else:
    			#check if district code is present
    			if districtnamedata.has_key(testframe['dist_key'][testframe.index[i]]):
    				districtkeys.append(testframe['dist_key'][testframe.index[i]])
    				districtnames.append(districtnamedata[testframe['dist_key'][testframe.index[i]]])
    			#check if only district name is present 
    			elif districtcodedata.has_key(testframe['district_name'][testframe.index[i]]):
    				districtnames.append(testframe['district_name'][testframe.index[i]])
    				districtkeys.append(districtcodedata[testframe['district_name'][testframe.index[i]]])
    			#complete nonmatches
    			else:
    				fdistrictnames.append(testframe['district_name'][testframe.index[i]])
    				fdistrictkeys.append(testframe['dist_key'][testframe.index[i]])
    	#extend the list by the complete nonmatches
    	districtnames.extend(fdistrictnames)
    	districtkeys.extend(fdistrictkeys)
    	
    	return districtnames,districtkeys
    	
    ########District Codes and District Names Data
    #add the missing district names and codes for enrollment and teacher
    if dfyearfound:
    	df2['districtNamesR'],df2['districtTypeCodesR']=addNamesCodes(df2,districtnamedata,districtcodedata)
    if teacheryearfound:
    	teacherframe['districtNamesR'],teacherframe['districtTypeCodesR']=addNamesCodes(teacherframe,districtnamedata,districtcodedata)
    if financialyearfound:
    	financialframe['districtNamesR'],financialframe['districtTypeCodesR']=addNamesCodes(financialframe,districtnamedata,districtcodedata)
    ###########Convert particular columns to proper data types
    #convert data into integers from string
    def intConvert(z):
    	"""Function used to convert values into integers and will be applied on the dataframe column
    	"""
    	if z!='':
    		return int(z)
    	else:
    		z=0
    		return z 
    
    #converts data into floats from string 		
    def floatConvert(z):
    	"""Function used to convert values into float and will be applied on the dataframe column"""
    	if z!='':
    		return float(z)
    	else:
    		z=0.0
    		return z
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,931
    Rep Power
    481
    I see your program has some print statements.
    Can you identify where the problem starts?
    Also, I recommend that the print statements show extra identifiers so you can tell which print statement produced what output.

    print 'a', 'whatever'
    print 'b', 'whatever'

    Code:
    if dfyearfound:
    	df2['district_name']=df2['district_name'].str.strip()
    	df2['district_code']=df2['district_code'].str.strip()
    	#df2['district_type_code']=df2['district_type_code'].str.strip()
    	print 'a',df2['district_code'][df2.index[0]]
    	#print df2['district_type_code'][df2.index[0]]
    if financialyearfound:
    	financialframe['district_code']=financialframe['district_code'].str.strip()
    	financialframe['district_name']=financialframe['district_name'].str.strip()
    	#financialframe['district_type_code']=financialframe['district_type_code'].str.strip()
    	print 'b',financialframe['district_code'][financialframe.index[0]]
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    3
    Rep Power
    0
    The solution was of two parts. The float duplicate key was being produced after the concatenation because of some null values in the district_type_code column in the source file. The second solution was to use zfill() to produce the necessary padding:

    Code:
    if dfyearfound:
    	df2['district_code'] = df2['district_code'].apply(lambda x: str(x).zfill(4))
    	df2['district_type_code'] = df2['district_type_code'].apply(lambda x: str(x).zfill(2))
    
    if teacheryearfound:
    	teacherframe['district_code'] = teacherframe['district_code'].apply(lambda x: str(x).zfill(4))
    	teacherframe['district_type_code'] = teacherframe['district_type_code'].apply(lambda x: str(x).zfill(2))
    	
    if financialyearfound:
    	financialframe['district_code'] = financialframe['district_code'].apply(lambda x: str(x).zfill(4))
    	financialframe['district_type_code'] = financialframe['district_type_code'].apply(lambda x: str(x).zfill(2))
    Although this works I hard coded the padding length. I would like to use this for multiple situations and parametrize zfill() with a parser argument -d. How would I go about doing that? I have the argument in place but commented out currently because I couldn't get it to work.

IMN logo majestic logo threadwatch logo seochat tools logo