#1
  1. Contributing User
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jun 2003
    Location
    Thessaloniki
    Posts
    1,284
    Rep Power
    13

    Encoding issue when renaming files from greek_bytes to utf8_bytes


    This all started when I used FileZilla to upload files with Greek filenames to my remote Linux server, with PuTTY as the SSH client, using Greek ISO (iso-8859-7) as the locale encoding setting, because Windows 8 used that by default.

    Everything works when the filenames in the directory are English.
    If I rename an English filename to a Greek one, I get the error shown at the end of my post.

    I know you guys know Linux, and there is a good chance you know Python too, so maybe you can help me out.

    Thank you.


    Code:
    #====================
    # Collect directory and its filenames as bytes
    path = b'/home/nikos/public_html/data/apps/'
    files = os.listdir( path )
    
    for filename in files:
    	# Compute 'path/to/filename'
    	filepath_bytes = path + filename
    	for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
    		try: 
    			filepath = filepath_bytes.decode( encoding )
    		except UnicodeDecodeError:
    			continue
            
    		# Rename to something valid in UTF-8 
    		if encoding != 'utf-8': 
    			os.rename( filepath_bytes, filepath.encode('utf-8') )
    
    		assert os.path.exists( filepath )
    		break 
    	else: 
    		# This only runs if we never reached the break
    		raise ValueError( 'unable to clean filename %r' % filepath_bytes ) 
    
    
    #========================================================
    # Collect filenames of the path dir as strings
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
    
    # Load'em
    for filename in filenames:
    	try:
    		# Check the presence of a file against the database and insert if it doesn't exist
    		cur.execute('''SELECT url FROM files WHERE url = %s''', (filename,) )
    		data = cur.fetchone()
    		
    		if not data:
    			# First time for file; primary key is automatic, hit is defaulted
    			print( "iam here", filename + '\n' )
    			cur.execute('''INSERT INTO files (url, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, lastvisit) )
    	except pymysql.ProgrammingError as e:
    		print( repr(e) )
    
    
    #========================================================
    # Collect filenames of the path dir as strings
    filenames = os.listdir( '/home/nikos/public_html/data/apps/' )
    filepaths = set()
    
    # Build a set of 'path/to/filename' based on the objects of path dir
    for filename in filenames:
    	filepaths.add( filename )
    
    # Delete spurious 
    cur.execute('''SELECT url FROM files''')
    data = cur.fetchall()
    
    # Check database's filenames against path's filenames
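    for rec in data:
    	# Each row is a tuple such as ('name',) with the default pymysql cursor, so compare its first column
    	if rec[0] not in filepaths:
    		cur.execute('''DELETE FROM files WHERE url = %s''', (rec[0],) )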

    When trying to run the above, I get:

    Code:
    [Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Original exception was:, referer: http://superhost.gr/
    [Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] Traceback (most recent call last):, referer: http://superhost.gr/
    [Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]   File "/home/nikos/public_html/cgi-bin/files.py", line 83, in <module>, referer: http://superhost.gr/
    [Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]     assert os.path.exists( filepath ), referer: http://superhost.gr/
    [Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/genericpath.py", line 18, in exists, referer: http://superhost.gr/
    [Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173]     os.stat(path), referer: http://superhost.gr/
    [Sun Jun 09 09:37:51 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-37: ordinal not in range(128), refere
    Why am I still receiving Unicode errors (this time a UnicodeEncodeError)?
    I have written a procedure precisely to avoid decoding issues, by renaming every filename stored as Greek-encoded bytes to UTF-8 bytes.
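
    For reference, here is a minimal sketch of the same rename procedure that uses bytes paths from start to finish. It rests on an assumption, not a confirmed diagnosis: the traceback shows os.stat() trying to encode a str path with the 'ascii' codec, which suggests the CGI process runs under a C/POSIX locale, so any str path containing Greek characters fails before the file is even looked up. The helper names to_utf8_name and normalize_filenames are hypothetical; the encoding list is the one from the loop above. Note that 'latin-1' accepts every byte sequence, so the ValueError branch is only a safety net.

```python
import os

# Candidate encodings, tried in order (same list as the loop in the post).
ENCODINGS = ('utf-8', 'iso-8859-7', 'latin-1')


def to_utf8_name(name_bytes):
    """Return name_bytes re-encoded as UTF-8, guessing its current encoding."""
    for encoding in ENCODINGS:
        try:
            return name_bytes.decode(encoding).encode('utf-8')
        except UnicodeDecodeError:
            continue
    # Unreachable while 'latin-1' is in the list, kept as a safety net.
    raise ValueError('unable to clean filename %r' % name_bytes)


def normalize_filenames(dir_bytes):
    """Rename every entry of dir_bytes to a UTF-8 byte name; return the renames."""
    renamed = []
    for name in os.listdir(dir_bytes):  # bytes in, bytes out
        new_name = to_utf8_name(name)
        if new_name != name:
            os.rename(os.path.join(dir_bytes, name),
                      os.path.join(dir_bytes, new_name))
            renamed.append((name, new_name))
        # The check uses a bytes path, so no str-to-bytes encoding happens here.
        assert os.path.exists(os.path.join(dir_bytes, new_name))
    return renamed
```

    It would be called as normalize_filenames(b'/home/nikos/public_html/data/apps/'). The only difference from the loop above is that the existence check receives bytes instead of the decoded str.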

    Can you help please?
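
    A related sketch for the database part, under the same assumption that the process locale is ASCII-only: calling os.listdir() with a str path makes Python decode the names with the filesystem encoding, so non-ASCII names may come back with surrogate escapes that cannot be sent to MySQL. Listing with a bytes path and decoding explicitly as UTF-8 avoids depending on the locale. The helper name list_utf8_names is hypothetical.

```python
import os
import sys


def list_utf8_names(dir_bytes):
    """List dir_bytes and decode each entry as UTF-8, skipping names that are not UTF-8."""
    names = []
    for raw in os.listdir(dir_bytes):  # a bytes path returns bytes names
        try:
            names.append(raw.decode('utf-8'))
        except UnicodeDecodeError:
            # Not valid UTF-8 (yet); report it instead of guessing an encoding.
            sys.stderr.write('skipping non-UTF-8 name %r\n' % raw)
    return names


# The value the interpreter uses to encode str paths; 'ascii' would explain the traceback.
sys.stderr.write('filesystem encoding: %s\n' % sys.getfilesystemencoding())
```

    The returned list could replace the two os.listdir( '/home/nikos/public_html/data/apps/' ) calls, since the names it yields are ordinary str objects suitable as query parameters.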
    Last edited by Nik; June 10th, 2013 at 02:28 AM.
    What is now proved was once only imagined!
