#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    4
    Rep Power
    0

    Reading Unicode data from server and writing to a file


    Hello,
    I am fetching some strings from a server, each of which contains one particular UTF-8 encoded character. I need to split each string on that character and store the two parts separately in 2 different places.
    For example: 詳細2.3

    The character between 詳細 and 2.3 in the above string is the Unicode character U+F8FF (UTF-8: EF A3 BF). It is in the Private Use Area, so it may not be visible here.

    I need to split the string on that Unicode character.
    Here is my code:
    ---------------------------------------
    text = open(r"C:\tempchar.txt").read()

    newpart = text.decode('utf-8').split(u"\uf8ff")
    firstpart = newpart[::2] #some manipulation on this later
    secondpart = newpart[1::2] #some manipulation on this later

    --------------------------------------
    When I try this on a sample string from the text file, as in the code above, it works fine, and I can print the text at the cmd prompt.

    But, when I do the same thing with the input string from a server, I get the following error:
    "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)"

    I have never seen this error before and have no idea what could be causing this. PLEASE HELP!
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,854
    Rep Power
    481
    I would think that if you read information into a_variable from the server and showed its type

    print(type(a_variable))

    the answer would differ from the type

    print(type(open(r"C:\tempchar.txt").read()))

    Once you knew this you'd have the clues you need to find the solution.
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    4
    Rep Power
    0
    Thanks for your reply.
    It shows <type 'unicode'> for all strings that I read from the server.
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,854
    Rep Power
    481
    Great, you performed half of the experiment. The half I couldn't.

    You're using Python 2. We know this because the type is reported as <type 'unicode'>, and Python 3 has no type named unicode (its text type is str).

    The data type resulting from read in your test is str.

    >>> print(type(open('/tmp/c.c').read()))
    <type 'str'>

    Got it? Your server gives unicode, while your file-based test uses str. This Python 2 statement works as I'd expect:

    >>> u'詳細2.3'.split(u'')
    [u'\u8a73\u7d30', u'2.3']


    Note to vision impaired yet gentle readers: the bit that appears as an empty string actually contains U+F8FF, a Private Use Area character that many fonts render as blank:

    >>> ord(u'')
    63743
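
    A minimal sketch of one way to make the split indifferent to which type arrives, assuming that any byte strings are UTF-8 encoded. The names to_text, split_on_separator and SEPARATOR are made up for illustration and are not from the original code. It decodes only when handed a byte string, so a value that is already unicode is never decoded a second time:

```python
# -*- coding: utf-8 -*-
SEPARATOR = u"\uf8ff"  # the Private Use Area character used as the delimiter


def to_text(value, encoding="utf-8"):
    """Return text (unicode), decoding only if value is a byte string."""
    # On Python 2, bytes is an alias for str; on Python 3 it is the bytes type.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def split_on_separator(value):
    """Split a byte string or a text string on U+F8FF."""
    return to_text(value).split(SEPARATOR)


from_server = u"\u8a73\u7d30\uf8ff2.3"                   # already unicode
from_file = b"\xe8\xa9\xb3\xe7\xb4\xb0\xef\xa3\xbf2.3"  # raw UTF-8 bytes

server_parts = split_on_separator(from_server)
file_parts = split_on_separator(from_file)
```

    Both calls give the same two-element list, [u'\u8a73\u7d30', u'2.3'].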
    Last edited by b49P23TIvg; November 16th, 2012 at 08:29 AM.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    4
    Rep Power
    0
    I see. However, the test with the input file works perfectly.

    The error message appears only when I read the strings from the server in my code.
    The error message says "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)"
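
    One likely explanation, offered as an assumption since the code that talks to the server is not shown: the strings from the server are already unicode, so calling .decode('utf-8') on them makes Python 2 first encode them with the default 'ascii' codec, and that hidden step is what raises UnicodeEncodeError. The sketch below reproduces that step explicitly (so it behaves the same on Python 2 and 3), then writes the two parts to a file with an explicit encoding via io.open. The sample string and the temporary output path are placeholders:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Stand-in for a value that the server library has already decoded to unicode.
server_text = u"\u8a73\u7d30\uf8ff2.3"

# Python 2 runs unicode.decode('utf-8') as encode('ascii') then decode('utf-8').
# The explicit encode below reproduces that hidden first step.
try:
    server_text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# No decode is needed for unicode input: split it directly.
parts = server_text.split(u"\uf8ff")

# Write with an explicit encoding instead of relying on the ASCII default.
out_path = os.path.join(tempfile.mkdtemp(), "parts.txt")  # hypothetical location
with io.open(out_path, "w", encoding="utf-8") as handle:
    handle.write(u"\n".join(parts))

# Read the file back to confirm the round trip.
with io.open(out_path, "r", encoding="utf-8") as handle:
    round_trip = handle.read().split(u"\n")
```

    If this is indeed the cause, dropping the .decode('utf-8') call for values that are already unicode should make the error go away.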
