#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    23
    Rep Power
    0

    Encoding question


    Hi there,

    I have another question about codecs.

    From my understanding, encoding means taking a Unicode string and converting it to bytes in a specified character encoding, for example str.encode("utf8").

    Decoding means taking an already encoded byte string and converting it back to a Unicode string of code points, correct?

    So I have this simple function (got it from a book) and I'm not sure what the decode() method does here.


    python Code:
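    import urllib.request

    def seek():
        # urlopen() returns a response object; read() gives the body as raw bytes
        web = urllib.request.urlopen("http://wecloudforyou.com/")
        # decode() turns those bytes into a text string (str), here assuming UTF-8
        text = web.read().decode("utf8")
        return text

    texto = seek()
    print(texto)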


    When I run this code, I get the HTML found on http://wecloudforyou.com/, exactly as it's seen on the website, with indentation and all.

    HTML Code:
    <!DOCTYPE html>
    <html>
        <head>
           <title>We Cloud for You |


    If I remove decode("utf8") from the function, I get the HTML but with no indentation and I see a lot of "\n" all around the code:

    HTML Code:
    <!DOCTYPE html>\n<html>\n    <head>\n       <title>We Cloud for You |


    So, if decode takes an encoded string and converts it to Unicode, why does using decode("utf8") return a perfect "copy" of the HTML code, while removing it returns the same code with no indentation and these weird "\n" everywhere?

    I would appreciate it if someone could confirm that my understanding of encode vs decode is correct, and explain why I'm seeing this behavior when using decode().

    Thank you very much.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    114
    Rep Power
    3
    If you don't decode, you're getting binary data (a bytestring), not text. For historical reasons, the repr for bytestrings displays them as ASCII text, but they are quite different from text strings in Python 3.

    An encoding is a way of transforming text into bytes. UTF-8 is a popular encoding. When you decode("utf-8") you turn the bytes back into text.

    This is highly oversimplified. Here is an article that explains it in a bit more depth.
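
    As a minimal sketch of that round trip (the sample string is just an illustration, not something from the original question), encoding a str gives a bytes object and decoding gives the str back:

```python
text = "café\n"                 # a str: a sequence of Unicode code points
data = text.encode("utf-8")     # a bytes object: the UTF-8 encoding of that text

print(type(text).__name__, len(text))   # str 5
print(type(data).__name__, len(data))   # bytes 6 ("é" takes two bytes in UTF-8)
print(data)                             # b'caf\xc3\xa9\n'

# Decoding with the same encoding reverses the encoding step.
assert data.decode("utf-8") == text
```

    Note that printing the bytes object shows its repr, with escapes such as \n and \xc3 written out, rather than the characters they stand for.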
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    23
    Rep Power
    0
    Thanks,

    I'm familiar with the article, but that doesn't explain why I get non-indented text and "\n" all over the text. (Unless I'm not understanding the concepts well)

    I don't know if it's a problem with the character encoding that urllib.request.urlopen() uses; I couldn't find any docs on this.

    According to Python docs, read() returns a string.

    http://docs.python.org/2/tutorial/inputoutput.html

    Read()...reads some quantity of data and returns it as a string.

    So, according to what I read, decode is just the opposite of encode. So to decode, is to get an encoded string and convert it to unicode, but again, I could be wrong.

    I appreciate any help. Thanks!
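
    On the question of which encoding applies: as far as I can tell, urlopen() does not decode anything itself, and in Python 3 the response's read() hands back bytes. The server may name a charset in its Content-Type header, which the response exposes through web.headers. The sketch below is only an illustration under those assumptions: decode_body is a hypothetical helper, and an email.message.Message object stands in for web.headers so that it runs without a network connection.

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes using the charset named in the headers, if any.

    Hypothetical helper: `headers` is expected to behave like the
    `headers` attribute of a urlopen() response.
    """
    charset = headers.get_content_charset() or default
    return raw.decode(charset)


# Offline stand-in for web.headers, declaring a non-UTF-8 charset.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

print(decode_body(b"caf\xe9\n", headers))
```

    With a real response this would be called as decode_body(web.read(), web.headers). The fallback to UTF-8 when no charset is declared is a guess, not something the server guarantees.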
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    114
    Rep Power
    3
    Originally Posted by pgonzaleznet
    I'm familiar with the article, but that doesn't explain why I get non-indented text and "\n" all over the text. (Unless I'm not understanding the concepts well)
    I suppose I buried the answer in there a bit, sorry. The reason is: because it is not a string! The "\n" you see is a byte 0A, which in ASCII and UTF-8 is a newline character. As such, when you decode to a text string and print it, you see a newline.

    In fact, the \n escape sequence is used to represent newlines in text strings as well:
    Code:
    >>> "Hello\nWorld"
    'Hello\nWorld'
    It just doesn't appear when printing them:
    Code:
    >>> print("Hello\nWorld")
    Hello
    World
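
    The same thing can be seen with a bytestring, which is roughly the situation in the original question. This is a small sketch with a made-up fragment of HTML rather than the real page:

```python
raw = b"<html>\n    <head>\n"   # bytes, like the result of web.read()

as_shown = str(raw)             # the text that print(raw) would display
as_text = raw.decode("utf-8")   # a real text string

print(as_shown)   # one line, with each newline written as the escape \n
print(as_text)    # two lines, with the indentation visible

# The byte behind the "\n" escape is 0x0A in both cases.
assert raw[6] == 0x0A
```

    So the newlines and the indentation are present in the bytes all along; only the way they are displayed differs.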
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Posts
    23
    Rep Power
    0
    Hi thanks,

    After posting the original question I figured out the issue with the "\n" character.

    After doing some research, I found the following.

    My sys.stdout.encoding is CP1252 (Windows 1252 encoding)

    According to what I read somewhere:

    "Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data. - Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding. - Python gets that setting from the shell's environment. - the terminal displays output according to its own encoding settings. - the terminal's encoding is independent from the shell's."


    So, I believe that Python is just getting the text as raw data, and that includes the "\n".

    If I tell it to decode("utf8"), it then interprets that "\n" as a new line, and that's why the indentation shows up when the code is printed.

    "\n" is also supported on CP1252, so it's displayed correctly on the screen.

    Thank you very much for your help. Please do let me know if my understanding of this is wrong or something, thanks!
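
    One way to check the sys.stdout.encoding part without depending on a particular terminal is to print to an in-memory stream set up like a CP1252 console. This is only a sketch: fake_stdout is a stand-in built for the example, not the real sys.stdout.

```python
import io

# An in-memory byte buffer wrapped as a text stream that encodes to CP1252.
# newline="\n" turns off newline translation so the bytes are easy to compare.
buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(buffer, encoding="cp1252", newline="\n")

print("café\n<p>", file=fake_stdout)   # print() writes a str to the stream
fake_stdout.flush()

written = buffer.getvalue()
print(written)   # b'caf\xe9\n<p>\n'
```

    The str is encoded with the stream's encoding on the way out, so "é" becomes the single CP1252 byte 0xE9 and each newline becomes byte 0x0A. Printing a bytes object in Python 3 is a different case: no encoding of its contents takes place, and what gets written is its repr.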
