June 30th, 2013, 03:18 PM
I have another question about codecs.
From my understanding, encoding means taking a Unicode string and converting it to bytes in a specified character encoding, for example str.encode("utf8").
Decoding means taking an already encoded string and converting it back to a Unicode string of code points, correct?
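To make the encode/decode round trip concrete, here is a minimal sketch. The sample word is just an illustration, not something from the book:

```python
# "caf\u00e9" is the four-character word "cafe" with an accented e (U+00E9).
text = "caf\u00e9"

encoded = text.encode("utf8")     # str -> bytes
decoded = encoded.decode("utf8")  # bytes -> str

print(type(encoded).__name__, encoded)  # bytes b'caf\xc3\xa9'
print(len(text), len(encoded))          # 4 5  (4 code points, 5 bytes)
print(decoded == text)                  # True
```

The accented character takes two bytes in UTF-8, which is why the byte length differs from the number of code points.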
So I have this simple function (got it from a book) and I'm not sure what the decode() method does here.
import urllib.request
web = urllib.request.urlopen("http://wecloudforyou.com/")
text = web.read().decode("utf8")
texto = seek()
When I run this code, I get the HTML found on http://wecloudforyou.com/, exactly as it's seen on the website, with indentation and all.
<title>We Cloud for You |
If I remove decode("utf8") from the function, I get the HTML but with no indentation and I see a lot of "\n" all around the code:
<!DOCTYPE html>\n<html>\n <head>\n <title>We Cloud for You |
So, if decode takes an encoded string and converts it to Unicode, why does using decode("utf8") return a perfect "copy" of the HTML code, while removing it returns the same code with no indentation and these weird "\n" everywhere?
I would appreciate it if someone could confirm that my understanding of encode vs decode is correct and explain why I'm seeing this behavior when using decode().
Thank you very much.
June 30th, 2013, 04:45 PM
If you don't decode, you're getting binary data (a bytestring), not text. For historical reasons, the repr for bytestrings displays them as ASCII text, but they are quite different from text strings in Python 3.
An encoding is a way of transforming text into bytes. UTF-8 is a popular encoding. When you decode("utf-8") you turn the bytes back into text.
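A small sketch of the difference, using a hard-coded bytes literal as a stand-in for what web.read() returns (the literal is my assumption, not the real page):

```python
# Stand-in for the bytes that web.read() would return.
raw = b"<!DOCTYPE html>\n<html>\n  <head>\n"

text = raw.decode("utf-8")  # bytes -> str

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str

print(raw)   # shows the repr: one line, with \n written as an escape
print(text)  # shows the text: real line breaks and indentation
```

Printing raw gives the b'...' form on a single line, while printing text gives three separate lines.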
This is highly oversimplified. Here is an article that explains it in a bit more depth.
June 30th, 2013, 05:13 PM
I'm familiar with the article, but that doesn't explain why I get non-indented text and "\n" all over the text. (Unless I'm not understanding the concepts well)
I don't know if it's a problem with the character encoding that urllib.request.urlopen() uses; I couldn't find any docs on this.
According to Python docs, read() returns a string.
Read()...reads some quantity of data and returns it as a string.
So, according to what I read, decode is just the opposite of encode. To decode is to take an encoded string and convert it to Unicode, but again, I could be wrong.
I appreciate any help. Thanks!
June 30th, 2013, 07:10 PM
I suppose I buried the answer in there a bit, sorry. The reason is: because it is not a string! The "\n" you see is a byte 0A, which in ASCII and UTF-8 is a newline character. As such, when you decode to a text string and print it, you see a newline.
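You can check the byte value yourself. A quick sketch with a made-up three-byte value:

```python
raw = b"a\nb"  # three bytes: 'a', newline, 'b'

# Indexing a bytes object gives the integer value of that byte.
print(raw[1])       # 10
print(hex(raw[1]))  # 0xa

# After decoding, that same byte is the newline character in the text.
decoded = raw.decode("utf-8")
print(decoded.splitlines())  # ['a', 'b']
```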
Originally Posted by pgonzaleznet
In fact, the \n escape sequence is used to represent newlines in text strings as well:
It just doesn't appear when printing them:
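Something along these lines shows both cases. This is only a sketch with a made-up string:

```python
s = "line one\nline two"  # an ordinary text string (str)

# repr() shows the escape sequence, just like the bytes repr does.
print(repr(s))  # 'line one\nline two'

# print() writes the actual characters, so the newline becomes a line break.
print(s)
```

The second print produces two lines of output, "line one" and "line two".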
June 30th, 2013, 09:21 PM
After posting the original question I figured out the issue with the "\n" character.
After doing some research, I found the following.
My sys.stdout.encoding is CP1252 (Windows 1252 encoding)
According to what I read somewhere:
"Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- the terminal displays output according to its own encoding settings.
- the terminal's encoding is independent from the shell's."
So, I believe that Python is just getting the text as raw data, and that includes the "\n".
If I tell it to decode("utf8"), it then interprets the "\n" as a newline, and that's why the indentation shows up when the code is printed.
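Here is a rough sketch of how this could be checked. The bytes literal stands in for the downloaded page and is my assumption. As far as I can tell, in Python 3 print() on a bytes object just prints its repr, regardless of the terminal encoding:

```python
import sys

# Whatever encoding Python uses for stdout (CP1252 on my machine).
print(sys.stdout.encoding)

data = b"<html>\n <head>\n"  # stand-in for the raw bytes from web.read()

# str() of a bytes object is its repr, so the newline shows up as
# the two characters backslash and n.
shown = str(data)
print(shown)  # b'<html>\n <head>\n'

# Decoding gives a str containing a real newline character.
text = data.decode("utf8")
print(text)

# The newline is the same single byte in both UTF-8 and CP1252.
newline_same = "\n".encode("utf8") == "\n".encode("cp1252") == b"\n"
print(newline_same)  # True
```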
"\n" is also supported on CP1252, so it's displayed correctly on the screen.
Thank you very much for your help. Please do let me know if my understanding of this is wrong or something, thanks!