April 15th, 2013, 09:59 PM
ANSI file type with UTF-8 charset?
I have a little question:
I have a webpage with French text. I saved the file with the ANSI encoding using Notepad, and I set the charset in the code to this: <meta charset="utf-8" />.
But when I view the page in the browser, some of the characters do not show up. Why is that? They do show properly when I change the charset.
I really want to understand what is going on, just for the sake of my own understanding. Why should the UTF-8 charset break things in this case?
April 15th, 2013, 11:13 PM
If you used any characters with accents, graves, cedillas, or pretty much anything not found on a US keyboard, then it's not ASCII compatible and you need to use UTF-8.
What characters specifically?
April 15th, 2013, 11:59 PM
April 20th, 2013, 01:15 PM
I know that ANSI and UTF-8 are different. French, I believe, falls safely under ANSI. I am making a French ebook (using HTML) and the ereader only uses ANSI.
If I use ANSI for both the file encoding and the charset, the French characters show up properly in web browsers, but if I change the encoding to UTF-8, things go wrong. Why would that happen if UTF-8 covers more characters than ANSI?
April 20th, 2013, 04:44 PM
The second sentence of my previous reply is the answer to that. The upper range of characters in "ANSI" (128-255) are not the same for a given code point in Unicode. Unicode maps such characters differently.
April 20th, 2013, 06:07 PM
As Kravvitz's ANSI link says, ANSI is not an encoding but rather a set of encodings. Just like Unicode. If you're using "ANSI encoding" then you're probably using Windows-1252.
April 20th, 2013, 08:08 PM
No, Unicode is not an encoding or a set of encodings. It's a character set, which maps characters to code points. And to pack these code points into bytes, there are several alternative encoding schemes like UTF-8, UTF-16, UTF-32 or UTF-7. They all encode the Unicode characters, but in a different way.
I think the confusion of character sets and encodings is the actual problem here. The character set Unicode is indeed mostly compatible with the "ANSI" character set (or Windows‑1252). With a few exceptions, they share the same code points for their common characters. But the encoding UTF-8 is fundamentally different from the Windows‑1252 encoding.
In fact, a Windows‑1252 encoded string usually isn't even valid input for a UTF-8 decoder. Windows‑1252 is a simple single-byte encoding that directly uses the binary representation of the code points. So the code point 0 would be encoded as 00000000, 1 would be 00000001, etc. On the other hand, UTF-8 is a variable-width multibyte encoding.
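To see that for yourself, here is a minimal Python sketch. It assumes that Notepad's "ANSI" means Windows‑1252 (Python's "cp1252" codec); the sample word and variable names are only for illustration.

```python
# "caf\u00e9" is the word "café"; the escape keeps this source file pure ASCII.
text = "caf\u00e9"

# Assumption: "ANSI" in Notepad means Windows-1252 (codec name "cp1252").
ansi_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

print(ansi_bytes)  # b'caf\xe9'      (one byte for the accented letter)
print(utf8_bytes)  # b'caf\xc3\xa9'  (two bytes for the accented letter)

# A strict UTF-8 decoder rejects the Windows-1252 bytes outright.
try:
    ansi_bytes.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(is_valid_utf8)  # False
```

The lone byte 0xE9 looks like the start of a multibyte sequence to the UTF-8 decoder, but the bytes that should follow it are missing, so decoding fails.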
As a concrete example:
The character "é" has the code point 233 in both the Windows‑1252 and the Unicode character set.
The Windows‑1252 encoding directly uses the binary representation of 233, so the character is encoded as
11101001
UTF-8 needs two bytes to encode the character, because the leading bits of every byte in a multibyte sequence are used to mark the sequence and store its length. The output is
11000011 10101001
with the actual "payload" being the last five bits of the first byte and the last six bits of the second byte (00011 101001).
To explain the pattern: The first byte of a multibyte sequence starts with "11". The number of leading "1" bits indicates the number of bytes in the sequence. In this case you have two bytes, so it's "110...". All other bytes of the sequence start with "10". The remaining bits are used for the binary representation of the code point, right-aligned and padded with leading zeros. Hence 11000011 10101001.
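The two-byte pattern can also be built by hand. The following is only a sketch: the helper name utf8_two_byte is made up for this post, and it only handles code points that fit into two bytes (128 to 2047). Real code would simply call str.encode("utf-8").

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Hand-rolled UTF-8 encoding for code points that need exactly two bytes.

    Hypothetical helper for illustration only.
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 128..2047 use two bytes")
    # First byte: marker 110, followed by the upper 5 payload bits.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: marker 10, followed by the lower 6 payload bits.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


encoded = utf8_two_byte(233)
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)  # 11000011 10101001

# Cross-check against Python's built-in encoder.
print(encoded == chr(233).encode("utf-8"))  # True
```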
As you can see, the result is completely different, despite both character sets sharing the same code point 233 for this character. So when you declare a Windows‑1252 string as UTF-8, you'll get total garbage. The decoder will not even recognize the number of characters correctly, because it thinks every byte starting with "1" belongs to a multibyte sequence.
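Both mix-ups from this thread can be reproduced in a few lines of Python. Again, this is a sketch that assumes "ANSI" is Windows‑1252; a browser handles broken bytes in its own way, so the exact output on a real page may differ.

```python
# "\u00e9t\u00e9" is the word "été"; escapes keep the source pure ASCII.
word = "\u00e9t\u00e9"

# Case 1: file saved as Windows-1252, but declared as UTF-8.
# errors="replace" substitutes U+FFFD (the replacement character) for
# bytes that are not valid UTF-8 instead of raising an exception.
as_utf8 = word.encode("cp1252").decode("utf-8", errors="replace")
print(ascii(as_utf8))  # '\ufffdt\ufffd'

# Case 2: file saved as UTF-8, but declared as Windows-1252.
# Every byte is taken as one character, so each accented letter
# turns into two unrelated characters.
as_cp1252 = word.encode("utf-8").decode("cp1252")
print(ascii(as_cp1252))             # '\xc3\xa9t\xc3\xa9'
print(len(word), len(as_cp1252))    # 3 5
```

In the second case the decoder does not complain at all. It just quietly produces five characters where there should be three.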
The only exception is when you only have ASCII characters. Both UTF-8 and the Windows‑1252 encoding are ASCII compatible, so they output the same bit patterns for ASCII characters -- however, in that case you'd rather declare the whole thing as ASCII in the first place.
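A quick check of the ASCII case, as a small Python sketch (the sample string is arbitrary):

```python
ascii_text = "Hello, world"

# For pure ASCII text, all three encodings yield identical bytes.
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("cp1252")
    == ascii_text.encode("utf-8")
)
print(same_bytes)  # True
```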
Long story short: The character sets Unicode and Windows‑1252 are roughly compatible. The encodings UTF-8 and Windows‑1252 are not. Confusing them leads to garbage output.
Fortunately, UTF-8 is on its way to becoming the standard encoding on the internet. This should finally put an end to the encoding chaos. The only problem is that many crap tutorials still use "ANSI" and that many people still copy and paste their code from those tutorials -- and that some eReaders just don't support Unicode.
Last edited by Jacques1; April 20th, 2013 at 08:11 PM.