#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2003
    Posts
    390
    Rep Power
    48

    ANSI file type with UTF-8 charset?


    Hello everyone.

    I have a little question:

    I have a webpage with French text and I saved the file with the ANSI encoding using notepad, and I set the charset in the code to this: <meta charset="utf-8" />.

    But when I view the page in the browser, some of the characters do not show up. Why is that? They do show properly when I change the charset.

    I really want to understand what is going on, just for the sake of my own understanding. Why should the UTF-8 charset break things in this case?

    Thanks.
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,068
    Rep Power
    9398
    If you used any characters with accents, graves, cedillas, or pretty much anything not found on a US keyboard, then it's not ASCII compatible and you need to use UTF-8.

    What characters specifically?
  4. #3
  5. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,127
    Rep Power
    4304
    ANSI and UTF-8 are different. They encode characters beyond the lower range of characters differently.

    Here's a good article on it: The Definitive Guide to Web Character Encoding

    And here's one about ANSI: http://stackoverflow.com/questions/7...is-ansi-format

    These days it's best to use UTF-8 for everything, if you can. (Some older programs don't support Unicode/UTF-8.)
    Last edited by Kravvitz; April 15th, 2013 at 11:04 PM.
    Spreading knowledge, one newbie at a time.

    Check out my blog. | Learn CSS. | PHP includes | X/HTML Validator | CSS validator | Common CSS Mistakes | Common JS Mistakes

    Remember people spend most of their time on other people's sites (so don't violate web design conventions).
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2003
    Posts
    390
    Rep Power
    48
    I know that ANSI and UTF-8 are different. French I believe falls under ANSI safely. I am making a French ebook (using HTML) and the ereader only uses ANSI.

    If I use ANSI for both the file encoding and the charset, the French characters show up properly in web browsers, but if I change the encoding to UTF-8, things go wrong. Why would that happen if UTF-8 covers more characters than ANSI?
  8. #5
  9. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,127
    Rep Power
    4304
    The second sentence of my previous reply is the answer to that. The upper range of characters in "ANSI" (128-255) are not the same for a given code point in Unicode. Unicode maps such characters differently.
  10. #6
  11. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,068
    Rep Power
    9398
    As Kravvitz's ANSI link says, ANSI is not an encoding but rather a set of encodings. Just like Unicode. If you're using "ANSI encoding" then you're probably using Windows-1252.
  12. #7
  13. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    No, Unicode is not an encoding or a set of encodings. It's a character set, which maps characters to code points. And to pack these code points into bytes, there are several alternative encoding schemes like UTF-8, UTF-16, UTF-32 or UTF-7. They all encode the Unicode characters, but in a different way.

    I think the confusion of character sets and encodings is the actual problem here. The character set Unicode is indeed mostly compatible with the "ANSI" character set (or Windows‑1252). With a few exceptions, they share the same code points for their common characters. But the encoding UTF-8 is fundamentally different from the Windows‑1252 encoding.
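
A small Python sketch of that point (my own illustration, not from the original post; it assumes Python's "cp1252" codec as a stand-in for Notepad's "ANSI"). It decodes one byte where the two character sets agree and one where they don't:

```python
# Decode single Windows-1252 bytes and look at the Unicode code points they map to.
shared = bytes([0xE9]).decode("cp1252")    # byte 233, the e with acute accent
differs = bytes([0x80]).decode("cp1252")   # byte 128, the euro sign in Windows-1252

shared_code_point = ord(shared)     # same number as the byte value
differs_code_point = ord(differs)   # one of the exceptions: not 128 in Unicode

print(shared_code_point)            # 233
print(hex(differs_code_point))      # 0x20ac
```

The exceptions sit in the byte range 0x80 to 0x9F; ASCII and the range 0xA0 to 0xFF line up with Unicode.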

    In fact, a Windows‑1252 encoded string usually isn't even valid input for a UTF-8 decoder. Windows‑1252 is a simple single byte encoding that directly uses the binary representation of the code points. So the code point 0 would be encoded as 00000000, 1 would be 00000001 etc. On the other hand, UTF-8 is a variable-width multibyte encoding.

    As a concrete example:

    The character "é" has the code point 233 in both the Windows‑1252 and the Unicode character set.

    The Windows‑1252 encoding directly uses the binary representation of 233, so the character is encoded as

    11101001

    UTF-8 needs two bytes to encode the character, because the leading bits of every byte in a multibyte sequence are used to mark the sequence and store its length. The output is

    11000011 10101001

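    with the actual "payload" being the last five bits of the first byte (00011) and the last six bits of the second byte (101001). Together they form 00011101001, which is 233.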

    To explain the pattern: The first byte of a multibyte sequence starts with "11". The number of leading "1" bits indicates the number of bytes in the sequence. In this case you have two bytes, so it's "110...". All other bytes of the sequence start with "10". The remaining bits are used for the (left-padded) binary representation of the code point. Hence 11000011 10101001.
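
The bit pattern described above can be reproduced by hand. This is only a sketch for the two-byte case; the helper names `encode_two_byte_utf8` and `bits` are made up for the illustration, and "cp1252" is Python's name for Windows‑1252:

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as a two-byte UTF-8 sequence."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles two-byte sequences")
    lead = 0b11000000 | (code_point >> 6)                # "110" + upper 5 payload bits
    continuation = 0b10000000 | (code_point & 0b111111)  # "10" + lower 6 payload bits
    return bytes([lead, continuation])


def bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{byte:08b}" for byte in data)


e_acute = "\u00e9"  # the e with acute accent, code point 233
manual = encode_two_byte_utf8(ord(e_acute))

print(bits(e_acute.encode("cp1252")))  # 11101001
print(bits(manual))                    # 11000011 10101001
print(manual == e_acute.encode("utf-8"))  # True
```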

    As you can see, the result is completely different, despite both character sets sharing the same code point 233 for this character. So when you declare a Windows‑1252 string as UTF-8, you'll get total garbage. The decoder will not even recognize the number of characters correctly, because it thinks every byte starting with "1" belongs to a multibyte sequence.
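
To see the mismatch in both directions, here is a rough Python sketch (the sample word "café" is my own choice, and browsers may handle invalid bytes somewhat differently from Python's decoder):

```python
raw = "caf\u00e9".encode("cp1252")  # "café" saved as "ANSI": b"caf\xe9"

# Case 1: Windows-1252 bytes declared as UTF-8.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as error:
    strict_result = f"invalid UTF-8: {error.reason}"

# A forgiving decoder substitutes U+FFFD (the replacement character) instead of failing.
lenient = raw.decode("utf-8", errors="replace")

# Case 2: UTF-8 bytes declared as Windows-1252. Each byte becomes its own character.
mojibake = "caf\u00e9".encode("utf-8").decode("cp1252")

print(strict_result)
print(ascii(lenient))   # 'caf\ufffd'
print(ascii(mojibake))  # 'caf\xc3\xa9'
```

The first case matches the original question: the accented characters do not show up, or show up as a replacement glyph. The second case produces two wrong characters for every accented letter.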

    The only exception is when you only have ASCII characters. Both UTF-8 and the Windows‑1252 encoding are ASCII compatible, so they output the same bit patterns for ASCII characters -- however, in that case you'd rather declare the whole thing as ASCII in the first place.
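
A quick way to check the ASCII case, again just a sketch with a sample string of my choosing:

```python
text = "Hello everyone."  # ASCII characters only

# All three encodings produce identical bytes for pure ASCII text.
same_bytes = text.encode("ascii") == text.encode("cp1252") == text.encode("utf-8")

print(same_bytes)  # True
```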

    Long story short: The character sets Unicode and Windows‑1252 are roughly compatible. The encodings UTF-8 and Windows‑1252 are not. Confusing them leads to garbage output.

    Fortunately, UTF-8 is on its way to becoming the standard encoding on the internet. This should finally put an end to the encoding chaos. The only problem is that many crap tutorials still use "ANSI" and that many people still copy and paste their code from those tutorials -- and that some eReaders just don't support Unicode.
    Last edited by Jacques1; April 20th, 2013 at 07:11 PM.