No, Unicode is not an encoding or a set of encodings. It's a character set: it maps characters to code points. To pack those code points into bytes, there are several alternative encoding schemes such as UTF-8, UTF-16, UTF-32 or UTF-7. They all encode the same Unicode characters, just in different ways.
I think the confusion of character sets and encodings is the actual problem here. The character set Unicode is indeed mostly compatible with the "ANSI" character set (i.e. Windows-1252): with a few exceptions, they share the same code points for their common characters. But the encoding UTF-8 is fundamentally different from the Windows-1252 encoding.
In fact, a Windows-1252 encoded string usually isn't even valid input for a UTF-8 decoder. Windows-1252 is a simple single-byte encoding that directly uses the binary representation of the code points: code point 0 is encoded as 00000000, code point 1 as 00000001, and so on. UTF-8, on the other hand, is a variable-width multibyte encoding.
As a concrete example:
The character "é" has the code point 233 in both the Windows‑1252 and the Unicode character set.
The Windows-1252 encoding directly uses the binary representation of 233, so the character is encoded as the single byte 11101001.
UTF-8 needs two bytes to encode the character, because the leading bits of each byte are used to mark multibyte sequences and store their length. The output is 11000011 10101001, with the marker bits at the front of each byte and the actual "payload" bits (00011 and 101001) filling the rest.
To explain the pattern: the first byte of a multibyte sequence starts with "11", and the number of leading "1" bits indicates the length of the sequence. Here the sequence is two bytes long, so the first byte starts with "110...". All following bytes of the sequence start with "10". The remaining bits hold the binary representation of the code point, left-padded with zeros to fill the available payload bits. Hence 11000011 10101001.
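If you want to verify those bit patterns yourself, here is a minimal Python sketch using the built-in cp1252 and utf-8 codecs (the variable names are just for illustration):

```python
text = "é"                       # code point 233 (U+00E9)

cp1252 = text.encode("cp1252")   # Windows-1252: one byte per character
utf8   = text.encode("utf-8")    # UTF-8: two bytes for this character

print([f"{b:08b}" for b in cp1252])  # ['11101001']
print([f"{b:08b}" for b in utf8])    # ['11000011', '10101001']

# Rebuild the UTF-8 bytes by hand from the code point to confirm the pattern:
cp = ord(text)                        # 233
byte1 = 0b11000000 | (cp >> 6)        # "110" marker + upper 5 payload bits
byte2 = 0b10000000 | (cp & 0b111111)  # "10" marker + lower 6 payload bits
assert bytes([byte1, byte2]) == utf8
```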
As you can see, the result is completely different, even though both character sets assign the same code point 233 to this character. So when you declare a Windows-1252 string as UTF-8, you get total garbage. The decoder won't even determine the number of characters correctly, because it assumes that every byte starting with "1" belongs to a multibyte sequence.
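A short sketch of what that mismatch looks like in practice, with "déjà vu" as an arbitrary example string:

```python
cp1252_bytes = "déjà vu".encode("cp1252")   # b'd\xe9j\xe0 vu'

try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)   # the lone 0xE9 byte is not a valid UTF-8 sequence

# The reverse direction doesn't raise an error, but produces mojibake:
print("déjà vu".encode("utf-8").decode("cp1252"))   # dÃ©jÃ  vu
```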
The only exception is text that contains nothing but ASCII characters. Both UTF-8 and Windows-1252 are ASCII-compatible, so they produce the same bit patterns for ASCII characters -- but in that case you might as well declare the whole thing as ASCII in the first place.
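The ASCII case, again as a quick Python check:

```python
# For pure ASCII text, all three encodings produce identical bytes:
s = "plain ASCII text"
assert s.encode("ascii") == s.encode("cp1252") == s.encode("utf-8")
```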
Long story short: the character sets Unicode and Windows-1252 are roughly compatible. The encodings UTF-8 and Windows-1252 are not. Confusing them leads to garbage output.
Fortunately, UTF-8 is well on its way to becoming the standard encoding on the internet, which should finally put an end to the encoding chaos. The only problem is that many crap tutorials still use "ANSI", many people still copy and paste their code from those tutorials -- and some eReaders just don't support Unicode.