While text is one of the simplest forms of data a computer can manipulate, it is also one of the most misunderstood: many competent computer scientists get confused by Unicode and the various encodings. One reason for this is the original sin of text processing: assuming that a character is a byte. This assumption is baked into many languages (C, Python) and into the minds of many programmers, and it is the cause of many bugs.
At its core, a computer does not know what a character is, it just manipulates numbers, so one has to build a convention that this number means that character. This is largely arbitrary: one encoding might decide that ‘M’ should be encoded as 0xD4. Alphabetical order, Baudot code, ASCII, or EBCDIC: choose your standard.
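Python's codec machinery makes the arbitrariness easy to see; cp500 is one of the EBCDIC variants shipped in its standard library, and it really does put ‘M’ at 0xD4:

```python
# The same character maps to different numbers under different conventions.
ascii_byte = "M".encode("ascii")   # ASCII says 'M' is 0x4D
ebcdic_byte = "M".encode("cp500")  # EBCDIC (code page 500) says 'M' is 0xD4
print(ascii_byte.hex(), ebcdic_byte.hex())  # 4d d4
```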
This is the reason why C library calls like fopen let you specify whether you want to open your file in text mode or binary mode (using the b flag in the mode string). The problem of converting between different binary encodings of text already existed at that point in time, but this has been all but forgotten, and nowadays, on Unix systems, there is no difference between binary and text mode.
EBCDIC and Baudot code faded away, and ASCII and its variants came to dominate. ASCII assigns each character a number in the 0x00–0x7F range, which is fine for the characters commonly used in the English language, and sucks for almost everything else. Also note that the way these 7-bit characters are stored on systems that work with 8-bit bytes is pretty wasteful: the 8th bit is just left blank. Technically, one could pack 8 ASCII characters into 7 bytes. Interestingly, among the low code points, only a handful (tab 0x09, line feed 0x0A, carriage return 0x0D) are actually used; all the others were not meant to represent text, but to control a teletype, and have died out.
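A quick Python sketch of these two observations, the 7-bit range and the few surviving control codes:

```python
# Every ASCII character fits in 7 bits, i.e. its code is at most 0x7F.
assert all(ord(c) <= 0x7F for c in "Hello, World!")

# Of the low control codes, only a few still occur in real text.
print(repr(chr(0x09)), repr(chr(0x0A)), repr(chr(0x0D)))  # tab, LF, CR
```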
So Unicode was invented. It assigns each character (for some definition of character) a number, called a code point, which currently goes up to 0x10FFFF. This settled the question of which character has which number. The problem now is: how to map a sequence of numbers that can be pretty large to a bunch of bytes. Basically, this is about choosing an encoding.
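In Python 3, strings are sequences of code points, so ord and chr expose the character-to-number mapping directly:

```python
# A code point is just a number; Unicode currently tops out at 0x10FFFF.
assert ord("A") == 0x41      # same number as in ASCII
assert ord("€") == 0x20AC    # too big for a single byte
assert chr(0x1F600) == "😀"  # from the supplementary planes
print(hex(ord("😀")))        # 0x1f600
```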
One way to solve the problem is to reproduce the original sin of ASCII: just decide that some range is good enough, and move on. For instance, Latin-1 defines that code points in the Unicode range 0x00–0xFF are mapped to the equivalent bytes; this range contains most of the Western European characters. Done. This had the advantage that all valid ASCII-encoded text is also valid ISO Latin-1 text.
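Both properties are easy to demonstrate with Python's latin-1 codec:

```python
# Latin-1 maps code points 0x00–0xFF straight to single bytes...
assert "é".encode("latin-1") == b"\xe9"
# ...so ASCII text produces the same bytes under both codecs.
assert "plain ascii".encode("latin-1") == "plain ascii".encode("ascii")
# Anything outside the range simply cannot be represented.
try:
    "€".encode("latin-1")
except UnicodeEncodeError:
    print("'€' (0x20AC) does not fit in Latin-1")
```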
UCS-2 took the same approach, but at a greater scale: all code points in the 0x0000–0xFFFF range were mapped to the equivalent pairs of bytes, which meant, of course, that there would be two variants of UCS-2: big-endian and little-endian. A special character called the byte order mark (BOM) was added to detect which was which.
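Python has no ucs-2 codec, but for code points below 0x10000 UTF-16 produces the same two-byte units, so it serves to illustrate both the endianness problem and the BOM:

```python
# The same code point, as two bytes in either order.
assert "A".encode("utf-16-be") == b"\x00A"  # big-endian: high byte first
assert "A".encode("utf-16-le") == b"A\x00"  # little-endian: low byte first
# The plain "utf-16" codec prepends a BOM so a reader can tell them apart.
assert "A".encode("utf-16").startswith((b"\xff\xfe", b"\xfe\xff"))
```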
UTF-8 is probably the smartest encoding of code points into bytes. It is a variable-length encoding with nice properties: values in the ASCII range are encoded as such, i.e. the encoding is compatible with ASCII, and all code points from 0x80 upwards are encoded as multi-byte sequences. The format of each byte makes it possible to determine its position in a multi-byte sequence. Note that UTF-8 is less efficient than Latin-1 for European texts (non-ASCII characters require two bytes instead of one), and pretty bad for Asian text (3 bytes per character instead of 2). UTF-8 also makes text manipulation difficult, as the byte position of each character in the stream depends on the nature of the previous characters.
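A sketch of these properties in Python, checking both the byte counts and the self-describing byte format:

```python
# One code point, one to three bytes here, depending on its value.
assert len("A".encode("utf-8")) == 1   # ASCII range: unchanged
assert len("é".encode("utf-8")) == 2   # Latin-1 range: two bytes
assert len("中".encode("utf-8")) == 3  # most CJK: three bytes

# Lead bytes start 0b110... or 0b1110..., continuation bytes 0b10...,
# so any byte reveals its position in a multi-byte sequence.
first, *rest = "中".encode("utf-8")
assert first >> 5 == 0b110 or first >> 4 == 0b1110
assert all(b >> 6 == 0b10 for b in rest)
```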
UTF-16 extends UCS-2: code points above 0xFFFF are represented using a special encoding on two 16-bit values, called surrogate pairs. This means that code points from 0x10000 upwards are encoded on 32 bits.
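The surrogate arithmetic itself is simple; the helper below is hypothetical, written only to show the mechanism, and the last line checks it against Python's real utf-16-be codec:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into a high/low surrogate pair."""
    assert cp > 0xFFFF
    cp -= 0x10000                                  # 20 bits remain
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

high, low = to_surrogates(0x10000)
assert (high, low) == (0xD800, 0xDC00)
# The real codec agrees: four bytes for one code point.
assert "\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00"
```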
The main source of confusion I have observed with computer scientists is that they mix up string objects in the memory of a computer and their byte representation. It is quite common for a program to use the language’s representation of strings in memory and a different byte representation when doing I/O on that data. Also, having a bunch of bytes of type text does not tell you in which representation said text is: on a web page or in an e-mail there are headers that specify this, but not with the typical operating system files.
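The classic symptom of this mix-up is mojibake: bytes written in one encoding and decoded in another produce valid but wrong text, with no error raised.

```python
# Write as UTF-8, read back assuming Latin-1: no exception, wrong result.
data = "é".encode("utf-8")               # two bytes: 0xC3 0xA9
assert data.decode("latin-1") == "Ã©"    # mojibake, silently accepted
assert data.decode("utf-8") == "é"       # correct only with the right codec
```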