Unicode
Unicode is a standard for text encoding.
It defines a mapping of integers (code-points) to characters across many languages and scripts.
Various text-encodings alter how that integer is divided across one or more bytes,
but regardless of its byte layout, the number assigned to each character is constant.
For example, UTF-8 uses the leading 1-5 bits of each byte to indicate what kind of byte it is and whether the character spans multiple bytes.
The remaining bits are assembled into one integer, which identifies the code-point/character.
UTF-1, UTF-7, UTF-8, UTF-16, and UTF-32 all map to the same character set defined by Unicode.
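For example (a quick sketch using Python's standard codecs), the euro sign keeps the same code-point no matter which encoding carries it:

    char = "€"
    print(ord(char))                 # 8364 (U+20AC) -- the code-point never changes

    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        data = char.encode(codec)    # the byte layout differs per encoding...
        print(codec, data.hex(), data.decode(codec) == char)
    # utf-8      e282ac    True
    # utf-16-be  20ac      True
    # utf-32-be  000020ac  True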
Documentation
wikipedia: unicode    https://en.wikipedia.org/wiki/Unicode
wikipedia: utf-8      https://en.wikipedia.org/wiki/UTF-8
Encodings
UTF-1
UTF-7
UTF-8
Predominantly used on the internet.
- backward compatible with ASCII
- characters may span multiple bytes (1-4 bytes per character; see the sketch below)
- UTF-8 is extensible: the bit scheme leaves room for 5-byte sequences if more code points are ever needed (the current standard caps sequences at 4 bytes)
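These points are easy to check with Python's built-in codecs; a small sketch (the specific characters are only illustrative):

    # 'a' is plain ascii: utf-8 keeps the identical single byte
    assert "a".encode("utf-8") == "a".encode("ascii")

    # other characters take 2-4 bytes
    for char in ("a", "é", "€", "😀"):
        encoded = char.encode("utf-8")
        print(char, len(encoded), encoded.hex())
    # a 1 61
    # é 2 c3a9
    # € 3 e282ac
    # 😀 4 f09f9880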
How UTF-8 lays out its bytes:

    # ascii only uses 7 bits; the leading bit is always 0
    01111111

    # utf-8 puts a '1' in the leading bit to indicate a multi-byte character
    1.......

    # the number of leading '1's in the first byte of a multi-byte character
    # indicates how many bytes (including this one) make up the character
    # (this one starts a 3-byte character, so 2 continuation bytes follow)
    1110....

    # each following byte has the prefix '10', marking it as a 'continuation byte'
    10......

    # a byte beginning with '0' is plain ascii, and no following bytes are required
    0.......

When decoding a multi-byte character, the payload bits of all its bytes are concatenated into one binary integer: the code-point.
    # example: the euro sign '€' (code-point U+20AC) is a 3-byte sequence
    #  byte-1       byte-2       byte-3
    [1110]0010   [10]000010   [10]101100

    # combine the payload bits to the binary number
    0010 000010 101100
    0010000010101100

    # in decimal
    8364
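The same decoding can be sketched in a few lines of Python (just the bit-twiddling for this 3-byte case, not a full validator):

    data = bytes([0b11100010, 0b10000010, 0b10101100])      # the three bytes above

    lead = data[0]
    n_bytes = 8 - ((~lead) & 0xFF).bit_length()              # count of leading 1s -> 3 bytes total
    assert n_bytes == len(data)

    code_point = lead & 0b00001111                           # payload of the lead byte (after '1110')
    for byte in data[1:]:
        assert byte >> 6 == 0b10                             # continuation bytes start with '10'
        code_point = (code_point << 6) | (byte & 0b00111111)

    print(code_point, hex(code_point), chr(code_point))      # 8364 0x20ac €
    assert chr(code_point) == data.decode("utf-8")           # matches python's own decoder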
UTF-16
Used by Windows.
UTF-32
UTF-32 always uses 4 bytes for every character.
Like ASCII, it is a fixed-width encoding: each code-point maps directly to a single 4-byte value.
    # code-point for 'd' is 100
    # in binary, this is 0b1100100
    # in UTF-32 this is
    #  [byte-1]    [byte-2]    [byte-3]    [byte-4]
    0b00000000  0b00000000  0b00000000  0b01100100
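The layout above is the big-endian form; a quick check with Python's utf-32-be codec (the plain 'utf-32' codec would also prepend a byte-order mark):

    print(ord("d"))                                 # 100 -- the code-point
    print("d".encode("utf-32-be").hex())            # 00000064 -- always 4 bytes per character
    assert len("dog".encode("utf-32-be")) == 12     # 3 characters x 4 bytes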