Unicode: Difference between revisions
{| class="wikitable"
|-
| wikipedia: unicode || https://en.wikipedia.org/wiki/Unicode
|-
| wikipedia: utf-8 || https://en.wikipedia.org/wiki/UTF-8
|-
| Unicode in C++ (video) || https://www.youtube.com/watch?v=tOHnXt3Ycfo
|-
|}
</blockquote><!-- Documentation -->
= Tutorials =
<blockquote>
{| class="wikitable"
|-
| tl;dr unicode || https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
|}
</blockquote><!-- Tutorials -->
= Endianness =
<blockquote>
UTF-16 and UTF-32 use units that are wider than one byte,<br>
so we need to know the order in which to read the bytes.

A file or text-stream may begin with a '''byte order mark''' (the code-point U+FEFF),<br>
which indicates whether the data is big endian or little endian.<br>
The mark is optional. In UTF-16 it occupies the first 2 bytes.

For more details, see [[endianness]].

<syntaxhighlight lang="yaml">
254 255  # (0xFE 0xFF) big endian
255 254  # (0xFF 0xFE) little endian
</syntaxhighlight>
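A minimal Python sketch of the check described above. The function name <code>detect_utf16_byte_order</code> is made up for illustration; the two constants come from Python's standard <code>codecs</code> module. It only looks at the two UTF-16 marks, so a little endian UTF-32 stream (whose mark also starts with 255 254) would be misreported.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' from a leading UTF-16 byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 254 255 (0xFE 0xFF)
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 255 254 (0xFF 0xFE)
        return "little"
    return "unknown"  # no mark: the byte order must be known some other way


print(detect_utf16_byte_order(bytes([254, 255, 0, 100])))  # big
print(detect_utf16_byte_order(bytes([255, 254, 100, 0])))  # little
```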
</blockquote><!-- Endianness -->
= Encodings =
<blockquote>
== UTF-1 ==
<blockquote>
</blockquote><!-- UTF-1 -->
== UTF-7 ==
<blockquote>
Proposed as a 7-bit-safe alternative to UTF-8 for transports such as email. It is infrequently used, and is disallowed by [[HTML|HTML5]].<br>
See [https://en.wikipedia.org/wiki/UTF-7 wikipedia].
</blockquote><!-- UTF-7 -->
== UTF-8 ==
<blockquote>
Predominantly used on the internet.
* the smallest unit is 8-bits
* backward compatible with [[ASCII]]
* a character may use multiple bytes (1-4 bytes)
* the original design allowed 5 and 6 byte sequences, but the current standard (RFC 3629) caps UTF-8 at 4 bytes, which covers every code-point up to U+10FFFF
* byte-order is ''NOT'' determined by the system's [[endianness]]
<syntaxhighlight lang="bash">
01111111  # ascii only uses 7-bits, the leading bit is always 0
1.......  # utf-8 puts a '1' in the leading bit to mark a byte that is part of a multi-byte character
# the number of leading '1's in the first byte of a multi-byte character
# indicates the total number of bytes used to represent the character
# (this one is 3 bytes long: itself, plus 2 following bytes)
1110....
# each following byte will have the prefix '10', indicating it is a 'continuation byte'
10......
# a byte beginning with '0' indicates this is ascii, and no following bytes are required
0.......
</syntaxhighlight>
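The prefix rules can be sketched in Python as a small classifier. The function name <code>utf8_sequence_length</code> is hypothetical; it only inspects the first byte and does not validate the bytes that follow.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total number of bytes in the UTF-8 sequence starting with first_byte."""
    if (first_byte & 0b10000000) == 0b00000000:  # 0.......  ascii
        return 1
    if (first_byte & 0b11100000) == 0b11000000:  # 110.....  2-byte sequence
        return 2
    if (first_byte & 0b11110000) == 0b11100000:  # 1110....  3-byte sequence
        return 3
    if (first_byte & 0b11111000) == 0b11110000:  # 11110...  4-byte sequence
        return 4
    # 10...... is a continuation byte, anything else is not valid as a first byte
    raise ValueError(f"not a valid first byte: {first_byte:#010b}")


print(utf8_sequence_length(0b01100100))  # 1
print(utf8_sequence_length(0b11101001))  # 3
```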
When looking up a multibyte character, the prefix bits are dropped and the remaining bits of every byte are combined into one integer: the code-point.
<syntaxhighlight lang="bash">
# byte-1      byte-2      byte-3
[1110]1001  [10]100100  [10]010010
# combine to the binary number
1001 100100 010010
1001100100010010
# in decimal (U+9912)
39186
</syntaxhighlight>
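A rough Python sketch of the bit-combining step, for illustration only. The function name <code>decode_utf8_char</code> is hypothetical, and the sketch skips checks that a real decoder needs (overlong encodings, surrogates, values above U+10FFFF). In practice <code>bytes.decode("utf-8")</code> does all of this.

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode the first UTF-8 character in data and return its code-point."""
    first = data[0]
    if first < 0b10000000:  # leading bit is 0: plain ascii
        return first

    # Count the leading '1' bits of the first byte: this is the sequence length.
    length = 0
    mask = 0b10000000
    while first & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 4 or len(data) < length:
        raise ValueError("invalid or truncated UTF-8 sequence")

    # Keep only the bits below the prefix in the first byte.
    code_point = first & (mask - 1)
    for byte in data[1:length]:
        if (byte & 0b11000000) != 0b10000000:  # must carry the '10' prefix
            raise ValueError("expected a continuation byte")
        # Append the 6 payload bits of each continuation byte.
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point


sample = bytes([0b11101001, 0b10100100, 0b10010010])
print(decode_utf8_char(sample))  # 39186
```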
</blockquote><!-- UTF-8 -->
== UTF-16 ==
<blockquote>
Predominantly used by Windows.
* the smallest unit is 16-bits
* all characters use either 2-bytes, or 4-bytes (code-points above U+FFFF are split across two 16-bit units, called a surrogate pair)
* not compatible with [[ASCII]] - it must be re-encoded
* byte order determined by [[endianness]]
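The 2-byte versus 4-byte split can be sketched in Python as follows. The function name <code>to_utf16_units</code> is hypothetical, and the sketch does not reject code-points inside the reserved surrogate range (U+D800 to U+DFFF).

```python
def to_utf16_units(code_point: int) -> list:
    """Return the 16-bit unit(s) that represent code_point in UTF-16."""
    if code_point < 0x10000:
        return [code_point]  # fits in a single 16-bit unit (2 bytes)
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the second unit
    return [high, low]                   # two units (4 bytes)


print([hex(unit) for unit in to_utf16_units(ord("d"))])   # ['0x64']
print([hex(unit) for unit in to_utf16_units(0x1F600)])    # ['0xd83d', '0xde00']
```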
</blockquote><!-- UTF-16 -->
== UTF-32 ==
<blockquote>
UTF-32 always uses 4-bytes for every code-point.<br>
Like [[ASCII]], each unit maps 1:1 to a code-point, so no prefix bits or multi-unit sequences are needed.
* the smallest unit is 32-bits
* byte order determined by [[endianness]]
<syntaxhighlight lang="python">
# code-point for 'd' is 100
# in binary, this is
0b1100100
# in UTF-32 (big endian) this is
#   [byte-1]    [byte-2]    [byte-3]    [byte-4]
# 0b00000000  0b00000000  0b00000000  0b01100100
</syntaxhighlight>
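Because UTF-32 is just the code-point padded to 4 bytes, it can be reproduced in Python with <code>int.to_bytes</code>. The helper name <code>utf32_bytes</code> is hypothetical, and no byte order mark is written.

```python
def utf32_bytes(char: str, byteorder: str = "big") -> bytes:
    """Return the 4 bytes that represent a single character in UTF-32."""
    code_point = ord(char)                     # e.g. 100 for 'd'
    return code_point.to_bytes(4, byteorder)   # pad to 4 bytes in the chosen order


print(list(utf32_bytes("d", "big")))     # [0, 0, 0, 100]
print(list(utf32_bytes("d", "little")))  # [100, 0, 0, 0]
```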
</blockquote><!-- UTF-32 -->
</blockquote><!-- Encodings -->
Latest revision as of 15:01, 8 August 2022
Unicode is a standard for text encoding.
It defines a mapping of integers (code-points) to characters in various languages.
Various text-encodings alter how the integer is divided across byte(s),
but regardless of its composition, the assigned number/character is constant.
For example, UTF-8 uses the first 1-5 bits of a byte to indicate the type of byte, and whether the number spans multiple bytes.
The remaining bits are assembled into one integer that refers to a code-point/character.
UTF-1,7,8,16,32 all map to the same character set defined by unicode.
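As a small illustration in Python, the same code-point yields different bytes under each encoding, but decodes back to the same character. The helper name <code>encode_all</code> is hypothetical; the encoding names are standard Python codec names.

```python
def encode_all(char: str) -> dict:
    """Encode one character with three encodings and return {encoding name: bytes}."""
    return {name: char.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


char = "\u00e9"   # the character 'é'
print(ord(char))  # 233, the code-point is the same whatever the encoding
for name, encoded in encode_all(char).items():
    # The bytes differ per encoding, but each decodes back to the same character.
    assert encoded.decode(name) == char
    print(name, list(encoded))
```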