Unicode

Unicode is a standard for text encoding.
It defines a mapping of integers (code points) to characters in various languages.
Text encodings differ in how that integer is divided across bytes,
but regardless of its byte layout, the assigned number/character is constant.

For example, UTF-8 uses the first 1-5 bits of a byte to indicate the type of byte, and whether the code point spans multiple bytes.
The remaining bits are assembled into one large integer, which refers to a code point/character.

UTF-1, UTF-7, UTF-8, UTF-16, and UTF-32 all map to the same character set defined by Unicode.
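To make that concrete, here is a quick Python sketch (Python is only used for illustration here; nothing in these notes depends on it): the byte layout changes per encoding, but every encoding decodes back to the same code point.
<syntaxhighlight lang="python">
# The euro sign has a single fixed code point: U+20AC (8364 decimal).
char = "\u20ac"
print(hex(ord(char)))                      # 0x20ac

# Each encoding divides that same integer across bytes differently.
print(char.encode("utf-8").hex(" "))       # e2 82 ac       (3 bytes)
print(char.encode("utf-16-le").hex(" "))   # ac 20          (2 bytes)
print(char.encode("utf-32-le").hex(" "))   # ac 20 00 00    (4 bytes)

# Decoding any of them recovers the same code point.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    assert ord(char.encode(enc).decode(enc)) == 0x20AC
</syntaxhighlight>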
= Documentation =
<blockquote>
{|
| wikipedia: unicode || https://en.wikipedia.org/wiki/Unicode
|-
| wikipedia: utf-8 || https://en.wikipedia.org/wiki/UTF-8
|}
</blockquote><!-- Documentation -->
= Encodings =
<blockquote>
== UTF-1 ==
<blockquote>
</blockquote><!-- UTF-1 -->
== UTF-7 ==
<blockquote>
</blockquote><!-- UTF-7 -->
== UTF-8 ==
<blockquote>
Predominantly used on the internet (since it is backwards compatible with ASCII, which much of the internet was written to support).
* backwards compatible with [[ASCII]]
* characters are encoded with a variable number of bytes (1-4)
* UTF-8 is extensible; the original design allowed sequences of up to 6 bytes, though it is currently restricted to 4
<syntaxhighlight lang="bash">
01111111  # ascii only uses 7 bits, so the leading bit is always 0
1.......  # utf-8 puts a '1' in the leading bit to indicate a multi-byte sequence
# the number of leading '1's in the first byte of a multi-byte sequence
# indicates the total number of bytes used to represent the character
# (this one starts a 3-byte sequence, so 2 continuation bytes follow)
1110....
# each following byte will have the prefix '10', indicating it is a 'continuation byte'
10......
# a byte beginning with '0' indicates this is ascii, and no following bytes are required
0.......
</syntaxhighlight>
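These prefix rules can be checked directly. The following is a small Python sketch (my own illustration, not part of any spec or tool named here) that classifies a byte by its leading bits:
<syntaxhighlight lang="python">
def classify(byte: int) -> str:
    """Classify a single UTF-8 byte by its leading bits."""
    if byte & 0b10000000 == 0b00000000:
        return "ascii (1-byte sequence)"
    if byte & 0b11000000 == 0b10000000:
        return "continuation byte"
    if byte & 0b11100000 == 0b11000000:
        return "leading byte of a 2-byte sequence"
    if byte & 0b11110000 == 0b11100000:
        return "leading byte of a 3-byte sequence"
    if byte & 0b11111000 == 0b11110000:
        return "leading byte of a 4-byte sequence"
    return "invalid in UTF-8"

for b in "€".encode("utf-8"):   # e2 82 ac
    print(f"{b:08b}  {classify(b)}")
</syntaxhighlight>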
To decode a multi-byte character, the prefix bits are stripped and the remaining payload bits of every byte are concatenated into one binary integer: the code point.
<syntaxhighlight lang="bash">
#  byte-1      byte-2      byte-3      byte-4
[11110]000  [10]011111  [10]011000  [10]000000
# strip the prefixes and concatenate the payload bits
000 011111 011000 000000
000011111011000000000
# in decimal
128512  # code point U+1F600
</syntaxhighlight>
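The same bit-twiddling can be spelled out in Python. This is only a sketch of the idea; real code would simply call bytes.decode("utf-8"):
<syntaxhighlight lang="python">
def decode_one(data: bytes) -> int:
    """Manually decode the first UTF-8 sequence in 'data' to a code point."""
    first = data[0]
    if first < 0b10000000:                      # 0.......  plain ascii
        return first
    # count the leading 1s to get the total sequence length (2, 3 or 4)
    length = 8 - (first ^ 0xFF).bit_length()
    codepoint = first & (0b01111111 >> length)  # keep the payload bits of byte-1
    for byte in data[1:length]:
        codepoint = (codepoint << 6) | (byte & 0b00111111)  # append 6 payload bits
    return codepoint

raw = "😀".encode("utf-8")                      # f0 9f 98 80
print(hex(decode_one(raw)))                     # 0x1f600
assert decode_one(raw) == ord("😀")
</syntaxhighlight>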
</blockquote><!-- UTF-8 -->
== UTF-16 ==
<blockquote>
Used internally by Windows.
</blockquote><!-- UTF-16 -->
== UTF-32 ==
<blockquote>
</blockquote><!-- UTF-32 -->
</blockquote><!-- Encodings -->
