Latest revision as of 15:01, 8 August 2022

Unicode is a standard for text encoding.
It defines a mapping from integers (code points) to characters across many writing systems.
The various text encodings differ in how that integer is split across one or more bytes,
but regardless of the byte layout, the number assigned to each character stays the same.

For example, UTF-8 uses the first 1-5 bits of each byte to mark the type of byte and whether the character spans multiple bytes.
The remaining bits are concatenated into a single integer, which is the character's code point.

UTF-1, UTF-7, UTF-8, UTF-16, and UTF-32 all map to the same character set defined by Unicode.
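
As a small illustration of this point, the Python sketch below encodes one character with three of these encodings and checks that each byte string decodes back to the same code point. The sample character (U+00E9) is an arbitrary choice made for this example.

```python
# One code point, several byte-level encodings.
ch = "\u00e9"                 # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)          # the integer Unicode assigns to this character

encoded = {
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),
    "utf-32-be": ch.encode("utf-32-be"),
}

for name, raw in encoded.items():
    # The bytes differ per encoding, but the decoded code point does not.
    assert ord(raw.decode(name)) == code_point
    print(f"{name:10} {raw.hex()}")
```

The explicit `-be` codec names are used so that Python does not prepend a byte order mark to the output.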

Documentation

wikipedia: unicode https://en.wikipedia.org/wiki/Unicode

wikipedia: utf-8 https://en.wikipedia.org/wiki/UTF-8

Unicode in C++ (video) https://www.youtube.com/watch?v=tOHnXt3Ycfo

Tutorials

tl;dr unicode https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Endianness

Since most Unicode encodings use more than one byte per character,
we need to know the order in which to read the bytes.
A UTF-16 file or text stream may begin with a 2-byte byte order mark (BOM),
which indicates whether the data is big endian or little endian.
The BOM is optional, so its absence does not tell you the byte order.
For more details, see endianness.
254 255  # (0xFEFF) big endian
255 254  # (0xFFFE) little endian
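
A minimal sketch of BOM detection in Python, assuming the input is UTF-16. The function name `detect_utf16_byte_order` is made up for this example; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are standard-library constants. It deliberately ignores UTF-32, whose little-endian BOM also begins with 255 254.

```python
import codecs

def detect_utf16_byte_order(data):
    """Return 'big', 'little', or 'unknown' based on a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes 254 255
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes 255 254
        return "little"
    return "unknown"  # no BOM: the byte order must be known some other way

print(detect_utf16_byte_order(b"\xfe\xff\x00d"))
print(detect_utf16_byte_order(b"\xff\xfed\x00"))
print(detect_utf16_byte_order(b"d"))
```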

Encodings

UTF-1

UTF-7

A 7-bit (ASCII-only) encoding, proposed as an alternative to UTF-8 for email. It is rarely used, and HTML5 forbids it.
See wikipedia: https://en.wikipedia.org/wiki/UTF-7

UTF-8

The predominant encoding on the internet.

the smallest unit is 8 bits

backward compatible with ASCII

each character uses 1-4 bytes

the bit pattern could be extended to longer sequences (the original design allowed up to 6 bytes), but the current standard (RFC 3629) caps it at 4 bytes

byte order is NOT determined by the system's endianness
01111111  # ascii only uses 7-bits, the leading bit is always 0
1.......  # utf-8 puts a '1' in the leading bit of every byte that belongs to a multi-byte character

# the number of leading '1's in the first byte of a multi-byte character
# indicates how many bytes in total are used to represent the character
# (this one starts a 3-byte character, so 2x bytes come after it)
1110....

# each following byte will have the prefix '10', indicating it is a 'continuation byte'
10......

# a byte beginning with '0' indicates this is ascii, and no following bytes are required
0.......
When decoding a multi-byte character, the payload bits of all its bytes are concatenated into one binary integer.
#  byte-1      byte-2      byte-3
[1110]1001  [10]100100  [10]010010

# combine to the binary number
1001 100100 010010
1001100100010010

# in decimal (code point U+9912)
39186
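
The bit assembly described above can be sketched in Python as follows. `decode_utf8_sequence` is a hypothetical helper written for illustration: it handles a single sequence of 1-4 bytes and skips checks a real decoder needs (overlong forms, surrogates, values above U+10FFFF). The sample input is the three-byte sequence 11101001 10100100 10010010.

```python
def decode_utf8_sequence(seq):
    """Decode one UTF-8 sequence (1-4 bytes) into its code point."""
    first = seq[0]
    if first < 0b10000000:            # 0xxxxxxx: ASCII, single byte
        length, value = 1, first
    elif first >> 5 == 0b110:         # 110xxxxx: 2-byte sequence
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:        # 1110xxxx: 3-byte sequence
        length, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:       # 11110xxx: 4-byte sequence
        length, value = 4, first & 0b00000111
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:         # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0b00111111)  # append 6 payload bits
    return value

raw = bytes([0b11101001, 0b10100100, 0b10010010])
code_point = decode_utf8_sequence(raw)
print(code_point, hex(code_point))

# Cross-check against Python's built-in decoder.
assert chr(code_point) == raw.decode("utf-8")
```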

UTF-16

Predominantly used by Windows.

the smallest unit is 16 bits

every character uses either 2 bytes or 4 bytes (a surrogate pair)

not compatible with ASCII: ASCII text must be re-encoded

byte order determined by endianness
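
A rough Python sketch of the 2-byte versus 4-byte behaviour and of the effect of byte order. `utf16_code_units` is a hypothetical helper; U+1F600 is an arbitrary character outside the 16-bit range, chosen only for illustration.

```python
import struct

def utf16_code_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    raw = ch.encode("utf-16-be")   # explicit byte order, so no BOM is prepended
    return list(struct.unpack(f">{len(raw) // 2}H", raw))

print([hex(u) for u in utf16_code_units("d")])           # one unit (2 bytes)
print([hex(u) for u in utf16_code_units("\U0001F600")])  # two units (4 bytes)

# The same code unit, stored in each byte order.
print("d".encode("utf-16-be").hex())
print("d".encode("utf-16-le").hex())
```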

UTF-32

UTF-32 always uses 4 bytes for every code point.
Like ASCII, each stored unit maps 1:1 to a code point.

the smallest unit is 32-bits

byte order determined by endianness
# code-point for 'd' is 100
# in binary, this is 
0b1100100

# in UTF-32 this is
# [byte-1]   [byte-2]   [byte-3]   [byte-4]
 0b00000000 0b00000000 0b00000000 0b01100100
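
The same layout can be reproduced in Python. This is a small sketch using `struct.pack` to write the code point as a single 32-bit integer in each byte order, then comparing against Python's built-in UTF-32 codecs; the variable names are only illustrative.

```python
import struct

code_point = ord("d")                     # 100
big = struct.pack(">I", code_point)       # one big-endian 32-bit unit
little = struct.pack("<I", code_point)    # one little-endian 32-bit unit

# Packing the raw integer gives the same bytes as the UTF-32 codecs.
assert big == "d".encode("utf-32-be")
assert little == "d".encode("utf-32-le")

# Show each byte of the big-endian form in binary.
print([format(b, "08b") for b in big])
```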

