Unicode


Unicode is an international standard that defines a universal character set for use in computers. It assigns every character a unique number, called a code point. Unicode is not itself a byte-level encoding; encodings such as UTF-8, UTF-16 and UTF-32 define how code points are stored. The standard is an attempt to catalogue all text characters in use around the world. These include the Greek, Cyrillic, Arabic, Hebrew and Thai scripts, as well as Japanese, Chinese and Korean characters. Mathematical and commercial symbols are also part of Unicode.
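To make the idea of a single catalogue of characters more concrete, the following minimal sketch looks up the number and official name that Unicode assigns to a few characters. It assumes Python 3 and its standard `unicodedata` module; the helper name `describe` is chosen only for this illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point (U+XXXX notation) and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# One character each from the Greek, Cyrillic and Hebrew scripts,
# plus a commercial symbol.
for ch in ["\u03b1", "\u042f", "\u05d0", "\u20ac"]:
    print(describe(ch))
```

Each character gets exactly one number, regardless of the script it belongs to.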

Word-processing programs and HTML documents on the Internet are examples of the practical use of Unicode. The Unicode code space has room for 1,114,112 code points (U+0000 to U+10FFFF). Recent versions of the standard assign roughly 150,000 of them to characters, which leaves most of the space in reserve. Besides Unicode, there are many other character sets that are incompatible with it and with one another.

Character sets and encodings

ASCII

The most basic character set on the Internet is ASCII (American Standard Code for Information Interchange). ASCII allows a maximum of 128 characters, since each character is encoded with 7 bits. It mainly contains the letters of the Latin alphabet as used in English, the Arabic numerals, punctuation marks and control characters. ASCII alone is not sufficient for most European languages because it cannot represent umlauts and accented letters. For Asian languages, ASCII is likewise unsuitable, since their characters cannot be represented at all.
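The 7-bit limit can be checked directly. The sketch below assumes Python 3 and its built-in `ascii` codec; the function name `is_ascii_encodable` is hypothetical and used only for illustration.

```python
def is_ascii_encodable(text: str) -> bool:
    """Return True if every character of `text` fits into 7-bit ASCII (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised for any character with a code point above 127.
        return False


print(is_ascii_encodable("Hello"))  # English letters only
print(is_ascii_encodable("Grüße"))  # contains an umlaut and a sharp s

# Every byte of ASCII-encoded text stays below 128, i.e. within 7 bits.
print(max("Hello".encode("ascii")) < 128)
```

The German word fails the check because ü and ß have no ASCII code.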

UTF-16

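The “Basic Multilingual Plane” (BMP) contains 65,536 code points and can be encoded with the “Universal Character Set 2” (UCS-2). The 2 in UCS-2 indicates that two bytes, i.e. 16 bits, are used for encoding each character. UCS-2 is often confused with UTF-16 (UCS Transformation Format, 16 bits), but the two are not identical: UTF-16 encodes BMP characters exactly as UCS-2 does and additionally represents characters outside the BMP as a pair of 16-bit units, a so-called surrogate pair. The first 256 characters of this character set match ISO 8859-1 (Latin-1) and cover the West-European languages.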
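In UTF-16, a character inside the BMP takes one 16-bit unit, while a character outside the BMP takes two (a surrogate pair). The following minimal sketch counts those units; it assumes Python 3 and its built-in `utf-16-be` codec, which writes no byte order mark. The helper `utf16_code_units` is a hypothetical name for this example.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for the given text."""
    return len(ch.encode("utf-16-be")) // 2


bmp_char = "\u00e4"            # ä, inside the Basic Multilingual Plane
supplementary = "\U00013000"   # an Egyptian hieroglyph, outside the BMP

print(utf16_code_units(bmp_char))       # 1
print(utf16_code_units(supplementary))  # 2

# The two units of the surrogate pair, shown as hexadecimal bytes.
print(supplementary.encode("utf-16-be").hex())  # d80cdc00
```

Software that treats every 16-bit unit as one character, as UCS-2 does, would miscount the hieroglyph as two characters.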

UTF-32

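A single 16-bit unit is insufficient for characters outside the BMP, such as ancient Egyptian hieroglyphs or rare Chinese characters. UTF-32 therefore encodes every character with a fixed 32 bits. Although 32 bits could distinguish 4,294,967,296 values, Unicode restricts code points to the range U+0000 to U+10FFFF, so only 1,114,112 of them are valid. The high storage requirement should be considered when using UTF-32: every character occupies four bytes, including those that would fit into a single byte.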
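The fixed width of UTF-32 and the upper bound of the Unicode code space can both be observed with a short sketch. It assumes Python 3 and the built-in `utf-32-be` codec (no byte order mark); `utf32_size` is a hypothetical helper name.

```python
import sys


def utf32_size(ch: str) -> int:
    """Return the number of bytes UTF-32 uses for the given text."""
    return len(ch.encode("utf-32-be"))


# An ASCII letter, an umlaut, a Chinese character and a hieroglyph.
for ch in ["A", "\u00e4", "\u4e2d", "\U00013000"]:
    print(f"U+{ord(ch):04X} -> {utf32_size(ch)} bytes")

# The highest code point Python accepts is the Unicode maximum.
print(hex(sys.maxunicode))  # 0x10ffff

try:
    chr(0x110000)  # one past the end of the Unicode code space
except ValueError as exc:
    print("rejected:", exc)
```

Every sample takes four bytes, which is why UTF-32 is simple to index but wasteful for text that is mostly ASCII.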

UTF-8

The most widely used Unicode encoding, particularly on the web, is UTF-8. It is a variable-length encoding that represents each Unicode character as a sequence of one to four 8-bit units (bytes). The first 128 characters are encoded in a single byte each and match ASCII, so any ASCII text is also valid UTF-8.
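The relationship between UTF-8 and ASCII, and the number of bytes UTF-8 needs per character, can be illustrated as follows. This is a minimal sketch assuming Python 3 and its built-in `utf-8` codec; `utf8_length` is a hypothetical helper name.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for the given text."""
    return len(ch.encode("utf-8"))


print(utf8_length("A"))           # 1 byte: ASCII range
print(utf8_length("\u00e4"))      # 2 bytes: ä
print(utf8_length("\u20ac"))      # 3 bytes: euro sign
print(utf8_length("\U00013000"))  # 4 bytes: Egyptian hieroglyph

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```

Common Latin text stays compact, while rarer characters use more bytes as needed.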

Current use

Today the Unicode standard is supported by many leading companies such as Apple, IBM, Microsoft and Hewlett-Packard, and it has become widely established. The cross-platform programming language Java and Microsoft's Windows NT family of operating systems work with Unicode internally. For usability and to reach as many users as possible, the UTF-8 encoding should be used. It can represent every Unicode character and needs little memory for text that consists mostly of ASCII characters. If only ASCII is used to encode text, umlauts and other non-English characters cannot be represented.
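The memory argument can be checked by encoding the same text in several ways. The sketch below assumes Python 3 and its built-in codecs; the little-endian variants are used so that no byte order mark is added. The function name `encoded_sizes` and the sample word are chosen only for illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Five characters, two of them outside ASCII.
print(encoded_sizes("Grüße"))  # {'utf-8': 7, 'utf-16': 10, 'utf-32': 20}
```

The saving depends on the text: for scripts whose characters take three bytes in UTF-8, such as Chinese, UTF-16 can be the more compact choice.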