Punycode


Punycode is an encoding scheme that converts Unicode strings into ASCII strings and back. Domain and host names are prepared and processed using the ASCII character set. For international domains, the IDNA (Internationalizing Domain Names in Applications) standard is used, which works around the limits of the ASCII character set. Punycode is an important part of this standard. The conversion is designed to be one-to-one, so that every encoded string decodes back to exactly the original string.

General information

Punycode addresses the problem that certain characters, such as umlauts, letters with diacritics, and letters that do not belong to the Latin alphabet, cannot be represented directly in internationalized domain names (IDNs). The original Domain Name System (DNS) did not provide for this situation. German, for example, uses the umlauts ä, ö, and ü, and diacritics such as the accent on the e in café are also common. Since the authorities responsible for name assignment in each country decide individually which characters may be used to register a domain, an Internet standard that works at the international level was necessary. Such standards were created with IDNA2003 and later IDNA2008.

Example

If a user enters bücher.com (Bücher is the German word for books) or café.com, the browser reads that string and converts it using the Punycode syntax. “bücher.com” is converted to “xn--bcher-kva.com” and “café.com” to “xn--caf-dma.com”. A browser that supports IDNA can interpret these two strings and resolve them to the correct address. Otherwise, it displays an error message or a blank page.
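The conversion in this example can be reproduced with Python's built-in "idna" codec. This is only a minimal sketch: the helper names to_ascii and to_unicode are made up for illustration, and the built-in codec follows the older IDNA2003 rules, so a browser may behave differently for some names.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


for name in ["bücher.com", "café.com"]:
    encoded = to_ascii(name)
    # Decoding the encoded form gives back the name that was entered.
    print(name, "->", encoded, "->", to_unicode(encoded))
```

Each part of the name between the dots is converted separately, which is why the ASCII-only part “com” passes through unchanged.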

How it works

Every domain that contains Unicode characters is normalized as a first step. This is done by the client (browser or email client) using the Nameprep process.[1] Uppercase letters are converted to lowercase, and equivalent characters are mapped to a single form. The string “CAFé” is converted to “café”. With the latest version of the standard (IDNA2008), normalization is left to the user interface and is no longer part of the actual standard.
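The normalization step can be tried out with the nameprep function from Python's standard library module encodings.idna, which implements the IDNA2003 Nameprep profile. The sketch below is only an illustration; the variable names are assumptions, and the second input (an e followed by a combining accent) was chosen here to show that different spellings of the same character end up identical.

```python
from encodings.idna import nameprep

# Uppercase letters are folded to lowercase; the accented letter is kept.
upper = nameprep("CAFé")

# "e" followed by a combining acute accent (U+0301) is composed into
# the single character "é" (U+00E9), so both inputs become identical.
decomposed = nameprep("Cafe\u0301")

print(upper, decomposed, upper == decomposed)
```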

After normalization, all non-ASCII characters are removed from each part of the domain name (label), and a derived string is appended after a hyphen. This derived string encodes the position of each removed character and its Unicode code point. The encoded string is also given a prefix: every encoded label of an international domain name starts with “xn--”. This prefix was chosen because it is unlikely to conflict with existing ASCII names, since the sequence rarely occurs in natural languages.
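The parts of an encoded label described above can be made visible with Python's built-in "punycode" codec, which produces the encoded string without the prefix. This is a rough sketch; the function describe_label and the dictionary keys are hypothetical names used only for this illustration.

```python
def describe_label(label):
    """Split the Punycode form of one label into its visible parts."""
    encoded = label.encode("punycode").decode("ascii")
    # The last hyphen separates the kept ASCII characters from the
    # derived string that stores positions and code points.
    basic, _, extended = encoded.rpartition("-")
    return {
        "basic": basic,            # ASCII characters copied as they are
        "extended": extended,      # encoded non-ASCII characters
        "ace": "xn--" + encoded,   # full label with the prefix
    }


print(describe_label("bücher"))
print(describe_label("café"))
```

For “bücher” the ASCII characters “bcher” are kept, and the derived string “kva” records that an ü belongs at the second position.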

These steps are carried out by a more general algorithm called Bootstring. This algorithm is defined in RFC 3492, which belongs to the IDNA2003 standard. Punycode is effectively an instance of Bootstring, specifically tailored to the requirements of IDNA.
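To illustrate what “an instance of Bootstring” means, the following sketch implements the encoding direction with the parameter values that RFC 3492 specifies for Punycode. It is a simplified illustration rather than a reference implementation: the function names are chosen here, overflow checks are omitted because Python integers do not overflow, and normalization and the “xn--” prefix are left out.

```python
# Bootstring parameters that RFC 3492 fixes for Punycode.
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta, numpoints, first_time):
    """Bias adaptation: tunes the digit thresholds to the deltas seen so far."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def punycode_encode(label):
    """Encode one label (without the 'xn--' prefix)."""
    basic = [c for c in label if ord(c) < 128]
    output = basic[:]
    handled = basic_count = len(basic)
    if basic_count > 0:
        output.append("-")  # delimiter after the copied ASCII characters
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while handled < len(label):
        # Next code point to insert: the smallest one not yet handled.
        m = min(ord(c) for c in label if ord(c) >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for c in label:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                # Write delta as a variable-length integer in base 36.
                q, k = delta, BASE
                while True:
                    if k <= bias:
                        t = TMIN
                    elif k >= bias + TMAX:
                        t = TMAX
                    else:
                        t = k - bias
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, handled + 1, handled == basic_count)
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)


print(punycode_encode("bücher"))  # bcher-kva
```

The result can be compared with Python's built-in "punycode" codec, which should return the same string for the same label.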

Features

Currently there are two standards, IDNA2003 and IDNA2008. Their specifications differ, and not every browser supports the later version. In addition, the new standard excludes approximately 8000 characters that were permitted in the old standard.[2] This may cause previously valid domains to suddenly become invalid. These problems are expected to diminish as browsers increasingly support the standards.[3]
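One commonly cited difference between the two standards is the German letter ß, an example that is not taken from this article's sources. Python's built-in "idna" codec implements IDNA2003, which maps ß to “ss” during normalization, while IDNA2008 treats ß as a valid character of its own. The sketch below shows both outcomes; the second function only mimics the unmapped result with the raw "punycode" codec and is not an IDNA2008 implementation, and both function names are assumptions.

```python
def idna2003_ascii(domain):
    """ASCII form according to the built-in IDNA2003 codec."""
    return domain.encode("idna").decode("ascii")


def ace_without_mapping(label):
    """Punycode form of a single label with no normalization applied."""
    return "xn--" + label.encode("punycode").decode("ascii")


print(idna2003_ascii("straße.de"))    # ß is replaced by "ss"
print(ace_without_mapping("straße"))  # ß is kept and encoded
```

The same name entered by a user can therefore lead to two different ASCII domains, depending on which standard the client follows.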

Relevance to search engine optimization

Punycode is especially relevant to search engine optimization for the registration and linking of IDNs (umlaut domains). The Punycode notation of an umlaut domain can be registered with the German issuing authority (DENIC) under the menu item IDN. Google recommends specifying the Punycode version in Google Analytics or AdWords so that the domain address is referenced correctly.[4] Tools such as backlink checkers also need this notation because they cannot handle umlauts.
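When a tool only accepts the Punycode notation, the host part of a URL can be converted before it is entered. The following is a minimal sketch using Python's standard library; the function name to_ascii_url and the example URL are assumptions, it expects an absolute URL, it drops any user name or password in the URL, and it leaves non-ASCII characters in the path untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return the URL with its host name converted to Punycode notation."""
    parts = urlsplit(url)
    # Only the host name is converted; the built-in codec follows IDNA2003.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


print(to_ascii_url("https://www.bücher.de/katalog?x=1"))
```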

However, umlaut domains are not disadvantaged in organic search; only referencing with other tools is affected. Search engines usually work with international character sets and have more fonts available for specific countries. They handle umlauts using Punycode and show the correct results for the searches in question.[5]

Other aspects, such as competition and email programs, must also be considered. Anyone who registers a domain would usually also register the umlaut spelling so as not to give competitors the opportunity to create a similar project. Moreover, email programs often have trouble with umlauts and cannot process the corresponding email addresses.

References

  1. Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN) ietf.org. Accessed on 09/08/2015
  2. Internationalized Domain Names (IDN) FAQ unicode.org. Accessed on 09/08/2015
  3. Unicode IDNA Compatibility Processing unicode.org. Accessed on 09/08/2015
  4. In-page analysis and URLs with non-standard characters support.google.com. Accessed on 09/08/2015
  5. Internationalized Domains and SEO moz.com. Accessed on 09/08/2015
