Before 2003, all domain names consisted solely of ASCII characters, so domain names in scripts other than Latin were not possible. This severely limited the internationalization of the internet. Back in 1996, Martin Dürst1 first proposed an extension so that almost any Unicode character could be used in domain names.
It took seven years before the IETF finalized the first standard for Internationalized Domain Names in Applications (IDNA2003)2. All these new characters were introduced without breaking DNS and other existing protocols and applications by encoding them in ASCII using the "Bootstring" algorithm.
To explain punycode, a commonly used example is the domain name "bücher.example", containing the German word for books, "Bücher". Because characters like (00FC)"ü" are not present in ASCII, 'punycode' was introduced to convert strings containing Unicode characters into strings consisting only of ASCII characters. If non-ASCII characters are present, the string is first processed with nameprep3. Nameprep removes a set of characters4 and converts a set of capital letters to lowercase5. After that, the Unicode characters are normalized using Unicode normalization form KC6. There are many canonically equivalent Unicode characters and different ways they can be composed; the characters (212B)"Å" and (00C5)"Å", for example, are both normalized to the single character (00C5)"Å". Subsequently, the result is checked against a blacklist of characters7; if any prohibited character is found, the domain name is deemed invalid. Lastly, the string is checked for the presence of bidirectional characters, and if any are found, the whole string is validated for correct bidirectionality8. Then the actual encoding starts: all non-ASCII characters are removed from the label, resulting in "bcher", and because at least one character was removed, the remaining string gets a dash appended, resulting in "bcher-". Next, each removed character is converted into a code number representing both its insertion point and its code point. We need to skip one character before re-inserting the "ü", because it was the second character, and there are a total of six positions where the character could be inserted. Using a very clever formula, we multiply the character's code point minus 128 (the first code point beyond ASCII) by the total number of possible insertion points, and add the actual insertion index. We get the number 745 ((252 − 128) × 6 + 1), which can be converted in this case (I will not go into this process in detail) to "kva".
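The arithmetic above can be sketched in a few lines of Python. This is only the first-character case of the RFC 3492 encoder, not the full state machine; the variable names are mine:

```python
# Simplified delta calculation for the single non-ASCII character
# in "bücher", following the formula described above.
code_point = 0x00FC                 # "ü" = 252
basic = "bcher"                     # the ASCII characters that remain
insertion_points = len(basic) + 1   # 6 possible positions
index = 1                           # "ü" is re-inserted after "b"

delta = (code_point - 128) * insertion_points + index
print(delta)  # 745
```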
This is appended to the remaining string, and the result is prefixed with "xn--" to identify it as a punycode domain. We now have our result "xn--bcher-kva.example".
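The whole example can be verified in a Python interpreter, which ships with codecs for both bare Punycode (RFC 3492) and the full IDNA2003 pipeline:

```python
# Bare Punycode: only the character-insertion encoding, no "xn--" prefix.
print("bücher".encode("punycode"))      # b'bcher-kva'

# Full IDNA2003: nameprep + Punycode + ACE prefix, applied per DNS label.
print("bücher.example".encode("idna"))  # b'xn--bcher-kva.example'
```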
Since punycode's introduction back in 2003, some minor things have changed. A new IETF Working Group was formed and finalized a successor to IDNA2003, IDNA20089. Starting in 2009, top-level domains can also be expressed in punycode. (The fairly new TLD for Saudi Arabia, السعودية, is represented as xn--mgberp4a5d4ar.)
The problem with punycode
The earliest communication about possible abuse of punycode IDNs I could find was from January 2005, on the Firefox forum10, but the problem may already have been widely known. An example of spoofing the URL for "paypal.com" was provided as "pаypal.com", represented in punycode as "www.xn--pypal-4ve.com". The only difference between the two is that the first occurrence of the character (0061)"a" is replaced by (0430)"а". Can you spot the difference? How the characters render varies with the operating system and font you use.
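The spoof is easy to reproduce. The only assumption in the snippet below is that the replaced character is the Cyrillic а (U+0430), as described above:

```python
spoof = "p\u0430ypal"        # second character is Cyrillic а, not Latin a

print(spoof == "paypal")     # False: visually alike, different code points
print(spoof.encode("idna"))  # b'xn--pypal-4ve'
```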
The above attack, where one or more, but not all, characters are replaced, has since been solved in all major browsers by detecting mixed script sets according to UTS-3911 and, in that case, displaying the punycode representation of the domain name instead of its Unicode equivalent.
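A crude version of this mixed-script check can be approximated with the standard library. Python's `unicodedata` does not expose the Unicode Script property directly, so this sketch infers the script from the first word of each character's name; that works for the Latin/Cyrillic case but is not the real UTS-39 algorithm:

```python
import unicodedata

def scripts(label):
    """Rough per-label script detection via Unicode character names.
    A sketch only; real implementations use the Script property and
    UTS-39's restriction levels."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        found.add(name.split(" ")[0])  # e.g. "LATIN", "CYRILLIC"
    return found

# More than one script in a label -> suspicious, show punycode instead.
print(scripts("p\u0430ypal"))  # {'LATIN', 'CYRILLIC'}
print(scripts("paypal"))       # {'LATIN'}
```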
But domain names consisting entirely of characters from the same script are still displayed in that native script, and distinguishing between them is almost impossible. The best example is the real Apple website apple.com and its punycode counterpart аррӏе.com. Different browsers handle these domains very differently. In Chrome, IE and Edge, the URL is displayed as punycode both in the URL tooltip in the bottom-left corner and in the address bar, and in the future Chrome will warn about "lookalike URLs" (when testing in Canary version 75, I get a "Continue to apple.com" lookalike warning). In Firefox these URLs look identical, both when shown as a URL and in the address bar12.
There are a couple of possible solutions that could be implemented.
The first possibility is implementing a stricter nameprep algorithm that maps all homographs to a single character, so that it is no longer possible to register similar-looking domain names. But there are a couple of problems with that. Firstly, which characters would be "similar" enough to restrict, and which are distinct enough? Should we restrict "0" (zero) because "o" (the fifteenth letter) is too similar, for example? Secondly, which font do we use to determine the similarity between two glyphs? Then there's backwards compatibility: 16 years after it became possible to register homographs, it would be a nightmare to retroactively restrict homographic domain names, so this is neither a valid nor a realistic option.
Another possibility suggested is disabling punycode-to-unicode conversion for certain demographics and localities. People using only Latin characters encounter the most known homographs with Cyrillic, and the other way around. We could disable the conversion to Unicode for native English speakers, but what about people using Cyrillic languages who also use English as a second language? They would still be vulnerable. A more drastic approach proposed is disabling punycode completely, but that would break support for most languages in the world. Solving this problem only for Latin and Cyrillic is a very western-centric way of thinking about it anyway; other languages have a similar problem as well. In Japanese, the kanji (53E3)"口" is very similar to the katakana (30ED)"ロ", depending on the browser font. They belong to different scripts, but because Japanese legitimately mixes kanji and katakana, mixed-script detection cannot simply flag the combination, which makes this even harder to detect.
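The two Japanese characters really are distinct code points in distinct scripts, as a quick check with `unicodedata` shows:

```python
import unicodedata

kanji = "\u53E3"     # 口 ("mouth")
katakana = "\u30ED"  # ロ (the syllable "ro")

print(unicodedata.name(kanji))     # CJK UNIFIED IDEOGRAPH-53E3
print(unicodedata.name(katakana))  # KATAKANA LETTER RO
print(kanji == katakana)           # False, yet visually near-identical
```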
Another possibility is to collect all the Cyrillic characters in a label and then check whether they are all in a blacklist of confusables. The list consists, among others, of characters like (0430)"а", (0440)"р", (04CF)"ӏ" and (0435)"е". The homograph аррӏе.com above falls into this category, as all of its characters are in the blacklist. If, in addition, the TLD (Top Level Domain) is in ASCII, the domain name is shown as punycode. This solves the case of аррӏе.com, but a big problem is that it breaks Unicode display for a lot of legitimate domains as well. The top-level domain for the EU, .eu, is for example also used in Bulgaria, which uses Cyrillic characters. Still, this mitigation is why Chrome displays punycode now, and why it will display the warning in the current Canary version and future stable versions. It remains a very western-centric solution, though, and leaves the Japanese example above untouched.
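This whole-script-confusable policy can be sketched in a few lines. The four-character blacklist below is just the subset quoted above, not the real, much longer list used by UTS-39 and Chromium:

```python
# Tiny subset of Cyrillic characters confusable with Latin letters;
# the real confusables list is far more extensive.
CONFUSABLE_CYRILLIC = {"\u0430", "\u0440", "\u04CF", "\u0435"}

def show_as_punycode(label, tld):
    """Display-policy sketch: if every character of a Cyrillic label is
    a known Latin confusable AND the TLD is plain ASCII, fall back to
    the punycode form instead of the Unicode one."""
    all_confusable = all(ch in CONFUSABLE_CYRILLIC for ch in label)
    return all_confusable and tld.isascii()

print(show_as_punycode("\u0430\u0440\u0440\u04CF\u0435", "com"))  # True  (аррӏе)
print(show_as_punycode("\u043C\u043E\u0441\u043A\u0432\u0430", "com"))  # False (москва)
```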
As you can read above, this is not an easy problem to solve. There are a lot of complicating factors, and the particularities of every language and script have to be taken into account. I think the best solution is to introduce one or more "homograph-safe fonts", in which the glyph for every character is unique and sufficiently distinct from all other glyphs. This homograph-safe font would be used anywhere URLs are displayed in a browsing context, especially the address bar and the URL tooltip. It could also be used anywhere the security of a system depends on a user comparing one string against another. I have neither the time nor the expertise to develop such a font, but I would expect one or more of these to be created from existing fonts, with similar characters replaced by visually distinct ones. For Firefox, for example, the font used on Android is "Roboto"13, which is openly available on GitHub14. Where there are currently variants for "Regular", "Bold" and "Thin", a fourth could be added for "Homograph-safe". Because updating many fonts at once would be a big change requiring a lot of work, an intermediate solution, where a separate font is used across all platforms specifically for displaying URLs, could be adopted temporarily.
What do you think, is this the perfect compromise and solution?