Sorting Country Names in their Native Language

I used to think that sorting things was easy. It isn’t: collation is a really difficult problem, especially once you start mixing different scripts (Latin, Chinese, Cyrillic, etc.) and numeral systems (Western Arabic, Devanagari, Japanese, etc.) in the same list, not to mention locale-specific sorting irregularities like the German phonebook sort.

The problem of sorting country names is particularly sensitive.  When you want to display China as 中国 to Chinese speakers, where should it be sorted compared to Canada or Kâmpŭchea (Cambodia)?

Here we have an example list of countries, in the order I looked them up online, heh.  For the sake of not messing with my blog, I avoided Right-to-Left country names for Egypt, Iran, or Israel.

  • United States
  • España
  • 中国
  • Deutschland
  • Polska
  • Россия
  • भारत

If the software you’re using has a “sort” feature, hopefully it’s using the Unicode collation order to sort the names. You would typically get something like this as a result:

  1. Deutschland
  2. España
  3. Polska
  4. United States
  5. Россия
  6. भारत
  7. 中国
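
To see where that order comes from, here is a minimal sketch using nothing but Python's built-in sort. Note the assumption: Python compares strings by raw Unicode code point, which is not the full Unicode Collation Algorithm. It just happens to give the same result for this particular list. The second list (Oman, Österreich, Zimbabwe) is not from the example above; I made it up to show where the two approaches part ways.

```python
# Example list from the post, in the original lookup order.
countries = ["United States", "España", "中国", "Deutschland",
             "Polska", "Россия", "भारत"]

# Python's default string comparison orders by Unicode code point.
# This is NOT the Unicode Collation Algorithm; it only happens to agree here.
by_code_point = sorted(countries)
for position, name in enumerate(by_code_point, start=1):
    print(position, name)

# Where code point order goes wrong (made-up extra names):
# "Ö" (U+00D6) has a higher code point than "Z" (U+005A), so Österreich
# lands after Zimbabwe instead of next to the other O names.
tricky = sorted(["Zimbabwe", "\u00d6sterreich", "Oman"])
print(tricky)
```

For real collation you would want a proper UCA implementation (ICU is the usual suspect) rather than a bare `sorted()`.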

Business Case Sort

Alphabetical sorting, Unicode or otherwise, may seem pretty arbitrary, especially if 95% of your customers come from three or four countries.  One can always make the case for sorting country lists with the most popular countries dominating the “top 5” or so of the list *. For many businesses this may mean a sort order of:

  1. United States
  2. 中国
  3. Россия
  4. Deutschland
  5. España
  6. Polska
  7. भारत

That’s all good … if you want to confuse Indian and Polish visitors, don’t care about keyboard users, and want to take a big hit on your Russian branding.
* Instead of mucking with collation, a usable solution is to autodetect which country people are from and either preselect it in dropdown lists or highlight it as a choice outside of the sorted list.
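
For what it's worth, the "popular countries first" ordering is easy enough to sketch. Everything named here is hypothetical: the `NAMES` table, the `PINNED` list of top markets, and the `business_sort` helper are mine, and the remaining countries fall back to Python's plain code point sort rather than real collation.

```python
# Hypothetical data: native names keyed by lowercase two-letter country code.
NAMES = {
    "us": "United States",
    "es": "España",
    "cn": "中国",
    "de": "Deutschland",
    "pl": "Polska",
    "ru": "Россия",
    "in": "भारत",
}

# Hypothetical "top markets", most important first.
PINNED = ["us", "cn", "ru"]


def business_sort(names, pinned):
    """Return pinned countries first (in pinned order), then the rest sorted."""
    top = [names[code] for code in pinned if code in names]
    # Fallback order for everyone else: plain code point sort (an assumption).
    rest = sorted(name for code, name in names.items() if code not in pinned)
    return top + rest


print(business_sort(NAMES, PINNED))
```

With these inputs it reproduces the seven-item business order shown above.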

ISO to the Rescue

In my research, I’ve found that a pretty good general solution, irrespective of the business case, is to sort things according to the ISO 3166-1 alpha-2 code. I know, I know, it’s lame and old and under fire constantly ... but it’s a fairly standard coding that technical people understand, native speakers understand, keyboard access is alright, and it’s considered safe on the culture-war front (other than being based on the Latin alphabet).

Our example above would be:

  1. 中国 (cn)
  2. Deutschland (de)
  3. España (es)
  4. भारत (in)
  5. Polska (pl)
  6. Россия (ru)
  7. United States (us)
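
Here is a minimal sketch of that idea, assuming you keep the native names in a table keyed by ISO 3166-1 alpha-2 code. The `NAMES` table and the `iso_sort` helper are hypothetical names of mine. Because the codes are plain ASCII, no collation library is needed at all, which is a big part of the appeal.

```python
# Hypothetical data: native names keyed by lowercase ISO 3166-1 alpha-2 code.
NAMES = {
    "us": "United States",
    "es": "España",
    "cn": "中国",
    "de": "Deutschland",
    "pl": "Polska",
    "ru": "Россия",
    "in": "भारत",
}


def iso_sort(names):
    """Return display strings ordered by country code, not by native name."""
    # sorted() on the (code, name) pairs compares the ASCII codes first.
    return [f"{name} ({code})" for code, name in sorted(names.items())]


for position, label in enumerate(iso_sort(NAMES), start=1):
    print(position, label)
```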

Anywho, that’s just my suggestion for a starting point. Your business case may indeed support other sort orders for countries.  But this one is reproducible and defensible, so that makes it good for programmers and business analysts alike.

Happy IDN Day!

Today is the day that internationalized domain names (IDNs) go live on the internet. As someone really interested in globalization, I think this is a huge development: it is the first time domain names on the public internet can be written entirely in non-Latin characters, top-level domain included. Arabic-speaking nations especially are loving this, and I’m sure Hebrew and Chinese language domain names will follow within hours.

I asked an Egyptian co-worker what the URL was for the Egyptian Ministry of Communication ... I didn’t even know how to search for it. Here it is:

http://وزارة-الاتصالات.مصر/. Apparently the fonts I have butcher the script.

The URL looks good when I copy/paste, but it gets turned into Punycode: http://xn--4gbrim.xn----rmckbbajlc6dj7bxne2c.xn--wgbh1c, which magically still works. Anyone else have some insight into how this works?
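
As far as I can tell, the "magic" is that DNS itself still only sees ASCII: each dot-separated label with non-ASCII characters gets encoded with Punycode and prefixed with `xn--`, and the browser converts back and forth for display. Here is a rough sketch using Python's built-in `idna` codec. The assumptions: that codec implements the older IDNA 2003 rules, which may not match what a given browser does, and the `to_ascii` / `to_unicode` helpers and the `example` label are made up for illustration. The Arabic label is written with escapes so the right-to-left text doesn't scramble the code.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII ("xn--") form, label by label."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an ASCII ("xn--") hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


# The Egyptian top-level domain from the URL above: meem, sad, reh.
EGYPT_TLD = "\u0645\u0635\u0631"

print(to_ascii(EGYPT_TLD))                 # xn--wgbh1c
print(to_unicode("xn--wgbh1c") == EGYPT_TLD)
# Plain ASCII labels pass through untouched; only the Arabic label changes.
print(to_ascii("example." + EGYPT_TLD))    # example.xn--wgbh1c
```

So the copy/pasted URL and the Punycode one are the same name as far as the resolver is concerned, which is why both work.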
