Internationalisation

There’s a useful trick that I first read about from a Microsoft developer, though I can’t remember which one. If you want to confirm that your application properly handles international characters, put the following into it:

Iñtërnâtiônàlizætiøn

Lots of non-English characters, but still readable. Since I can never remember what this phrase is, I’m recording it here for my future self.
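
A minimal sketch of that smoke test in Python 3, assuming the only thing under test is an encode/decode round trip. `SMOKE` and `round_trips` are names made up for this example; a real application would push the string through its own input, storage and output paths instead.

```python
# Smoke test: does the phrase survive an encode/decode round trip?
SMOKE = "Iñtërnâtiônàlizætiøn"


def round_trips(text, encoding="utf-8"):
    """Return True if text comes back unchanged after encoding and decoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The encoding cannot represent at least one character.
        return False


print(round_trips(SMOKE, "utf-8"))  # True
print(round_trips(SMOKE, "ascii"))  # False
```

If the second call printed True, something in the middle would be quietly throwing characters away.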

12 Responses to “Internationalisation”

  1. I associate that string with Sam Ruby: http://www.intertwingly.net/stories/2004/04/14/i18n.html

    I think he works for IBM.

    Will
  2. Will: well, that’s where I got it from this time, but I definitely saw it from an MS guy first. The MS guy might have robbed it from Sam, mind ;)

    sil
  3. Particularly if it’s a terminal app, you would probably want to test it with some double-width characters. (Kanji etc. generally take two columns of monospaced font.)

    Will
  4. (I’m not the Will of the first comment, incidentally… :-))

    Will
  5. Interestingly, those characters are all in the Latin-1 range. You might want to try using something outside Latin-1 to check that you’re really using Unicode. I tend to use “Ādam”: the first character is U+0100, which quickly shows up troublesome software.

    Dominic
  6. [Note: not either of the former Wills]

    I’m a former Mac Office tester. Although these quick-check strings are useful, you’ll need to do an immense amount of work if you’re serious about providing “full” Unicode support. (“Full” is in quotes, because *nobody*, including Microsoft and Apple, fully supports all the languages in the Unicode spec.)

    Your first task is to figure out which languages will get which levels of support in your app. For this, you’ll need to make some decisions about your target audience.

    Do you want to focus on the North American market, North America plus Europe, North America + ‘EMEA’ (MS-speak for Europe, the Middle East, and Africa), Asia, academic language researchers, etc?

    For example, do you really need to support Ethiopic/Amharic? You do if you want to sell into the African market or to Bible scholars. Do you want to sell into the Middle East? Better make sure you can handle bi-directional text editing. Do you want to sell in Japan? Add vertical text editing.

    After you’ve sorted this out, you have a list of Unicode ranges, and you have an idea of what level of functional support you’re building into your app and how much testing time you’re going to spend on each Unicode range.

    Next decision is what level of functionality you promise to your core-customer languages, your 2nd tier, your 3rd tier, and then the 12 guys who might actually want to use the Shavian phonetic alphabet Unicode range.

    Here’s an example:

    Tier 1 - All app dialogs handle Unicode ranges required for Tier 1 languages, app UI fully localized in desired language (including Date, Time, Currency and Calendar preferences), printing to wide range of standard printers tested rigorously, app registers its language preference with OS properly (if OS supports said language), intra-app search and dictionary functions support all Tier 1 languages. Text editing handles all standard input methods and directionality choices for the supported languages.

    Tier 2 - same as above, minus date/time/currency/calendar and search/dictionary services. Less time spent debugging less-commonly-used functions. Add disclaimers re: known-good functionality.

    Tier 3 - Unicode compatibility not guaranteed, but [list of basic functions] should work properly in [list of languages].

    Tier 4 - Not supported

    ============

    So — “Iñtërnâtiônàlizætiøn” is fine, but only for a very small range of fine.

    Will Parker
  7. P.S. - In the previous comment, wherever I said ‘unsupported’, take that to mean ‘it might not do what you want it to do, but we’ve checked to make sure you won’t crash the app when you try it’.

    Will Parker
  8. Will: I’m not suggesting it’s a comprehensive Unicode test suite :) It’s a smoke test; chuck that string in and if it comes out the other end you’re not doing anything massively stupid in the middle anywhere like encoding everything into ASCII.

    sil
  9. The ‘degree’ of massively stupid is very diverse among software apps / programming environments. Take this for example:

    Python:

    >>> len("Ādam")
    5
    >>> "Ādam"[1:]
    '\x80dam'

    Ruby:

    > "Ādam".length
    => 5
    > "Ādam"[1,3]
    => "\200da"

    The length should’ve been 4 and the substring should’ve returned ‘dam’ obviously.

    >>> len("Iñtërnâtiônàlizætiøn")
    27

    Maybe we’re all wrong and i18n isn’t the correct abbreviation after all : )

    These are trivial string operations but do illustrate the challenges of working with Unicode. Telling where problems will arise can be quite tricky.

    Simon de Haan
  10. Ouch, looks like my linebreaks broke. Hope the examples are still understandable.

    Simon de Haan
  11. With regard to Simon’s comments, it’s worth pointing out that at least in Python it’s possible to get these things right with stuff that comes with Python out of the box; namely, you have to specify that you’re using a Unicode string:

    # -*- coding: utf-8 -*-
    print len('Iñtërnâtiônàlizætiøn')
    print len(u'Iñtërnâtiônàlizætiøn')
    print len(u'Ādam')
    print u'Ādam'[1:]

    Will give:

    27
    20
    4
    dam

    I’m no Ruby expert, but Ruby’s Unicode is way worse, and unfortunately, Matz doesn’t like Unicode [1], and isn’t making it a priority. Instead, for the next version of Ruby he says he’s working on Character Set Independence.

    Sigh. Meanwhile, people have to jump through bizarre hoops to get Unicode to work with Ruby.

    You know, it’s funny, the one language that’s gotten Unicode right from the beginning is…

    JavaScript.

    [1] http://use.perl.org/~mr_bean/journal/29341

    Patrick Hall
  12. [...] Made up by Stuart: Iñtërnâtiônàlizætiøn. Unfortunately no Slovenian special characters in it. Can I suggest that he changes the z into ž? [...]

    outbreak » Internationalization (written on November 23rd, 2006 by Marko Mrdjenovic)
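
The checks suggested in the comments above (code points versus bytes, whether a string fits in Latin-1, and how many terminal columns it takes) can be sketched in Python 3. The `report` helper and its dictionary keys are hypothetical names for this example. The column count is only a rough approximation: it assumes East Asian wide and fullwidth characters take two columns and it ignores combining marks.

```python
import unicodedata


def report(text):
    """Summarise a few properties of text that matter for i18n smoke tests."""
    try:
        text.encode("latin-1")
        fits_latin1 = True
    except UnicodeEncodeError:
        fits_latin1 = False

    # Assumption: wide ("W") and fullwidth ("F") characters take two columns.
    columns = sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "fits_latin1": fits_latin1,
        "columns": columns,
    }


for sample in ("Iñtërnâtiônàlizætiøn", "Ādam", "漢字"):
    print(sample, report(sample))
```

The first sample reports 20 code points and 27 UTF-8 bytes and still fits in Latin-1, which is why a string like “Ādam” or some Kanji is worth adding to the test.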
