i18n is a hard and largely unsolved problem


After last week’s post about the intricacies of dealing with date and time representations in software, I promised to write about another seemingly simple yet surprisingly complex area of software development: internationalization.

Some time ago I wrote about an interesting presentation on i18n and localization in Rails by Heather Rivers of Yammer.

If you’re dealing in any way with internationalization (i18n) and localization (L10n) of software, which you basically should be if you’re into software development, have a look at the video of the presentation and her slides on SlideShare.

She raises quite a few interesting points in that talk. Usually, we’ve come to see i18n as a problem that’s largely solved in terms of software development. We have our language files (be that Java-style .properties or Ruby-esque .yml files), we write key-value pairs that map message keys to the application content, usually in the application’s primary language only, and we hand those files over to translators once we’re done. Easy job, ain’t it? Well, not quite.
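To make that conventional setup concrete, here is a minimal Python sketch of the key-value approach. The catalog contents, the key names and the translate helper are made up for illustration and are not taken from Rails or any other framework.

```python
# Hypothetical message catalogs, standing in for .properties or .yml files.
CATALOGS = {
    "en": {"greeting": "Welcome back!", "logout": "Log out"},
    "de": {"greeting": "Willkommen zurück!"},  # "logout" is not translated yet
}

DEFAULT_LOCALE = "en"  # assumed primary language of the application


def translate(key: str, locale: str) -> str:
    """Look up a key in the locale's catalog, falling back to the default locale."""
    catalog = CATALOGS.get(locale, {})
    if key in catalog:
        return catalog[key]
    return CATALOGS[DEFAULT_LOCALE][key]


print(translate("greeting", "de"))  # found in the German catalog
print(translate("logout", "de"))    # falls back to the English entry
```

A lookup like this treats every message as an opaque string, which is exactly where the trouble described in the rest of this post starts.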

First, as Heather puts it: Unlike programming languages, natural languages don’t come with a specification – unless they’re French, that is. Natural language is subject to change over time (which historical linguistics deals with extensively) as well as in terms of the context it is uttered in (pragmatics). In a way you could say the way natural language is structured is quite similar to Perl’s TIMTOWTDI paradigm. Language can never be perceived without context and this has to be taken into consideration when designing an application for internationalization right from the start: Whom do I want to address? Do I want to address her in a formal or an informal manner? This particular problem can be mostly avoided in English but will already cause some trouble in other European languages where you have different pronouns for addressing someone informally rather than formally.
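One way to plan for the formal versus informal question up front is to store one variant per form of address and let the application choose. The following Python sketch assumes a hypothetical German catalog and a made-up message helper; real i18n libraries may model this differently.

```python
# Hypothetical German catalog with one entry per form of address.
MESSAGES_DE = {
    "password_prompt": {
        "informal": "Bitte gib dein Passwort ein.",       # "du" form
        "formal": "Bitte geben Sie Ihr Passwort ein.",    # "Sie" form
    },
}


def message(key: str, register: str = "formal") -> str:
    """Return the variant of a message for the requested form of address."""
    return MESSAGES_DE[key][register]


print(message("password_prompt"))
print(message("password_prompt", register="informal"))
```

An English catalog could simply repeat the same text for both registers, which is why the problem tends to go unnoticed until the first translation is due.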

Then there’s that pesky problem that morphology (the way words are structured), syntax (the way sentences are composed) and semantics (the way in which those sentences convey meaning) can differ a lot from one language to another.

While English, for instance, is what linguists call an analytic language (meaning that it has a low morpheme-per-word ratio and conveys relations by word order rather than by inflection or agreement between words), many languages of the world sit at the other end of the spectrum: they’re called synthetic languages, with agglutinative languages as one subtype. Languages such as Turkish, Finnish or Arabic sport a high morpheme-per-word ratio and often need only a single word where other languages would use a whole sentence. While this might sound a bit academic and over-the-top for everyday programming (and probably is), such differences can indeed cause problems when considering the common interpolation pattern in i18n:

"Welcome back %{user}. You have %{number} new messages."

This pattern already causes trouble with the awful German language, and you’d probably have a hard time translating that particular pattern properly into, say, Turkish.
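Part of the trouble is that the pattern hard-codes the plural "messages", so it is already wrong for a count of one. A minimal Python sketch of a count-aware variant for English and German follows; the catalog, the new_messages helper and the informal German wording are assumptions for illustration.

```python
# Hypothetical catalog: one template per plural category and locale.
MESSAGES = {
    "en": {
        "one": "Welcome back {user}. You have {count} new message.",
        "other": "Welcome back {user}. You have {count} new messages.",
    },
    "de": {
        "one": "Willkommen zurück, {user}. Du hast {count} neue Nachricht.",
        "other": "Willkommen zurück, {user}. Du hast {count} neue Nachrichten.",
    },
}


def new_messages(locale: str, user: str, count: int) -> str:
    """Pick the template matching the count, then interpolate the values."""
    # For whole numbers, English and German only distinguish 1 from the rest.
    category = "one" if count == 1 else "other"
    return MESSAGES[locale][category].format(user=user, count=count)


print(new_messages("en", "Ana", 1))
print(new_messages("de", "Ana", 3))
```

Note that the selection rule itself is language-specific: other languages distinguish more plural categories than these two, so a real solution needs a rule per locale rather than the single comparison used here.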

These aspects boil down to a single underlying issue: if you want to translate properly from one language to another, you can’t work with words or string representations; you have to use representations of the actual meaning of particular pieces of content instead. The problem is that, most of the time, this isn’t worth the effort. Brilliant linguists have been struggling for years and still haven’t arrived at a comprehensive and consistent system for describing meaning in a language as well researched as English, let alone the less researched languages of the world.

This is why i18n, although seemingly simple on the surface, is still a hard and largely unsolved problem. Other, more technical but by no means trivial, i18n-related issues such as proper encoding of strings (preferably a Unicode encoding such as UTF-8) even pale a little in comparison. Date and time conversions for specific languages can be quite a pain, too.

So, by all means, don’t overcomplicate matters when internationalizing your application, but keep in mind that i18n isn’t as simple as it might seem at first glance.

5 Comments
  1. Frisian July 1, 2013 at 4:27 pm

    I beg to differ. I18N is only difficult if you don’t take it into account right from the beginning. If you know the locales in advance, you can ask experts and identify the particular problems.
    Even the plain Java MessageFormat class can handle numbers quite well (e.g. “no new messages”, “one new message”, “2 new messages” etc.).
