KWWidgets/i18n

Internationalization

Internationalization of an application (i18n for short) involves far more than translating its text messages into another language: date, time and currency formats need changing too, some languages are written left to right and others right to left, character encodings may differ, and many other things may need adapting. Translating messages is nevertheless a necessary first step. There is no support for i18n in KWWidgets yet, even though Kitware translated VolView into Chinese once. Let's have a look at how Unix (GNU's gettext), Windows (MFC), Qt, wxWidgets, Tk, Java and Python handle the situation.

VV-Asahi

  • A text file (say JapaneseText.txt) is filled with entries separated by an empty line. Each entry is a token/sentence in English, and the corresponding token/sentence in Japanese.
  • The text file is processed by an executable and transformed into Widgets/vtkKWTranslation.h as an array of 'const char*' pairs, one for English, the other for Japanese.
  • A subclass of vtkKWApplication holds an instance of a vtkKWTranslator class. That class is used to translate an English string at run-time, and vice-versa (the lookup complexity is O(n)).
  • KWWidget classes were actually modified manually for that project so that they translate their input on the fly, i.e. if a vtkKWLabel is passed "Welcome", it tries to translate it automatically by looking up an entry for "Welcome" in the English<->Japanese translation table. This is impractical for us.
  • This translation framework does not support compound messages (positional parameter replacements, where special chars are replaced in the translated text, similar to printf), contexts, or plural forms.

JapaneseText.txt:

...
About
バージョン

Active Contour: 
アクティブ輪郭: 

Widgets/vtkKWTranslation.h:

#define NUMBER_OF_TRANSLATION_LABELS 405
const char* TRANSLATION[][2] = {
...
{"About","バージョン"},
{"Active Contour: ","アクティブ輪郭: "},

Widgets/vtkKWLabel.cxx:

void vtkKWLabel::SetLabel(const char* l)
{
...
  if (this->Application->GetLanguage() == VTK_KW_LANGUAGE_JAPANESE)
    {
    newLabel = this->Application->GetTranslator()->Translate(l);
    }
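
The text-to-header step described above can be sketched as follows. This is a minimal illustration in Python, not the actual VV-Asahi tool: the function names (parse_entries, emit_header, translate) are hypothetical, the sample strings are borrowed from the fr.po example further down, and falling back to the untranslated string when no entry matches is an assumption.

```python
def parse_entries(text):
    """Parse entries separated by an empty line.

    Line 1 of each entry is the English string, line 2 its translation.
    """
    pairs = []
    for block in text.strip("\n").split("\n\n"):
        lines = block.split("\n")
        if len(lines) >= 2:
            pairs.append((lines[0], lines[1]))
    return pairs


def c_escape(s):
    """Escape backslashes and double quotes for use in a C string literal."""
    return s.replace("\\", "\\\\").replace('"', '\\"')


def emit_header(pairs):
    """Render the pairs as an array of 'const char*' pairs."""
    lines = [
        "#define NUMBER_OF_TRANSLATION_LABELS %d" % len(pairs),
        "const char* TRANSLATION[][2] = {",
    ]
    for english, translated in pairs:
        lines.append('{"%s","%s"},' % (c_escape(english), c_escape(translated)))
    lines.append("};")
    return "\n".join(lines)


def translate(pairs, english):
    """Linear O(n) scan of the table, as in the framework described above.

    Assumption: the input is returned unchanged when no entry matches.
    """
    for source, translated in pairs:
        if source == english:
            return translated
    return english


if __name__ == "__main__":
    sample = "Hello, world!\nBonjour, le monde!\n\nQuit\nQuitter\n"
    print(emit_header(parse_entries(sample)))
```

Keeping the table as a list of pairs mirrors the generated array; swapping it for a dictionary would remove the O(n) lookup at the cost of no longer matching the header layout one to one.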

gettext

GNU's gettext, part of the GNU Translation Project, is "a set of tools that provides a framework to help other GNU packages produce multi-lingual messages. These tools include a set of conventions about how programs should be written to support message catalogs, a directory and file naming organization for the message catalogs themselves, a runtime library supporting the retrieval of translated messages, and a few stand-alone programs to massage in various ways the sets of translatable strings, or already translated strings".


Note: the runtime library is LGPL, "[...] This means in particular that even non-free programs can use `libintl' as a shared library, whereas only free software can use `libintl' as a static library or use modified versions of `libintl'. [...]" (see gettext-runtime/COPYING, ABOUT-NLS). The few stand-alone programs required to massage the translatable strings are GPL (see gettext-tools/COPYING), but are not required to run your localized application. The runtime library also depends on the libiconv library, which is LGPL too (see COPYING.LIB).
  • The first step is identifying, right in the C sources, the strings that are meant to be translated and those that are not, using some macro/marker. In principle a plain call to gettext() would do, but a shorthand syntax (say, '_()') helps keep internationalized programs legible.
  • xgettext extracts all marked messages from a set of C/C++ files and initializes a PO file with empty translations. PO files are meant to be read and edited by humans, and associate each original, translatable string with its translation in a particular target language. A single PO file is dedicated to a single target language, and records exactly where in the C sources each string is used. If a translation is not found, the msgid itself is used, which is most of the time in English (or another native language), so there is no real need for an English PO file. Because the msgid is the native string itself, it has to be unique. This makes the translation of short words tricky, since they can collide (see the note about GUIs below).

fr.po:

#: hello.c:31
msgid "Hello, world!"
msgstr "Bonjour, le monde!"

#: hello.c:32
#, c-format
msgid "This program is running as process number %d."
msgstr "Ce programme est exécuté en tant que processus numéro %d."

hello.c:

#define _(string) gettext (string)
printf ("%s\n", _("Hello, world!"));
printf (_("This program is running as process number %d."), getpid ());
  • The PO file (Portable Object) is converted later on to a binary MO file (Machine Object) used at run-time for faster querying/retrieval. In the MO file, the original strings are sorted, which enables a simple binary search when the MO file does not contain a hash table.
  • Translation units belong to "domains". Domains actually refer to unique applications. A domain can be selected at run-time, for example when dealing with messages from a library, as these have to be independent of the current domain set by the application. This can be done with a call to: char *dgettext (const char *domain_name, const char *msgid);
  • Compound messages: positional printf-style parameters are used to replace values in the translated string. For example, %1$d is similar to %d but explicitly binds this specification to the first parameter passed to printf in the source code, so that it can be referred to after another parameter in the translated string, such as "Only %2$d bytes free on '%1$s'.", which is semantically equivalent to "'%s' has only %d bytes free.". This is a POSIX/XSI feature and is not specified by ISO C99. Linux supports the syntax; Microsoft compilers did not until Visual C++ 2005's _printf_p (gettext provides replacement functions though).
  • Plural forms: there is fairly advanced support for plural forms. The plural form of a noun cannot simply be constructed by adding an `s' (or something else) to a word: plural forms can vary depending on the number of items (say 0, 1, 2, a range, etc.), and their handling differs widely between language families. The function: char * ngettext (const char *msgid1, const char *msgid2, unsigned long int n) is similar to the gettext function as it browses the message catalogs in the same way, but takes two extra arguments. The msgid1 parameter must contain the singular form of the string to be converted; it is also used as the key for the search in the catalog. The msgid2 parameter is the plural form. The parameter n is used to determine the plural form. If no message catalog is found, msgid1 is returned if n == 1, otherwise msgid2.
printf (ngettext ("found %d fatal error", "found %d fatal errors", n), n);
#: src/msgcmp.c:338 src/po-lex.c:699
#, c-format
msgid "found %d fatal error"
msgid_plural "found %d fatal errors"
msgstr[0] "s'ha trobat %d error fatal"
msgstr[1] "s'han trobat %d errors fatals"
  • The solution implemented is to allow the translator to specify the rules of how to select the plural form. This information about the plural form selection has to be stored in the header entry of the PO file (the one with the empty msgid string). The plural form information looks like this: Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1; The nplurals value must be a decimal number which specifies how many different plural forms exist for this language. The string following plural is an expression using the C language syntax. Exceptions are that no negative numbers are allowed, numbers must be decimal, and the only variable allowed is n. This expression is evaluated whenever one of the functions ngettext, dngettext, or dcngettext is called. The numeric value passed to these functions is then substituted for all uses of the variable n in the expression. The resulting value must be greater than or equal to zero and smaller than the value of nplurals.
  • Contexts: one place where the gettext functions, if used normally, have big problems is within programs with graphical user interfaces (GUIs). Many of the strings which have to be translated are very short, since they have to appear in pull-down menus, which restricts their length, and short strings are easily ambiguous. One solution is to artificially lengthen the strings to make them unambiguous, for example Menu|File|Open and Menu|Edit|Copy. But what would the program do if no translation is available? The lengthened string is not what should be printed, so provided that a unique syntax is followed (using '|' for example), a modified version of the gettext functions can be called to use the last part of the string as the default string (in our example: Open, Copy).
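
The '|' convention can be sketched in a few lines. The catalog below is a plain dictionary standing in for a real message catalog, and context_gettext is a hypothetical name for the "modified version of the gettext functions" mentioned above, not a function of the gettext library.

```python
def context_gettext(catalog, msgid, separator="|"):
    """Look up the full, context-prefixed msgid in the catalog.

    If no translation exists, return only the part after the last
    separator, so that "Menu|File|Open" falls back to "Open".
    """
    translated = catalog.get(msgid)
    if translated is not None:
        return translated
    return msgid.rsplit(separator, 1)[-1]


if __name__ == "__main__":
    # Illustrative French catalog holding a single context-prefixed entry.
    french = {"Menu|File|Open": "Ouvrir"}
    print(context_gettext(french, "Menu|File|Open"))
    print(context_gettext(french, "Menu|Edit|Copy"))
```

A msgid without any separator is returned unchanged, so the same function can serve strings that need no context.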

MFC - Visual Studio .NET 2003

  • MFC version 7.0 provides support for satellite DLLs, a feature that helps in creating applications localized for multiple languages. A satellite DLL is a resource-only DLL that contains an application's resources localized for a particular language. A resource file has to be created for each language and is loaded at run-time.
  • Resource files (.rc) aggregate the description of all UI elements in a text-form. They are not meant to be created by hand but using a visual editor. Resource files are compiled and linked to the application (or into satellite DLLs).
  • Each element in a resource file has a specific ID. A resource file is parsed and converted to a header file that associates this ID to a unique number using a #define. The application source code includes that header and uses that ID to refer to a specific resource element.
  • A string table is a Windows resource that contains a list of IDs, values, and captions for all the strings of an application. One can have several string tables — one for each language or condition. However, an executable module has only one string table. A running application can reference several string tables if you put the tables into different DLLs.

Main.rc:

STRINGTABLE 
BEGIN
    IDS_APP_TITLE           "Main"
    IDS_HELLO               "Hello World!"
END

Resource.h:

#define IDS_APP_TITLE                   103
#define IDS_HELLO                       104

Main.cpp:

LoadString(hSatDLL, IDS_HELLO, szHello, MAX_LOADSTRING);
DrawText(hdc, szHello, _tcslen(szHello), &rt, DT_LEFT);
  • This translation framework does not seem to support contexts, plural forms or compound messages (positional parameter replacements, where special chars are replaced in the translated text, similar to printf; these were only introduced very recently with Visual C++ 2005's _printf_p).
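
To make the ID-based model concrete, here is a rough analogy in Python rather than Win32 code. The dictionaries stand in for string tables, load_string is a hypothetical helper (it is not the Win32 LoadString call), the French text is illustrative, and falling back to the main module's table when a satellite table lacks an ID is an assumption of this sketch.

```python
# Numeric IDs, as generated in Resource.h.
IDS_APP_TITLE = 103
IDS_HELLO = 104

# String table of the main module (native language).
DEFAULT_TABLE = {
    IDS_APP_TITLE: "Main",
    IDS_HELLO: "Hello World!",
}

# One table per language, standing in for satellite DLLs (illustrative data).
SATELLITE_TABLES = {
    "fr": {IDS_HELLO: "Bonjour le monde !"},
}


def load_string(language, string_id):
    """Return the string for string_id in the given language.

    Assumption: when the language has no table, or its table lacks the
    ID, the main module's string is used instead.
    """
    table = SATELLITE_TABLES.get(language, {})
    if string_id in table:
        return table[string_id]
    return DEFAULT_TABLE[string_id]


if __name__ == "__main__":
    print(load_string("fr", IDS_HELLO))
    print(load_string("fr", IDS_APP_TITLE))
```

Note that the source code only ever mentions IDS_HELLO: unlike gettext, the native-language text lives in a table too, which is why a native-language catalog is mandatory with this approach.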

Qt

Check Qt's Linguist Manual: Programmers page for explanations and tutorials, as well as the Internationalization with Qt page. Qt behaves pretty much like GNU's gettext but uses its own set of tools and file formats.

  • Translation files consist of all the user-visible text and key accelerators in an application and translations of that text.
  • The lupdate utility is run initially to generate a first set of .ts translation source files with all the user-visible text but no translations (this is similar to creating/extracting PO files using xgettext). User-visible strings are marked as translation targets by wrapping them in a tr() call (similar to _() for gettext).
  • The .ts files are given to the translator who adds translations using Qt Linguist.
  • lupdate is run to incorporate any new text added to the application. lupdate synchronizes the user-visible text from the application with the translations; it does not destroy any data (similar to gettext's merge).
  • lrelease is called to obtain a light-weight message file (a .qm file) from the .ts file, suitable only for end use. You can see the .ts files as "source files", and .qm as "object files" (similar to gettext's MO files). The translator edits the .ts files, but the users only need the .qm files. Both kinds of files are platform and locale independent.
  • Load the translation into the application:
int main( int argc, char **argv )
    {
        QApplication app( argc, argv );
        QTranslator translator( 0 );
        translator.load( "tt1_la", "." );
        app.installTranslator( &translator );
  • Translate the string using tr():
QPushButton *button = new QPushButton( tr("&Quit"), this);
 
  • Contexts: the lupdate program automatically provides a context for every source text. This context is the class name of the class that contains the tr() call. This is sufficient in the vast majority of cases. Sometimes, however, the translator will need further information to uniquely identify a source text; for example, in a dialog with two separate frames that each contain an "Enabled" option, each option needs to be identified because in some languages the translation would differ between the two. This is achieved using the two-argument form of tr(). This is actually an improvement over gettext, which does not provide a context (even though the PO file is aware of the location of the original string) and does not allow an extra context to be specified (gettext()'s workaround is to use a syntax like "Color frame|Enabled" and "Hue frame|Enabled", and a modified function that returns the last part of the '|'-separated string if no translation is found).
    rbc = new QRadioButton( tr("Enabled", "Color frame"), this );
    rbh = new QRadioButton( tr("Enabled", "Hue frame"), this );
  • If you need to have translatable text completely outside a function, there are two macros to help: QT_TR_NOOP() and QT_TRANSLATE_NOOP(). These macros merely mark the text for extraction by lupdate. The macros expand to just the text (without the context). Again, similar to gettext().
    QString FriendlyConversation::greeting( int greet_type )
    {
        static const char* greeting_strings[] = {
            QT_TR_NOOP( "Hello" ),
            QT_TR_NOOP( "Goodbye" )
        };
        return tr( greeting_strings[greet_type] );
    }
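  • Since this framework is close to gettext(), it presumably supports compound messages (provided that printf() is POSIX/XSI; Qt's own QString::arg(), with numbered place markers such as %1 and %2, serves the same purpose without relying on printf()), and probably plural forms.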
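
Why positional parameters matter for compound messages can be shown independently of any toolkit. In this sketch Python's str.format() index fields ({0}, {1}) stand in for printf's %1$s / %2$d or for numbered place markers; the catalog is a plain dictionary and tr is a hypothetical lookup helper, not Qt's tr().

```python
# The "translation" reorders the two parameters of the source string,
# reusing the example sentences from the gettext section.
CATALOG = {
    "'{0}' has only {1} bytes free.": "Only {1} bytes free on '{0}'.",
}


def tr(msgid):
    """Return the translation of msgid, or msgid itself if there is none."""
    return CATALOG.get(msgid, msgid)


def disk_message(volume, free_bytes):
    # The caller always passes the arguments in source order; the indexes
    # inside the translated string decide where each one ends up.
    return tr("'{0}' has only {1} bytes free.").format(volume, free_bytes)


if __name__ == "__main__":
    print(disk_message("C:", 42))
```

With purely sequential placeholders the translator would be forced to keep the volume name before the byte count, which is exactly the constraint positional parameters remove.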

wxWidgets (wxWindows)

Check wxWidgets' i18n page. The wxWidgets approach to i18n closely follows the GNU gettext package. wxWidgets uses message catalogs that are binary compatible with gettext catalogs. See the gettext section above about PO/MO files, xgettext, etc.: the tools and methodologies are exactly the same. wxWidgets uses the same macro _() to mark/translate strings.

Tk

Check How to Use Tcl 8.1 Internationalization Features.

  • The msgcat package provides a set of functions for managing multilingual user interfaces. It allows you to define strings in a message catalog, which is independent from your application or package and which you can edit or localize without modifying the application source code.
  • The basic principle of the msgcat package is that you create a set of message files, one for each supported language, containing localized versions of all the strings your application or package can display.
  • Optionally set the locale using the ::msgcat::mclocale command.
  • Call ::msgcat::mcload to load the appropriate message files.
  • Instead of using a string directly, call the ::msgcat::mc command to return a localized version of the string you want. The mc command takes as an argument a source string and returns the translation of that string in the current locale.
::msgcat::mclocale "en_UK"
::msgcat::mcload [file join [file dirname [info script]] msgs]
puts [::msgcat::mc "Welcome to Tcl!"]
  • To use the msgcat package, you need to prepare a set of message files for your package or application, all contained within the same directory. The name of each message file is a locale specifier followed by the extension ".msg" (for example, es.msg for a Spanish message file).
  • Each message file contains a series of calls to ::msgcat::mcset to set the translation strings for that language. The format of the mcset command is: ::msgcat::mcset locale src-string ?translation-string?. The mcset command defines a locale-specific translation for the given src-string. If no translation-string argument is present, then the value of src-string is also used as the locale-specific translation string.
::msgcat::mcset es "Welcome to Tcl!" "¡Bienvenido a Tcl!"
::msgcat::mcset es "Select a color:" "Elige un color:"
  • This translation framework does not seem to support compound messages (positional parameter replacements, where special chars are replaced in the translated text, similar to printf), contexts, or plural forms.

Java

Check Trail: Internationalization, Java Internationalization: An Overview.

  • The internationalization feature of the JDK provides a mechanism for separating user interface (UI) elements and other locale-sensitive data from the application logic in a program. The JDK uses resource bundles to isolate localizable elements from the rest of the application. The resource bundle contains either the resource itself (also called properties) or a reference to it. With all resources separated into a bundle, the Java application simply loads the appropriate bundle for the active locale. This is similar to message catalogs.
  • Resource bundle names have two parts: a base name and a locale suffix. For example, suppose you create a resource bundle named MyBundle. This original MyBundle will be your default bundle, the one used when others cannot be found. In addition to the default bundle, you'll create other bundles for different locales, say MyBundle_ja_JP and MyBundle_fr_FR.

I18NSample.java:

Locale currentLocale;
ResourceBundle messages;
currentLocale = new Locale(language, country);
messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);

System.out.println(messages.getString("greetings"));
System.out.println(messages.getString("inquiry"));
System.out.println(messages.getString("farewell"));

MessagesBundle.properties:

greetings = Hello.
farewell = Goodbye.
inquiry = How are you?

MessagesBundle_fr_FR.properties:

greetings = Bonjour.
farewell = Au revoir.
inquiry = Comment allez-vous?
  • Compound messages: a compound message may contain several kinds of variables: dates, times, strings, numbers, currencies, and percentages. To format a compound message in a locale-independent manner, you construct a pattern that you apply to a MessageFormat object, and store this pattern in a ResourceBundle. This is similar to the positional printf-style arguments found in gettext, but more verbose.

MessageBundle_en_US.properties:

template = At {2,time,short} on {2,date,long}, we detected {1,number,integer} spaceships on the planet {0}.
planet = Mars

Example.java:

ResourceBundle messages = ResourceBundle.getBundle("MessageBundle",currentLocale);

Object[] messageArguments = {
  messages.getString("planet"),
  new Integer(7),
  new Date()
};

MessageFormat formatter = new MessageFormat("");
formatter.setLocale(currentLocale);

formatter.applyPattern(messages.getString("template"));
String output = formatter.format(messageArguments);

Output:

At 1:15 PM on April 13, 1998, we detected 7 spaceships on the planet Mars.
  • Plural forms: the words in a message may vary if both plural and singular word forms are possible. With the ChoiceFormat class, you can map a number to a word or a phrase, allowing you to construct grammatically correct messages.

Output:

There are no files on XDisk.
There is one file on XDisk.
There are 2 files on XDisk.

ChoiceBundle_en_US.properties:

noFiles = are no files
oneFile = is one file
multipleFiles = are {2} files
pattern = There {0} on {1}.

Example.java:

MessageFormat messageForm = new MessageFormat("");
messageForm.setLocale(currentLocale);

double[] fileLimits = {0,1,2};
String [] fileStrings = {
    bundle.getString("noFiles"),
    bundle.getString("oneFile"),
    bundle.getString("multipleFiles")
};

// ChoiceFormat maps each element in the double array to the element in the
// String array that has the same index. In the sample code the 0 maps to the
// String returned by calling bundle.getString("noFiles"). By coincidence the
// index is the same as the value in the fileLimits array.

ChoiceFormat choiceForm = new ChoiceFormat(fileLimits, fileStrings);

String pattern = bundle.getString("pattern");
messageForm.applyPattern(pattern);

Format[] formats = {choiceForm, null, NumberFormat.getInstance()};
messageForm.setFormats(formats);

Object[] messageArguments = {null, "XDisk", null};
for (int numFiles = 0; numFiles < 4; numFiles++) {
    messageArguments[0] = new Integer(numFiles);
    messageArguments[2] = new Integer(numFiles);
    String result = messageForm.format(messageArguments);
    System.out.println(result);
}
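
The selection logic behind ChoiceFormat can be approximated in a few lines of Python. This is a simplified sketch of the idea (each limit opens a half-open interval, and the number picks the interval it falls into), not a port of the Java class; choose and describe are hypothetical names, and the strings are those of ChoiceBundle_en_US.properties above.

```python
import bisect

FILE_LIMITS = [0, 1, 2]
FILE_STRINGS = ["are no files", "is one file", "are {2} files"]
PATTERN = "There {0} on {1}."


def choose(limits, strings, number):
    """Pick strings[i] such that limits[i] <= number < limits[i + 1].

    Numbers below the first limit use the first string; numbers at or
    above the last limit use the last string.
    """
    index = bisect.bisect_right(limits, number) - 1
    return strings[max(index, 0)]


def describe(num_files, disk):
    phrase = choose(FILE_LIMITS, FILE_STRINGS, num_files)
    # The chosen phrase may itself contain {2}, so it is formatted first
    # and then inserted into the outer pattern.
    phrase = phrase.format(None, None, num_files)
    return PATTERN.format(phrase, disk)


if __name__ == "__main__":
    for n in range(4):
        print(describe(n, "XDisk"))
```

The two-stage formatting mirrors the Java sample, where the same number is passed both as argument 0 (to select the phrase) and as argument 2 (to be printed inside it).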

Python

Check 6.27 gettext -- Multilingual internationalization services. The gettext module provides internationalization (I18N) and localization (L10N) services for your Python modules and applications. It supports both the GNU gettext message catalog API and a higher level, class-based API that may be more appropriate for Python files. The function calls, file formats, and methodologies are the same as gettext().
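
As a small illustration of the class-based API, the sketch below asks for a French catalog that does not exist and relies on fallback=True, in which case gettext.translation() returns a NullTranslations object that hands back the original strings. The domain name "myapp" and the directory name are made up for the example, and the directory is assumed not to exist.

```python
import gettext

# With fallback=True, translation() returns a NullTranslations instance
# instead of raising an error when no .mo catalog is found.
translations = gettext.translation(
    "myapp",                          # hypothetical domain name
    localedir="no-such-locale-dir",   # hypothetical, assumed to be missing
    languages=["fr"],
    fallback=True,
)
_ = translations.gettext


def greeting():
    # Untranslated: the msgid itself comes back.
    return _("Hello, world!")


def error_summary(count):
    # Without a catalog, ngettext returns msgid1 when count == 1,
    # otherwise msgid2.
    template = translations.ngettext(
        "found %d fatal error", "found %d fatal errors", count
    )
    return template % count


if __name__ == "__main__":
    print(greeting())
    print(error_summary(1))
    print(error_summary(3))
```

This is the "no-op until translated" behavior discussed in the conclusion: the code can be marked up with _() long before any catalog is written.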

Conclusion

GNU's gettext provides the most comprehensive support for i18n, and many languages/libraries actually rely on that framework. Several features are worth mentioning:

  • Markers: all APIs provide functions to retrieve the translation of a token out of a message catalog or a resource file. The token itself comes in two flavors: Java and MFC use C-style IDs to refer to the corresponding translatable/translated strings (e.g. IDS_HELLO), whereas GNU's gettext as well as Tk use the translatable strings themselves as identifiers (e.g. "Hello, world!"). The former requires an explicit message catalog for the native language, whereas the latter can use the translatable string itself if no translation is found, as a default fallback to the native language (here, English). In other words, the latter can be used to code a project without worrying about i18n just yet, since the translation call would pretty much be a no-op. The second method also seems to make it easier for developers and translators to work on different sides of the project: developers need not bother creating unique IDs, and can just keep adding strings directly in the native language. The code is also less obfuscated that way, since the translatable string itself provides some information about the code, whereas a C-style ID may require going back and forth between the code and the message catalog. On the other hand, using translatable strings as IDs requires the translatable strings to be repeated in each message catalog, and therefore involves more sophisticated extraction and updating tools.

MFC, Main.rc:

STRINGTABLE 
BEGIN
    IDS_HELLO               "Hello World!"
END

MFC, Main.cpp:

LoadString(hSatDLL, IDS_HELLO, szHello, MAX_LOADSTRING);
DrawText(hdc, szHello, _tcslen(szHello), &rt, DT_LEFT);

gettext, fr.po:

#: hello.c:31
msgid "Hello, world!"
msgstr "Bonjour, le monde!"

gettext, hello.c:

printf ("%s\n", _("Hello, world!"));
  • Extraction: all gettext-based APIs provide tools that extract marked messages (translatable strings) from a set of C/C++ files and initialize a message catalog (be it a PO file, .msg file, or .ts file) with empty translations. Message files are meant to be read and edited by humans, and associate each original, translatable string with its translation in a particular target language. Qt provides a fairly elaborate UI front-end called Qt Linguist. Tk and Java do not seem to provide any such tools and require the user to keep all catalogs in sync (which is made a little easier for MFC and Java since they use C-style IDs to refer to translatable strings). In any case, while an extraction tool sounds easy to write, an update tool will probably require more smarts but will be just as important (say, if the code has changed or moved, how do you link the old translated string back to the translatable string?).
  • Context: some APIs require a single message catalog per application. Others allow a message catalog per DLL/library. Others allow resource files to refer to or include other resource files, recursively. In many cases, it boils down to the fact that the IDs of the translatable strings can collide, especially if they are represented by the translatable strings themselves and the translatable words are short or common (say, "Open" in a File menu). Java and MFC are more or less immune to this since they use C-style IDs. GNU's gettext solves part of this problem by associating 'domains' with catalogs (i.e., you can specify one "Open" in a library domain, and another one in an application domain). One can also choose to artificially lengthen the translatable string and use only the last part of the string as the default translated string if no translation is found (say, "Menu|File|Open" will default to "Open"). Qt goes the extra mile by automatically providing each translatable string with a context corresponding to the class where the string was found. An optional "extra-context" string can also be provided (say, if you have two "Enabled" buttons in two frames coded in the same class, you could use tr("Enabled", "Color frame") and tr("Enabled", "Hue frame") to differentiate them).
  • Compilation: some APIs require the message catalogs or resource files to be compiled into a binary form for faster querying/retrieval. These compiled catalogs are either loaded at run-time (GNU's gettext), and therefore have to be installed/bundled, or linked directly into the application (MFC). Some APIs can work with the native text-only resource directly (Tk, and to some extent Java). Since we have tried so far at Kitware to include our resources directly into the binary (by transforming images into C headers for example), we should probably go down that path too and convert the message catalogs into a single header file.
  • Compound messages: GNU's gettext and Java provide support for positional parameters inside translatable strings. Java uses a quite verbose scheme; GNU's gettext relies on a POSIX/XSI extension to printf(), so that "Only %2$d bytes free on '%1$s'." is semantically equivalent to "'%s' has only %d bytes free.". This extension was not supported by Microsoft compilers until very recently with Visual C++ 2005's _printf_p, but it could be emulated. In any case, it is an important feature, since one cannot expect the parameters in the translated string to come in the same order as in the translatable string.
  • Plural forms: arguably a fairly complex feature, as plural forms can vary depending on the number of items (say 0, 1, 2, a range, etc.). At the end of the day, it just cannot be reduced to appending a character to a singular form: the handling of plural forms differs widely between language families. Tk and MFC do not seem to offer much here. Java has a pretty sophisticated yet verbose set of methods to deal with the problem. GNU's gettext (and the APIs using gettext) also offers a sophisticated set of functions to handle plural forms, and they seem reasonable to implement.
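
To give an idea of how little code the gettext-style plural selection needs, here is a minimal sketch. It hard-codes the rule from the header shown earlier (nplurals=2; plural=n == 1 ? 0 : 1) as a Python function instead of parsing the C expression, uses a dictionary in place of a compiled catalog, and reuses the msgstr[0]/msgstr[1] strings from the PO excerpt above; ngettext_sketch is a hypothetical name, not the library function.

```python
NPLURALS = 2


def plural_index(n):
    """Python equivalent of the C expression `n == 1 ? 0 : 1`."""
    return 0 if n == 1 else 1


# msgid (singular form) -> list of plural forms, indexed like msgstr[i].
CATALOG = {
    "found %d fatal error": [
        "s'ha trobat %d error fatal",
        "s'han trobat %d errors fatals",
    ],
}


def ngettext_sketch(msgid1, msgid2, n):
    """Select the plural form for n, keyed on the singular msgid1.

    Without a catalog entry, msgid1 is returned if n == 1, otherwise
    msgid2, as described for ngettext above.
    """
    forms = CATALOG.get(msgid1)
    if forms is None:
        return msgid1 if n == 1 else msgid2
    index = plural_index(n)
    if not 0 <= index < NPLURALS:
        raise ValueError("plural rule returned an out-of-range index")
    return forms[index]


if __name__ == "__main__":
    for count in (1, 3):
        template = ngettext_sketch(
            "found %d fatal error", "found %d fatal errors", count
        )
        print(template % count)
```

A language with more than two forms would only change NPLURALS, plural_index and the length of each list; the lookup itself stays the same, which is what makes the scheme reasonable to implement.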


