International Support mini-HOWTO 2.2.2 - June 29 2000 Jim Hall, _____________________________________________________________________ 0. ABOUT INTERNATIONAL SUPPORT Most programmers never think about their programs being used by users in other countries, or that the program would be unusable by someone who does not speak English. By the true nature of hackers, programs are first written to fill a personal need. To extend a program to support multiple languages is something that many programmers never consider. However, adding international support to your program makes it more useful. Now, your program becomes accessible to more people than just those who speak your native tongue. But how to add international support to your programs? This mini- HOWTO intends to show you what you need to do to add international support (also called "internationalization" or i18n) easily to your projects. Internationalization (for the end-user) is provided in traditional DOS systems in two ways: 1. Keyboard support (KEYB or XKEYB) to add keyboard definitions for international keyboards. 2. Code page support (CHCP) to change to a code page that specifically supports your language characters, if the default code page does not. These two methods are not enough for most applications. A third method was defined by ANSI C and POSIX to assist application developers write programs that support internationalization: 3. setlocale() to set the locale for your programs. This affects things like printf() and strcoll() But this does not help your program produce messages in a language that an international user can understand. To do that, you need a fourth method: 4. message catalogs There are several ways to do this (MSGLIB and "Cats" are only two of them.) The "Cats" interface is documented here. _____________________________________________________________________ 1. KEYBOARD (content needed) _____________________________________________________________________ 2. CODE PAGE (content needed) _____________________________________________________________________ 3. LOCALE A locale is a set of language and cultural rules. These cover aspects such as language for messages, different character sets, lexigraphic conventions, numerical formatting, etc. A program needs to be able to determine its locale and act accordingly to be portable to different cultures. Locale is defined by ANSI C and POSIX, and is not implemented by "Cats". This information is provided only for completeness. The locale consists of two basic interfaces: setlocale() to set your program's idea of its locale localeconv() get information about number formatting The function setlocale() has the following usage: char * setlocale (int category, char *locale); setlocale() sets the locale category to the new locale, and returns the value of the previous locale. There are different categories for local information a program might need (declared as macros.) Using them as the "category" argument to the setlocale() function, it is possible to set one of these to the desired locale: LC_ALL for all of the locale. LC_COLLATE for the functions strcoll() and strxfrm(). LC_CTYPE for the character classification and conversion routines. LC_MONETARY for localeconv(). LC_NUMERIC for the decimal character. LC_TIME for strftime(). NULL if the request cannot not be honored. This string may be allocated in static storage. If the "locale" argument to setlocale() is null, the default locale is (on many systems) determined using the following rules: 1. The environment variable LC_ALL, if available 2. If an environment variable with the same name as one of the categories above exists, its value is used for that category. 3. The environment variable LANG, if available Be aware that Borland C 3.1 (DOS) only supports the "C" locale, so invoking setlocale() will have no effect. Other compilers may or may not support multiple locales, or may "fake it" like BC31 does. Check your compiler's documentation. Now, let's use setlocale() in a program. In this example, we have a program that needs to print the value "1.5" to the user. In the English locale, you use a decimal point (".") to separate the integer from the fractional part of the number. In another locale (say, Spanish) you use a comma (",") as the separator. Making a call to setlocale() makes this transparent to the program: /* lc.c */ #include #include /* setlocale */ int main (void) { float x; x = 1.5; setlocale (LC_NUMERIC, ""); printf ("%5.2f\n", x); exit (0); } On a system that fully supports setlocale(), the output might look like this: C:\>lc 1.50 -or- C:\>SET LANG=es C:\>lc 1,50 A program may be made portable to all locales by calling setlocale(LC_ALL, "" ) after program initialization. _____________________________________________________________________ 4. MESSAGE CATALOGS A long time ago, I had written an implementation of UNIX catgets() based on an in-memory key-value database. This library is called "Cats" (message CATalog System), a DOS implementation of catgets(). "Cats" is available at http://www.isd.net/jhall1/freedos/cats For those who don't know about the UNIX catgets() function, you just use it to return a pointer to a localized string based on a message number. All your program's messages are stored in a file, with message numbers and set numbers. "Hello world" might be message 1 in set 1. Your copyright statement might be message 2 in set 1. "Failure writing to drive A:" might be message 4 in set 7. Before you use catgets() you must first open a message catalog file. You do this with the catopen() function: nl_catd catopen (char *catalog_file, int flags); "Cats" assumes environment variables to point to the location of the message catalogs, so that "Cats" uses the environment variables when you call catopen(). If the "catalog_file" parameter does not contain a directory separator ("\"), then NLSPATH is the directory in which you keep your message catalogs, and LANG is the country code abbreviation. This is similar to the UNIX implemenation. The "flags" parameter is ignored in this implementation of "Cats". The "flag" argument is used (on UNIX systems) to indicate the type of loading desired. This should be either MCLoadBySet or MCLoadAll, and control if only the required set from the catalog is loaded into memory on an as-needed basis or if catopen() should load the entire catalog into memory. catopen() returns a message catalog descriptor of type nl_catd on success. On failure, it returns -1. "Cats" can only have one message catalog open at a time. Once you have a message catalog available, you use catgets() like this: char * catgets (nl_catd catalog, int msg_set, int msg_num, char *default); That is, you tell catgets() to retrieve a message for you from the message catalog based on a set number and message number. catgets() returns a pointer to that string. If catgets can't find the set/message, it returns the default string that you passed it. The best way to learn how to use "Cats" is to download the library, and look at the sample programs in src/ Here's a sample C program to show you how to use it: /* fail.c */ #include #include "catgets.h" /* catopen/catgets */ int main (void) { char *s; nl_catd cat; /* catalog descriptor */ cat = catopen ("fail", MCLoadAll); /* MCLoadAll is ignored */ s = catgets (cat, 7, 4, "Failure writing to drive A:"); printf ("%s\n"); catclose (cat); exit (0); } The above sample program would have this execution: C:\>fail Failure writing to drive A: -or- C:\>SET LANG=es C:\>fail Incidente que escribe a A: My message catalogs are simple ascii, and look like this: (File=FAIL.EN) 1.1:Hello world 7.4:Failure writing to drive A: -or- (File=FAIL.ES) 1.1:Hola mundo 7.4:Incidente que escribe a A: Since message catalogs are plain ascii files, it will be easy for a user to take one "catalog" and create a translation that can *immediately* be used by another user (i.e. you don't need to "re-compile" the message catalog before you use it.) ********************************************************************* APPENDICES ********************************************************************* _____________________________________________________________________ A. OTHER SOURCES OF INFORMATION: MSGLIB Steffen also has a version of MSGLIB that does much the same thing as "Cats", but uses binary ("compiled") message catalogs. Steffen Kaiser writes this about MSGLIB: This is an offspring of the internationalization debate back in 1994 or 1995. The current release to be used is located at: ftp://ftp-fd.inf.fh-rhein-sieg.de/pub/local/ALPHA/msglib.zip A non-Alpha release is at: ftp://ftp-fd.inf.fh-rhein-sieg.de/pub/local/msglib31.zip It does support locally stored message strings only (see below for details). MSGLIB is desgined to overcome three problems: 1. where to retrieve the message strings from (msg retriever), 2. to tweak the message string that the function to display a string looks the same regardless what language is choosen (msg interpreter), 3. to attach a semantic when display a message (msg visualisor). Parts 1 and 2 can be independed on each other; part 3 joins both together. That means one could use the msg retriever stand-alone. The msg retriever currently supports two methods, a third is currently started, but postponed. Method #1 stores all msg strings statically in the program code. This was the easiest method to implement. Method #2 reads the msg strings from a file, what can be a stand-alone file, attached to the executable, even embedded into any data file. The "recommended" way (that means the way shown in exmples makes no use of the latter ability). Method #3 joins methods #1 and #2 together and tries to read a requested msg string from a file, on failure use an internally stored one -- this would be the same as the HP-style "catgets()". Currently MSGLIB does support the catgets() function already. But I never thought that to address a msg string by a number a good thing. Instead the user of MSGLIB assigns a symbol to the message, like "E_noMem" or "M_red". Regardless what method of the msg rertiever is in use, the statement: char *p; p = msgLock(E_noMem); makes sure that 'p' is pointing to the msg string stored in memory. To free resources one just calls: msgUnlock(E_noMem); -or- msgUnlockString(p); Method #2 of the msg retriever also supports that two or more languages are packed together into the same binary msg catalogue. When the msg retriever scans the msg catalogue the first time, it identifies the most fitting language. Because of lack of other source, I'll choosed the country ID as my language ID, however, a language ID is just a number and the only thing to change is the function that creates this number from the system resources. One thing not mentioned so far is that msg strings are viewable with certain codepages only. So besides the language ID MSGLIB associates a codepage ID with each msg string, which must match strickly. (Not included with any released version, yet.) The msg catalogues themselves have been designed to be useable for other stuff as well, just like the resource block in S Windows. The msg catalogue consists of a series of individual chunks, which are linked together so that one program need not know the internal format of each chunk, but can concentrate to pick the ones necessary for itself. The basic idea was to split msg strings into two categories: local and global. So often used msg strings are declared as "global" and are automatically available to all programs using MSGLIB without to write them down into the local msg catalogue. So the programmer can concentrate on just the program. The other three parts of MSGLIB are described in the included documentation (Postscript file). Within this doc there is a list of all functions and a very short description of them. (I really can't remember why the msg retriever is not described there?) I haven't received much response about MSGLIB, yet, so don't feel guilty if you don't like MSGLIB; however, it would be a pity if the missing/bad documentation or a "can't get this stuff to work" notion prevent an examination or critic. Why is MSGLIB currently postponed? I was struck by the problem that the command line passed to the message compiler became too short, in fact the last time I shortend the filenames to bare minimum to keep the compiler working. So I decided to make a command line and configuration file parser that takes away this ever-annoying problem from me. It is currently found in the pre-release of SUPPL (all that cfg*.c files); though it stagnates a bit, too, because I'm currently making all the missing docs besides comments in the source code for SUPPL (er, a thing most programmers really hate I've learnt in the past). Examples of programs using MSGLIB31 are SWSUBST and ASSIGN. _____________________________________________________________________ B. LANGUAGE/COUNTRY CODES Steffen Kaiser writes this about language/country codes: Defined in the context of internet/WWW/HTML are: * ISO-639 defines 2-letter codes for languages, though, there are not many. These codes are also used by the locale implementation of most traditional Unix systems. (http://jargo.itim.mi.cnr.it/documentazione/iso639_codes.html) * ISO-3166 describes country codes, 2-letter, 3-letter and numerical format. (http://jargo.itim.mi.cnr.it/documentazione/iso3166_codes.html) * For bibliographic purpose, a 3-letter code has been invented, I came across these ones a while back, but cannot remember where; I guess it was in the description of the computer index of some large library. But I found this ones: http://www.sil.org/ethnologue/names/ This index is sort of reversed, you need to know the correct and full name of the language and this index maps it to a 3-letter code; yeah, interessting how many officially registered different variants of "German" exist. The bibliographic codes are invented to be complete and even differ among (wider known) dialects of a language. There exists two "officially" used ways to express a language, one derives the code from the English word fo this language (what can be expressed with 7-bit ASCII symbols then) and one derived from the word used by language itself and then transcripted into Latin characters on order to be expressable with 7-bit ASCII. The "Cats" library doesn't really assume anything about the language code, except that it must be three or fewer letters. (In short, this is because the message catalog is located by using the NLSPATH and LANG environment variables. The LANG environment variable defines the language code, which is implemented as the file extension for the message catalog.) _____________________________________________________________________ FURTHER READING: Useful Links regarding software internationalization - http://dmoz.org/Computers/Software/Globalization/Internationalization/ - The Open Directory Project (at the dmoz site), whose goal is to produce the most comprehensive directory of the web, by relying on a vast army of volunteer editors. - http://www.lib.ox.ac.uk/internet/news/faq/by_category.internationalization.html - Oxford's I18n FAQ - http://www.unicode.org/ - UNICODE - http://www.w3.org/International/ - World Wide Web consortium: Non-western Character sets, Languages, and Writing Systems - http://anubis.dkuug.dk/maits/i18n - Standards for Internationalization - http://www.microsoft.com/globaldev - Microsoft's Software Globalization Information - http://www.microsoft.com/win32dev/uiguide/uigui445.htm - Microsoft's I18n Guidelines - http://gatekeeper.dec.com/pub/DEC/DECinfo/DTJ/v5n3/THE_XOPEN_INTERNATIONALIZATIO_01jan1994DTJB03SC.txt - THE X/OPEN INTERNATIONALIZATION MODEL - http://cns-web.bu.edu/pub/djohnson/web_files/i18n/i18n.html - Concepts of C/UNIX Internationalization (paper by Dave Johnson, Boston University) - http://java.sun.com:80/products/jdk/1.1/docs/guide/intl/index.html - Java I18n info from Sun - http://www.ibm.com/java/education/globalapps/ - IBM's Java Global Application Guide - http://www.ibm.com/java/education/international-text/index.html - More from IBM on internationalized text in Java 1.2 - http://www.digital.com/info/DTJB00/ - Digital Equipment technical paper on I18n (Feb. 1994) - http://www.cis.ohio-state.edu/hypertext/faq/usenet/internationalization/iso-8859-1-charset/faq.html - ISO 8859-1 National Character Set FAQ - http://www.stri.is/TC304/EURO/default.html - Information and Communication Technologies European Localization Requirements - http://www.gnu.org/manual/glibc-2.0.6/html_chapter/libc_19.html - GNU manual on setting the locale - http://www.gnu.org/manual/gettext-0.10.35/text/gettext.txt and http://www.gnu.org/manual/gettext/html_node/gettext_44.html - GNU manual about 'gettext' (GNU's library for retrieving localized strings from a message catalog.) Lots of info about internationalication and localization (including a discussion of what the two mean). - http://www.cs.ruu.nl/wais/html/na-dir/internationalization/programming-faq.html - Introduction to i18n programming. A MUST-READ! - http://www.leb.net/archives/reader/csi/0039.html - Programming for Internationalization FAQ - http://www.iso.ch/cate/cat.html - ISO Standards - http://www.czyborra.com/ - Unicode on Unix - http://www.hut.fi/u/jkorpela/chars.html - A tutorial on character code issues - http://www.w3.org/MarkUp/html-spec/charset-harmful.html - Character sets considered harmful - http://www.iro.umontreal.ca/~pinard/recode/ - GNU recode _____________________________________________________________________ HISTORY Version 1.0 was the FreeDOS International Support Mini-HOWTO, but I have changed this document to be more specific to the "Cats" library. I will include the other information at the end as a "Other Sources of Information."