author | Bruno Haible <bruno@clisp.org> | 2001-04-19 18:37:49 +0000 |
---|---|---|
committer | Bruno Haible <bruno@clisp.org> | 2001-04-19 18:37:49 +0000 |
commit | 41ae7394123b8b0b13eb37ff1e9ed325374bbeba (patch) | |
tree | 264af71280c09f69da43c6ad6ced7ba6633af09d /doc/gettext_9.html | |
parent | 6ae7c1b1ba9f4f3aa1d78f121bc7ded6692d5f46 (diff) | |
Automatically generated from gettext.texi.
Diffstat (limited to 'doc/gettext_9.html')
-rw-r--r-- | doc/gettext_9.html | 1410 |
1 files changed, 1410 insertions, 0 deletions
diff --git a/doc/gettext_9.html b/doc/gettext_9.html new file mode 100644 index 0000000..00592a1 --- /dev/null +++ b/doc/gettext_9.html @@ -0,0 +1,1410 @@ +<HTML> +<HEAD> +<!-- This HTML file has been created by texi2html 1.51 + from gettext.texi on 19 April 2001 --> + +<TITLE>GNU gettext utilities - 9 The Programmer's View</TITLE> +</HEAD> +<BODY> +Go to the <A HREF="gettext_1.html">first</A>, <A HREF="gettext_8.html">previous</A>, <A HREF="gettext_10.html">next</A>, <A HREF="gettext_14.html">last</A> section, <A HREF="gettext_toc.html">table of contents</A>. +<P><HR><P> + + +<H1><A NAME="SEC41" HREF="gettext_toc.html#TOC41">9 The Programmer's View</A></H1> + +<P> +One aim of the current message catalog implementation provided by +GNU <CODE>gettext</CODE> was to use the systems message catalog handling, if the +installer wishes to do so. So we perhaps should first take a look at +the solutions we know about. The people in the POSIX committee did not +manage to agree on one of the semi-official standards which we'll +describe below. In fact they couldn't agree on anything, so they decided +only to include an example of an interface. The major Unix vendors +are split in the usage of the two most important specifications: X/Open's +catgets vs. Uniforum's gettext interface. We'll describe them both and +later explain our solution of this dilemma. + +</P> + + + +<H2><A NAME="SEC42" HREF="gettext_toc.html#TOC42">9.1 About <CODE>catgets</CODE></A></H2> + +<P> +The <CODE>catgets</CODE> implementation is defined in the X/Open Portability +Guide, Volume 3, XSI Supplementary Definitions, Chapter 5. But the +process of creating this standard seemed to be too slow for some of +the Unix vendors so they created their implementations on preliminary +versions of the standard. Of course this leads again to problems while +writing platform independent programs: even the usage of <CODE>catgets</CODE> +does not guarantee a unique interface. 
+ +</P> +<P> +Another, personal comment on this is that only a bunch of committee members +could have made this interface. They never really tried to program +using this interface. It is a fast, memory-saving implementation, a +user can happily live with it. But programmers hate it (at least me and +some others do...) + +</P> +<P> +But we must not forget one point: after all the trouble with transferring +the rights on Unix(tm) they at last came to X/Open, the very same who +published this specification. This leads me to making the prediction +that this interface will be in future Unix standards (e.g. Spec1170) and +therefore part of all Unix implementations (implementations which are +<EM>allowed</EM> to wear this name). + +</P> + + + +<H3><A NAME="SEC43" HREF="gettext_toc.html#TOC43">9.1.1 The Interface</A></H3> + +<P> +The interface to the <CODE>catgets</CODE> implementation consists of three +functions which correspond to those used in file access: <CODE>catopen</CODE> +to open the catalog for using, <CODE>catgets</CODE> for accessing the message +tables, and <CODE>catclose</CODE> for closing after work is done. Prototypes +for the functions and the needed definitions are in the +<CODE><nl_types.h></CODE> header file. + +</P> +<P> +<CODE>catopen</CODE> is used like this: + +</P> + +<PRE> +nl_catd catd = catopen ("catalog_name", 0); +</PRE> + +<P> +The function takes as its first argument the name of the catalog. This usually +refers to the name of the program or the package. The second parameter +is not further specified in the standard. I don't even know whether it +is implemented consistently among various systems. So the common advice +is to use <CODE>0</CODE> as the value. The return value is a handle to the +message catalog, equivalent to the file handles returned by <CODE>open</CODE>. 
+ +</P> +<P> +This handle is of course used in the <CODE>catgets</CODE> function which can +be used like this: + +</P> + +<PRE> +char *translation = catgets (catd, set_no, msg_id, "original string"); +</PRE> + +<P> +The first parameter is this catalog descriptor. The second parameter +specifies the set of messages in this catalog, in which the message +described by <CODE>msg_id</CODE> is looked up. <CODE>catgets</CODE> therefore uses a +three-stage addressing: + +</P> + +<PRE> +catalog name => set number => message ID => translation +</PRE> + +<P> +The fourth argument is not used to address the translation. It is given +as a default value in case one of the addressing stages fails. One +important thing to remember is that although the return type of catgets +is <CODE>char *</CODE> the resulting string <EM>must not</EM> be changed. It +would better be <CODE>const char *</CODE>, but the standard was published in +1988, one year before ANSI C. + +</P> +<P> +The last of these functions is used and behaves as expected: + +</P> + +<PRE> +catclose (catd); +</PRE> + +<P> +After this no <CODE>catgets</CODE> call using the descriptor is legal anymore. + +</P> + + +<H3><A NAME="SEC44" HREF="gettext_toc.html#TOC44">9.1.2 Problems with the <CODE>catgets</CODE> Interface?!</A></H3> + +<P> +Now that this description seemed to be really easy -- where are the +problems we speak of? In fact the interface could be used in a +reasonable way, but constructing the message catalogs is a pain. The +reason for this lies in the third argument of <CODE>catgets</CODE>: the unique +message ID. This has to be a unique numeric value for each message in a single +set. Perhaps you could imagine the problems keeping such a list while +changing the source code. Add a new message here, remove one there. Of +course a lot of tools have been developed to help organize this +chaos but each of them fails in one aspect or another. 
We don't +want to say that the other approach has no problems but they are far +easier to manage. + +</P> + + +<H2><A NAME="SEC45" HREF="gettext_toc.html#TOC45">9.2 About <CODE>gettext</CODE></A></H2> + +<P> +The definition of the <CODE>gettext</CODE> interface comes from a Uniforum +proposal and it is followed by at least one major Unix vendor +(Sun) in its latest developments. It is not specified in any official +standard, though. + +</P> +<P> +The main points about this solution are that it does not follow the +method of normal file handling (open-use-close) and that it does not +burden the programmer with so many tasks, especially the unique key handling. +Of course a unique key is needed here too, but this key is the message +itself (however long or short it is). See section <A HREF="gettext_9.html#SEC53">9.3 Comparing the Two Interfaces</A> for a more +detailed comparison of the two methods. + +</P> +<P> +The following section contains a rather detailed description of the +interface. We make it that detailed because this is the interface +we chose for the GNU <CODE>gettext</CODE> Library. Programmers interested +in using this library will be interested in this description. + +</P> + + + +<H3><A NAME="SEC46" HREF="gettext_toc.html#TOC46">9.2.1 The Interface</A></H3> + +<P> +The minimal functionality an interface must have is a) to select a +domain the strings are coming from (a single domain for all programs is +not reasonable because its construction and maintenance is difficult, +perhaps impossible) and b) to access a string in a selected domain. + +</P> +<P> +This is principally the description of the <CODE>gettext</CODE> interface. It +has a global domain which unqualified usages reference. Of course this +domain is selectable by the user. + +</P> + +<PRE> +char *textdomain (const char *domain_name); +</PRE> + +<P> +This provides the possibility to change or query the current status of +the current global domain of the <CODE>LC_MESSAGES</CODE> category. 
The +argument is a null-terminated string, whose characters must be legal for +use in filenames. If the <VAR>domain_name</VAR> argument is <CODE>NULL</CODE>, +the function returns the current value. If no value has been set +before, the name of the default domain is returned: <EM>messages</EM>. +Please note that although the return value of <CODE>textdomain</CODE> is of +type <CODE>char *</CODE> the string must not be modified. It is also important to know +that no check is made that the domain is available. If it is not +available you will notice this because no translations are provided. + +</P> +<P> +To use a domain set by <CODE>textdomain</CODE> the function + +</P> + +<PRE> +char *gettext (const char *msgid); +</PRE> + +<P> +is to be used. This is the simplest reasonable form one can imagine. +The translation of the string <VAR>msgid</VAR> is returned if it is available +in the current domain. If not available the argument itself is +returned. If the argument is <CODE>NULL</CODE> the result is undefined. + +</P> +<P> +One thing which should come to mind is that no explicit dependency on +the domain used is given. The current value of the domain for the +<CODE>LC_MESSAGES</CODE> locale is used. If this changes between two +executions of the same <CODE>gettext</CODE> call in the program, the two calls +reference different message catalogs. + +</P> +<P> +For the easiest case, which is normally used in internationalized +packages, once at the beginning of execution a call to <CODE>textdomain</CODE> +is issued, setting the domain to a unique name, normally the package +name. In the following code all strings which have to be translated are +filtered through the gettext function. That's all, the package speaks +your language. + +</P> + + +<H3><A NAME="SEC47" HREF="gettext_toc.html#TOC47">9.2.2 Solving Ambiguities</A></H3> + +<P> +While this single name domain works well for most applications there +might be the need to get translations from more than one domain. 
Of +course one could switch between different domains with calls to +<CODE>textdomain</CODE>, but this is neither convenient nor fast. A +possible situation could be one case subject to discussion during this +writing: all +error messages of functions in the set of commonly used functions should +go into a separate domain <CODE>error</CODE>. By this means we would only need +to translate them once. +Another case is messages from a library, as these <EM>have</EM> to be +independent of the current domain set by the application. + +</P> +<P> +For these reasons there are two more functions to retrieve strings: + +</P> + +<PRE> +char *dgettext (const char *domain_name, const char *msgid); +char *dcgettext (const char *domain_name, const char *msgid, + int category); +</PRE> + +<P> +Both take an additional first argument, which corresponds +to the argument of <CODE>textdomain</CODE>. The third argument of +<CODE>dcgettext</CODE> allows the use of a locale category other than <CODE>LC_MESSAGES</CODE>. +But I really don't know where this can be useful. If the +<VAR>domain_name</VAR> is <CODE>NULL</CODE> or <VAR>category</VAR> has a value other than +the known ones, the result is undefined. It should also be noted that +this function is not part of the second known implementation of this +function family, the one found in Solaris. + +</P> +<P> +A second ambiguity can arise from the fact that perhaps more than one +domain has the same name. This can be solved by specifying where the +needed message catalog files can be found. + +</P> + +<PRE> +char *bindtextdomain (const char *domain_name, + const char *dir_name); +</PRE> + +<P> +Calling this function binds the given domain to a file in the specified +directory (how this file is determined follows below). In particular, a +file in the system's default place is no longer preferred over the specified +file (as it would be by solely using <CODE>textdomain</CODE>). 
A +<CODE>NULL</CODE> pointer for the <VAR>dir_name</VAR> parameter returns the binding +associated with <VAR>domain_name</VAR>. If <VAR>domain_name</VAR> itself is +<CODE>NULL</CODE> nothing happens and a <CODE>NULL</CODE> pointer is returned. Here +again, as for all the other functions, the returned +string must not be modified! + +</P> +<P> +It is important to remember that relative path names for the +<VAR>dir_name</VAR> parameter can be trouble. Since the path is always +computed relative to the current directory different results will be +achieved when the program calls <CODE>chdir</CODE>. Relative +paths should always be avoided to prevent such dependencies and +unreliable behavior. + +</P> + + +<H3><A NAME="SEC48" HREF="gettext_toc.html#TOC48">9.2.3 Locating Message Catalog Files</A></H3> + +<P> +Because many different languages for many different packages have to be +stored we need some way to attach this information to the message catalog +files. The way usually used in Unix environments is to encode it +in the file name. This is also done here. The directory name given in +<CODE>bindtextdomain</CODE>'s second argument (or the default directory), +followed by the value and name of the locale and the domain name are +concatenated: + +</P> + +<PRE> +<VAR>dir_name</VAR>/<VAR>locale</VAR>/LC_<VAR>category</VAR>/<VAR>domain_name</VAR>.mo +</PRE> + +<P> +The default value for <VAR>dir_name</VAR> is system specific. For the GNU +library, and for packages adhering to its conventions, it's: + +<PRE> +/usr/local/share/locale +</PRE> + +<P> +<VAR>locale</VAR> is the value of the locale whose name is this +<CODE>LC_<VAR>category</VAR></CODE>. For <CODE>gettext</CODE> and <CODE>dgettext</CODE> this +<CODE>LC_<VAR>category</VAR></CODE> is always <CODE>LC_MESSAGES</CODE>.<A NAME="DOCF3" HREF="gettext_foot.html#FOOT3">(3)</A> +The value of the locale is determined through +<CODE>setlocale (LC_<VAR>category</VAR>, NULL)</CODE>. 
+<A NAME="DOCF4" HREF="gettext_foot.html#FOOT4">(4)</A> +<CODE>dcgettext</CODE> specifies the locale category by the third argument. + +</P> + + +<H3><A NAME="SEC49" HREF="gettext_toc.html#TOC49">9.2.4 How to specify the output character set <CODE>gettext</CODE> uses</A></H3> + +<P> +<CODE>gettext</CODE> not only looks up a translation in a message catalog. It +also converts the translation on the fly to the desired output character +set. This is useful if the user is working in a different character set +than the translator who created the message catalog, because it avoids +distributing variants of message catalogs which differ only in the +character set. + +</P> +<P> +The output character set is, by default, the value of <CODE>nl_langinfo +(CODESET)</CODE>, which depends on the <CODE>LC_CTYPE</CODE> part of the current +locale. But programs which store strings in a locale independent way +(e.g. UTF-8) can request that <CODE>gettext</CODE> and related functions +return the translations in that encoding, by use of the +<CODE>bind_textdomain_codeset</CODE> function. + +</P> +<P> +Note that the <VAR>msgid</VAR> argument to <CODE>gettext</CODE> is not subject to +character set conversion. Also, when <CODE>gettext</CODE> does not find a +translation for <VAR>msgid</VAR>, it returns <VAR>msgid</VAR> unchanged -- +independently of the current output character set. It is therefore +recommended that all <VAR>msgid</VAR>s be US-ASCII strings. + +</P> +<P> +<DL> +<DT><U>Function:</U> char * <B>bind_textdomain_codeset</B> <I>(const char *<VAR>domainname</VAR>, const char *<VAR>codeset</VAR>)</I> +<DD><A NAME="IDX1"></A> +The <CODE>bind_textdomain_codeset</CODE> function can be used to specify the +output character set for message catalogs for domain <VAR>domainname</VAR>. +The <VAR>codeset</VAR> argument must be a valid codeset name which can be used +for the <CODE>iconv_open</CODE> function, or a null pointer. 
+ +</P> +<P> +If the <VAR>codeset</VAR> parameter is the null pointer, +<CODE>bind_textdomain_codeset</CODE> returns the currently selected codeset +for the domain with the name <VAR>domainname</VAR>. It returns <CODE>NULL</CODE> if +no codeset has yet been selected. + +</P> +<P> +The <CODE>bind_textdomain_codeset</CODE> function can be used several times. +If used multiple times with the same <VAR>domainname</VAR> argument, the +later call overrides the settings made by the earlier one. + +</P> +<P> +The <CODE>bind_textdomain_codeset</CODE> function returns a pointer to a +string containing the name of the selected codeset. The string is +allocated internally in the function and must not be changed by the +user. If the system ran out of memory during the execution of +<CODE>bind_textdomain_codeset</CODE>, the return value is <CODE>NULL</CODE> and the +global variable <VAR>errno</VAR> is set accordingly. +</DL> + +</P> + + +<H3><A NAME="SEC50" HREF="gettext_toc.html#TOC50">9.2.5 Additional functions for plural forms</A></H3> + +<P> +The functions of the <CODE>gettext</CODE> family described so far (and all the +<CODE>catgets</CODE> functions as well) have one problem in the real world +which has been neglected completely in all existing approaches. What +is meant here is the handling of plural forms. + +</P> +<P> +Looking through Unix source code written before anybody thought about +internationalization (and, sadly, even afterwards) one can often find +code similar to the following: + +</P> + +<PRE> + printf ("%d file%s deleted", n, n == 1 ? "" : "s"); +</PRE> + +<P> +After the first complaints from people internationalizing the code people +either completely avoided formulations like this or used strings like +<CODE>"file(s)"</CODE>. Both look unnatural and should be avoided. 
First +attempts to solve the problem correctly looked like this: + +</P> + +<PRE> + if (n == 1) + printf ("%d file deleted", n); + else + printf ("%d files deleted", n); +</PRE> + +<P> +But this does not solve the problem. It helps languages where the +plural form of a noun is not simply constructed by adding an `s' but +that is all. Once again people fell into the trap of believing the +rules their language is using are universal. But the handling of plural +forms differs widely between the language families. For example, +Rafal Maszkowski <CODE><rzm@mat.uni.torun.pl></CODE> reports: + +</P> + +<BLOCKQUOTE> +<P> +In Polish we use e.g. plik (file) this way: + +<PRE> +1 plik +2,3,4 pliki +5-21 pliko'w +22-24 pliki +25-31 pliko'w +</PRE> + +<P> +and so on (o' means 8859-2 oacute which should be rather okreska, +similar to aogonek). +</BLOCKQUOTE> + +<P> +There are two things which can differ between languages (and even inside +language families): + +</P> + +<UL> +<LI> + +The way plural forms are built differs. This is a problem with +languages which have many irregularities. German, for instance, is a +drastic case. Though English and German are part of the same language +family (Germanic), the almost regular forming of plural noun forms +(appending an `s') is hardly found in German. + +<LI> + +The number of plural forms differs. This is somewhat surprising for +those who only have experience with Romanic and Germanic languages +since here the number is the same (there are two). + +But other language families have only one form or many forms. More +information on this in an extra section. +</UL> + +<P> +The consequence of this is that application writers should not try to +solve the problem in their code. This would be localization since it is +only usable for certain, hardcoded language environments. Instead the +extended <CODE>gettext</CODE> interface should be used. 
+ +</P> +<P> +These extra functions take two strings and a numerical argument instead +of the one key string. The idea behind this is that using +the numerical argument and the first string as a key, the implementation +can select the right plural form using rules specified by the +translator. The two string arguments then will be used to provide a return +value in case no message catalog is found (similar to the normal +<CODE>gettext</CODE> behavior). In this case the rules for Germanic languages +are used and it is assumed that the first string argument is the singular +form, the second the plural form. + +</P> +<P> +This has the consequence that programs without language catalogs can +display the correct strings only if the program itself is written using +a Germanic language. This is a limitation but since the GNU C library +(as well as the GNU <CODE>gettext</CODE> package) is written as part of the +GNU package and the coding standards for the GNU project require programs +to be written in English, this solution nevertheless fulfills its +purpose. + +</P> +<P> +<DL> +<DT><U>Function:</U> char * <B>ngettext</B> <I>(const char *<VAR>msgid1</VAR>, const char *<VAR>msgid2</VAR>, unsigned long int <VAR>n</VAR>)</I> +<DD><A NAME="IDX2"></A> +The <CODE>ngettext</CODE> function is similar to the <CODE>gettext</CODE> function +as it finds the message catalogs in the same way. But it takes two +extra arguments. The <VAR>msgid1</VAR> parameter must contain the singular +form of the string to be converted. It is also used as the key for the +search in the catalog. The <VAR>msgid2</VAR> parameter is the plural form. +The parameter <VAR>n</VAR> is used to determine the plural form. If no +message catalog is found <VAR>msgid1</VAR> is returned if <CODE>n == 1</CODE>, +otherwise <CODE>msgid2</CODE>. 
+ +</P> +<P> +An example of the use of this function is: + +</P> + +<PRE> +printf (ngettext ("%d file removed", "%d files removed", n), n); +</PRE> + +<P> +Please note that the numeric value <VAR>n</VAR> has to be passed to the +<CODE>printf</CODE> function as well. It is not sufficient to pass it only to +<CODE>ngettext</CODE>. +</DL> + +</P> +<P> +<DL> +<DT><U>Function:</U> char * <B>dngettext</B> <I>(const char *<VAR>domain</VAR>, const char *<VAR>msgid1</VAR>, const char *<VAR>msgid2</VAR>, unsigned long int <VAR>n</VAR>)</I> +<DD><A NAME="IDX3"></A> +The <CODE>dngettext</CODE> function is similar to the <CODE>dgettext</CODE> function in the +way the message catalog is selected. The difference is that it takes +two extra parameters to provide the correct plural form. These two +parameters are handled in the same way <CODE>ngettext</CODE> handles them. +</DL> + +</P> +<P> +<DL> +<DT><U>Function:</U> char * <B>dcngettext</B> <I>(const char *<VAR>domain</VAR>, const char *<VAR>msgid1</VAR>, const char *<VAR>msgid2</VAR>, unsigned long int <VAR>n</VAR>, int <VAR>category</VAR>)</I> +<DD><A NAME="IDX4"></A> +The <CODE>dcngettext</CODE> function is similar to the <CODE>dcgettext</CODE> function in the +way the message catalog is selected. The difference is that it takes +two extra parameters to provide the correct plural form. These two +parameters are handled in the same way <CODE>ngettext</CODE> handles them. +</DL> + +</P> +<P> +Now, how do these functions solve the problem of the plural forms? +Without the input of linguists (which was not available) it was not +possible to determine whether there are only a few different ways in +which plural forms are formed or whether the number can increase with +every new supported language. + +</P> +<P> +Therefore the solution implemented is to allow the translator to specify +the rules of how to select the plural form. 
Since the formula varies +with every language this is the only viable solution except for +hardcoding the information in the code (which still would require the +possibility of extensions to not prevent the use of new languages). + +</P> +<P> +The information about the plural form selection has to be stored in the +header entry of the PO file (the one with the empty <CODE>msgid</CODE> string). +The plural form information looks like this: + +</P> + +<PRE> +Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1; +</PRE> + +<P> +The <CODE>nplurals</CODE> value must be a decimal number which specifies how +many different plural forms exist for this language. The string +following <CODE>plural</CODE> is an expression which uses the C language +syntax. Exceptions are that no negative numbers are allowed, numbers +must be decimal, and the only variable allowed is <CODE>n</CODE>. This +expression will be evaluated whenever one of the functions +<CODE>ngettext</CODE>, <CODE>dngettext</CODE>, or <CODE>dcngettext</CODE> is called. The +numeric value passed to these functions is then substituted for all uses +of the variable <CODE>n</CODE> in the expression. The resulting value then +must be greater than or equal to zero and smaller than the value given as the +value of <CODE>nplurals</CODE>. + +</P> +<P> +The following rules are known at this point. The languages are listed +with their families. But this does not necessarily mean the information can be +generalized for the whole family (as can be easily seen in the table +below).<A NAME="DOCF5" HREF="gettext_foot.html#FOOT5">(5)</A> + +</P> +<DL COMPACT> + +<DT>Only one form: +<DD> +Some languages only require one single form. There is no distinction +between the singular and plural form. 
An appropriate header entry +would look like this: + + +<PRE> +Plural-Forms: nplurals=1; plural=0; +</PRE> + +Languages with this property include: + +<DL COMPACT> + +<DT>Finno-Ugric family +<DD> +Hungarian +<DT>Asian family +<DD> +Japanese +<DT>Turkic/Altaic family +<DD> +Turkish +</DL> + +<DT>Two forms, singular used for one only +<DD> +This is the form used in most existing programs since it is what English +uses. A header entry would look like this: + + +<PRE> +Plural-Forms: nplurals=2; plural=n != 1; +</PRE> + +(Note: this uses the feature of C expressions that boolean expressions +have the value zero or one.) + +Languages with this property include: + +<DL COMPACT> + +<DT>Germanic family +<DD> +Danish, Dutch, English, German, Norwegian, Swedish +<DT>Finno-Ugric family +<DD> +Estonian, Finnish +<DT>Latin/Greek family +<DD> +Greek +<DT>Semitic family +<DD> +Hebrew +<DT>Romanic family +<DD> +Italian, Spanish +<DT>Artificial +<DD> +Esperanto +</DL> + +<DT>Two forms, singular used for zero and one +<DD> +Exceptional case in the language family. The header entry would be: + + +<PRE> +Plural-Forms: nplurals=2; plural=n>1; +</PRE> + +Languages with this property include: + +<DL COMPACT> + +<DT>Romanic family +<DD> +French +</DL> + +<DT>Three forms, special cases for one and two +<DD> +The header entry would be: + + +<PRE> +Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2; +</PRE> + +Languages with this property include: + +<DL COMPACT> + +<DT>Celtic +<DD> +Gaeilge +</DL> + +<DT>Three forms, special case for numbers ending in 1[2-9] +<DD> +The header entry would look like this: + + +<PRE> +Plural-Forms: nplurals=3; \ + plural=n%10==1 && n%100!=11 ? 0 : \ + n%10>=2 && (n%100<10 || n%100>=20) ? 
1 : 2; +</PRE> + +Languages with this property include: + +<DL COMPACT> + +<DT>Baltic family +<DD> +Lithuanian +</DL> + +<DT>Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4] +<DD> +The header entry would look like this: + + +<PRE> +Plural-Forms: nplurals=3; \ + plural=n%10==1 && n%100!=11 ? 0 : \ + n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2; +</PRE> + +Languages with this property include: + +<DL COMPACT> + +<DT>Slavic family +<DD> +Czech, Russian, Slovak, Ukrainian +</DL> + +<DT>Three forms, special case for one and some numbers ending in 2, 3, or 4 +<DD> +The header entry would look like this: + + +<PRE> +Plural-Forms: nplurals=3; \ + plural=n==1 ? 0 : \ + n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2; +</PRE> + +(Continuation in the next line is possible.) + +Languages with this property include: + +<DL COMPACT> + +<DT>Slavic family +<DD> +Polish +</DL> + +<DT>Four forms, special case for one and all numbers ending in 02, 03, or 04 +<DD> +The header entry would look like this: + + +<PRE> +Plural-Forms: nplurals=4; \ + plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3; +</PRE> + +Languages with this property include: + +<DL COMPACT> + +<DT>Slavic family +<DD> +Slovenian +</DL> +</DL> + + + +<H3><A NAME="SEC51" HREF="gettext_toc.html#TOC51">9.2.6 How to use <CODE>gettext</CODE> in GUI programs</A></H3> + +<P> +One place where the <CODE>gettext</CODE> functions, if used normally, have big +problems is within programs with graphical user interfaces (GUIs). The +problem is that many of the strings which have to be translated are very +short. They have to appear in pull-down menus which restricts the +length. But strings which are not containing entire sentences or at +least large fragments of a sentence may appear in more than one +situation in the program but might have different translations. This is +especially true for the one-word strings which are frequently used in +GUI programs. 
+ +</P> +<P> +As a consequence many people say that the <CODE>gettext</CODE> approach is +wrong and instead <CODE>catgets</CODE> should be used which indeed does not +have this problem. But there is a very simple and powerful method to +handle this kind of problem with the <CODE>gettext</CODE> functions. + +</P> +<P> +As an example consider the following fictional situation. A GUI program +has a menu bar with the following entries: + +</P> + +<PRE> ++------------+------------+--------------------------------------+ +| File | Printer | | ++------------+------------+--------------------------------------+ +| Open | | Select | +| New | | Open | ++----------+ | Connect | + +----------+ +</PRE> + +<P> +To have the strings <CODE>File</CODE>, <CODE>Printer</CODE>, <CODE>Open</CODE>, +<CODE>New</CODE>, <CODE>Select</CODE>, and <CODE>Connect</CODE> translated there has to be +at some point in the code a call to a function of the <CODE>gettext</CODE> +family. But in two places the string passed into the function would be +<CODE>Open</CODE>. The translations might not be the same and therefore we +are in the dilemma described above. + +</P> +<P> +One solution to this problem is to artificially enlengthen the strings +to make them unambiguous. But what would the program do if no +translation is available? The enlengthened string is not what should be +printed. So we should use a slightly modified version of the functions. + +</P> +<P> +To enlengthen the strings a uniform method should be used. 
E.g., in the +example above the strings could be chosen as + +</P> + +<PRE> +Menu|File +Menu|Printer +Menu|File|Open +Menu|File|New +Menu|Printer|Select +Menu|Printer|Open +Menu|Printer|Connect +</PRE> + +<P> +Now all the strings are different and if now instead of <CODE>gettext</CODE> +the following little wrapper function is used, everything works just +fine: + +</P> +<P> +<A NAME="IDX5"></A> + +<PRE> + char * + sgettext (const char *msgid) + { + char *msgval = gettext (msgid); + if (msgval == msgid) + msgval = strrchr (msgid, '|') + 1; + return msgval; + } +</PRE> + +<P> +What this little function does is to recognize the case when no +translation is available. This can be done very efficiently by a +pointer comparison since the return value is the input value. If there +is no translation we know that the input string is in the format we used +for the Menu entries and therefore contains a <CODE>|</CODE> character. We +simply search for the last occurrence of this character and return a +pointer to the character following it. That's it! + +</P> +<P> +If one now consistently uses the enlengthened string form and replaces +the <CODE>gettext</CODE> calls with calls to <CODE>sgettext</CODE> (this is normally +limited to very few places in the GUI implementation) then it is +possible to produce a program which can be internationalized. + +</P> +<P> +The other <CODE>gettext</CODE> functions (<CODE>dgettext</CODE>, <CODE>dcgettext</CODE> +and the <CODE>ngettext</CODE> equivalents) can and should have corresponding +functions as well which look almost identical, except for the parameters +and the call to the underlying function. + +</P> +<P> +Now there is of course the question why such functions do not exist in +the GNU gettext package? There are two parts of the answer to this question. + +</P> + +<UL> +<LI> + +They are easy to write and therefore can be provided by the project they +are used in. 
This is not an answer by itself and must be seen together +with the second part which is: + +<LI> + +There is no way the gettext package can contain a version which can work +everywhere. The problem is the selection of the character to separate +the prefix from the actual string in the lengthened string. The +examples above used <CODE>|</CODE> which is quite a good choice because it +resembles a notation frequently used in this context and it also is a +character not often used in message strings. + +But what if the character is used in message strings? Or what if the chosen +character is not available in the character set of the machine on which one +compiles? (E.g., <CODE>|</CODE> is not required to exist for ISO C; this is +why the <TT>`iso646.h'</TT> file exists in ISO C programming environments.) +</UL> + +<P> +There is only one more comment to be made. The wrapper function above +requires that the translation strings are not lengthened themselves. +This is only logical. There is no need to disambiguate the strings +(since they are never used as keys for a search) and one also saves +quite some memory and disk space by doing this. + +</P> + + +<H3><A NAME="SEC52" HREF="gettext_toc.html#TOC52">9.2.7 Optimization of the *gettext functions</A></H3> + +<P> +At this point of the discussion we should talk about an advantage of the +GNU <CODE>gettext</CODE> implementation. Some readers might have pointed out +that an internationalized program might perform poorly if some +string has to be translated in an inner loop. While this is unavoidable +when the string varies from one run of the loop to the other it is +simply a waste of time when the string is always the same. Take the +following example: + +</P> + +<PRE> +{ + while (...) + { + puts (gettext ("Hello world")); + } +} +</PRE> + +<P> +When the locale selection does not change between two runs the resulting +string is always the same. 
One way to exploit this is: + +</P> + +<PRE> +{ + const char *str = gettext ("Hello world"); + while (...) + { + puts (str); + } +} +</PRE> + +<P> +But this solution is not usable in all situations (e.g. when the locale +selection changes) nor does it lead to legible code. + +</P> +<P> +For this reason, GNU <CODE>gettext</CODE> caches previous translation results. +When the same translation is requested twice, with no new message +catalogs being loaded in between, <CODE>gettext</CODE> will, the second time, +find the result through a single cache lookup. + +</P> + + +<H2><A NAME="SEC53" HREF="gettext_toc.html#TOC53">9.3 Comparing the Two Interfaces</A></H2> + +<P> +The following discussion is perhaps a little bit colored. As said +above we implemented GNU <CODE>gettext</CODE> following the Uniforum +proposal and this surely has its reasons. But it should show how we +came to this decision. + +</P> +<P> +First we take a look at the development process. When we write an +application using NLS provided by <CODE>gettext</CODE> we proceed as always. +Only when we come to a string which might be seen by the users and thus +has to be translated do we use <CODE>gettext("...")</CODE> instead of +<CODE>"..."</CODE>. At the beginning of each source file (or in a central +header file) we define + +</P> + +<PRE> +#define gettext(String) (String) +</PRE> + +<P> +Even this definition can be avoided when the system supports the +<CODE>gettext</CODE> function in its C library. When we compile this code the +result is the same as if no NLS code were used. When you take a look at +the GNU <CODE>gettext</CODE> code you will see that we use <CODE>_("...")</CODE> +instead of <CODE>gettext("...")</CODE>. This reduces the number of +additional characters per translatable string to <EM>3</EM> (in words: +three). 
+ +</P> +<P> +When a production version of the program is now needed we simply replace +the definition + +</P> + +<PRE> +#define _(String) (String) +</PRE> + +<P> +by + +</P> + +<PRE> +#include <libintl.h> +#define _(String) gettext (String) +</PRE> + +<P> +Additionally we run the program <TT>`xgettext'</TT> on all source code files +which contain translatable strings and that's it: we have a running +program which does not depend on translations being available, but which +can use any that become available. + +</P> +<P> +The same procedure can be followed for the <CODE>gettext_noop</CODE> invocations +(see section <A HREF="gettext_3.html#SEC18">3.5 Special Cases of Translatable Strings</A>). One usually defines <CODE>gettext_noop</CODE> as a +no-op macro. So you should consider the following code for your project: + +</P> + +<PRE> +#define gettext_noop(String) (String) +#define N_(String) gettext_noop (String) +</PRE> + +<P> +<CODE>N_</CODE> is a short form similar to <CODE>_</CODE>. The <TT>`Makefile'</TT> in +the <TT>`po/'</TT> directory of GNU <CODE>gettext</CODE> knows by default both of the +mentioned short forms so you are invited to follow this proposal for +your own ease. + +</P> +<P> +Now to <CODE>catgets</CODE>. The main problem is the work for the +programmer. Every time he comes to a translatable string he has to +define a number (or a symbolic constant) which also has to be defined in +the message catalog file. He also has to take care of duplicate +entries, duplicate message IDs etc. If he wants to have the same +quality in the message catalog as the GNU <CODE>gettext</CODE> program +provides he also has to put the descriptive comments for the strings and +the location in all source code files in the message catalog. This is +nearly a Mission: Impossible. + +</P> +<P> +But there are also some points people might call advantages speaking for +<CODE>catgets</CODE>. 
If you have a single word in a string and this string +is used in different contexts it is likely that in one or the other +language the word has different translations. Example: + +</P> + +<PRE> +printf ("%s: %d", gettext ("number"), number_of_errors); + +printf ("you should see %d %s", number_count, + number_count == 1 ? gettext ("number") : gettext ("numbers")); +</PRE> + +<P> +Here we have to translate the string <CODE>"number"</CODE> twice. Even +if you do not speak a language besides English it might be possible to +recognize that the two words have a different meaning. In German the +first appearance has to be translated to <CODE>"Anzahl"</CODE> and the second +to <CODE>"Zahl"</CODE>. + +</P> +<P> +Now you can say that this example is really esoteric. And you are +right! This is exactly how we felt about this problem and we decided that +it does not weigh that much. The solution for the above problem could +be very easy: + +</P> + +<PRE> +printf ("%s %d", gettext ("number:"), number_of_errors); + +printf (number_count == 1 ? gettext ("you should see %d number") + : gettext ("you should see %d numbers"), + number_count); +</PRE> + +<P> +We believe that we can solve all conflicts with this method. If it is +difficult one can also consider changing one of the conflicting strings a +little bit. But it is not impossible to overcome. + +</P> +<P> +<CODE>catgets</CODE> allows the same original entry to have different translations, +but <CODE>gettext</CODE> has another, scalable approach for solving ambiguities +of this kind: see section <A HREF="gettext_9.html#SEC47">9.2.2 Solving Ambiguities</A>. + +</P> + + +<H2><A NAME="SEC54" HREF="gettext_toc.html#TOC54">9.4 Using libintl.a in own programs</A></H2> + +<P> +Starting with version 0.9.4 the library <CODE>libintl.a</CODE> should be +self-contained. I.e., you can use it in your own programs without +providing additional functions. 
The <TT>`Makefile'</TT> will put the header +and the library in directories selected using <CODE>$(prefix)</CODE>. + +</P> +<P> +One exception to the above is found on HP-UX 10.01 systems. Here the C +library does not contain the <CODE>alloca</CODE> function (and the HP compiler +does not generate it inlined). But we do not intend to rewrite the whole +library just because of this dumb system. Instead include the +<CODE>alloca</CODE> function in all packages in which you use <CODE>libintl.a</CODE>. + +</P> + + +<H2><A NAME="SEC55" HREF="gettext_toc.html#TOC55">9.5 Being a <CODE>gettext</CODE> grok</A></H2> + +<P> +To fully exploit the functionality of the GNU <CODE>gettext</CODE> library it +is surely helpful to read the source code. But for those who don't want +to spend that much time reading the (sometimes complicated) code here +is a list of comments: + +</P> + +<UL> +<LI>Changing the language at runtime + +For interactive programs it might be useful to offer a selection of the +used language at runtime. To understand how to do this one needs to know +how the used language is determined while executing the <CODE>gettext</CODE> +function. The method which is presented here only works correctly +with the GNU implementation of the <CODE>gettext</CODE> functions. + +In the function <CODE>dcgettext</CODE> at every call the current setting of +the highest priority environment variable is determined and used. +Highest priority means here the following list with decreasing +priority: + + +<OL> +<LI><CODE>LANGUAGE</CODE> + +<LI><CODE>LC_ALL</CODE> + +<LI><CODE>LC_xxx</CODE>, according to the selected locale category + +<LI><CODE>LANG</CODE> + +</OL> + +Afterwards the path is constructed using the found value and the +translation file is loaded if available. + +What happens when the value of, say, <CODE>LANGUAGE</CODE> changes? According +to the process explained above the new value of this variable is found +as soon as the <CODE>dcgettext</CODE> function is called. 
But this also means +that a (perhaps) different message catalog file is loaded. In other +words: the used language is changed. + +But there is one little catch. The code for gcc-2.7.0 and up provides +some optimization. This optimization normally prevents the calling of +the <CODE>dcgettext</CODE> function as long as no new catalog is loaded. But +if <CODE>dcgettext</CODE> is not called the program also cannot notice that the +<CODE>LANGUAGE</CODE> variable has changed (see section <A HREF="gettext_9.html#SEC52">9.2.7 Optimization of the *gettext functions</A>). The +solution is very easy. Include the following code in the +language switching function. + + +<PRE> + /* Change language. */ + setenv ("LANGUAGE", "fr", 1); + + /* Make change known. */ + { + extern int _nl_msg_cat_cntr; + ++_nl_msg_cat_cntr; + } +</PRE> + +The variable <CODE>_nl_msg_cat_cntr</CODE> is defined in <TT>`loadmsgcat.c'</TT>. +The programmer will find himself in need of a construct like this only +when developing long-running programs which allow the user to +select the language at runtime. Non-interactive programs (like all +these little Unix tools) should never need this. + +</UL> + + + +<H2><A NAME="SEC56" HREF="gettext_toc.html#TOC56">9.6 Temporary Notes for the Programmers Chapter</A></H2> + + + +<H3><A NAME="SEC57" HREF="gettext_toc.html#TOC57">9.6.1 Temporary - Two Possible Implementations</A></H3> + +<P> +There are two competing methods for language independent messages: +the X/Open <CODE>catgets</CODE> method, and the Uniforum <CODE>gettext</CODE> +method. The <CODE>catgets</CODE> method indexes messages by integers; the +<CODE>gettext</CODE> method indexes them by their English translations. +The <CODE>catgets</CODE> method has been around longer and is supported +by more vendors. The <CODE>gettext</CODE> method is supported by Sun, +and it has been heard that the COSE multi-vendor initiative is +supporting it. 
Neither method is a POSIX standard; the POSIX.1 +committee had a lot of disagreement in this area. + +</P> +<P> +Neither one is in the POSIX standard. There was much disagreement +in the POSIX.1 committee about using the <CODE>gettext</CODE> routines +vs. <CODE>catgets</CODE> (XPG). In the end the committee couldn't +agree on anything, so no messaging system was included as part +of the standard. I believe the informative annex of the standard +includes the XPG3 messaging interfaces, "...as an example of +a messaging system that has been implemented..." + +</P> +<P> +They were very careful not to say anywhere that you should use one +set of interfaces over the other. For more on this topic please +see the Programming for Internationalization FAQ. + +</P> + + +<H3><A NAME="SEC58" HREF="gettext_toc.html#TOC58">9.6.2 Temporary - About <CODE>catgets</CODE></A></H3> + +<P> +There have been a few discussions of late on the use of +<CODE>catgets</CODE> as a base. I think it important to present both +sides of the argument and hence am opting to play devil's advocate +for a little bit. + +</P> +<P> +I'll not deny the fact that <CODE>catgets</CODE> could have been designed +a lot better. It currently has quite a number of limitations and +these have already been pointed out. + +</P> +<P> +However there is a great deal to be said for consistency and +standardization. A common recurring problem when writing Unix +software is the myriad portability problems across Unix platforms. +It seems as if every Unix vendor had a look at the operating system +and found parts they could improve upon. Undoubtedly, these +modifications are probably innovative and solve real problems. +However, software developers have a hard time keeping up with all +these changes across so many platforms. + +</P> +<P> +And this has prompted the Unix vendors to begin to standardize their +systems. Hence the impetus for Spec1170. 
Every major Unix vendor +has committed to supporting this standard and every Unix software +developer awaits with glee the day they can write software to this +standard and simply recompile (without having to use autoconf) +across different platforms. + +</P> +<P> +As I understand it, Spec1170 is roughly based upon version 4 of the +X/Open Portability Guidelines (XPG4). Because <CODE>catgets</CODE> and +friends are defined in XPG4, I'm led to believe that <CODE>catgets</CODE> +is a part of Spec1170 and hence will become a standardized component +of all Unix systems. + +</P> + + +<H3><A NAME="SEC59" HREF="gettext_toc.html#TOC59">9.6.3 Temporary - Why a single implementation</A></H3> + +<P> +Now it seems kind of wasteful to me to have two different systems +installed for accessing message catalogs. If we do want to remedy the +deficiencies of <CODE>catgets</CODE>, why don't we try to expand <CODE>catgets</CODE> +(in a compatible manner) rather than implement an entirely new system? +Otherwise, we'll end up with two message catalog access systems installed +with an operating system - one set of routines for packages using GNU +<CODE>gettext</CODE> for their internationalization, and another set of routines +(catgets) for all other software. Bloated? + +</P> +<P> +Suppose another catalog access system is implemented. Which do +we recommend? At least for Linux, we need to attract as many +software developers as possible. Hence we need to make it as easy +for them to port their software as possible. Which means supporting +<CODE>catgets</CODE>. We will be implementing the <CODE>libintl</CODE> code +within our <CODE>libc</CODE>, but does this mean we also have to incorporate +another message catalog access scheme within our <CODE>libc</CODE> as well? +And what about people who are going to be using the <CODE>libintl</CODE> ++ non-<CODE>catgets</CODE> routines? 
When they port their software to +other platforms, they're now going to have to include the front-end +(<CODE>libintl</CODE>) code plus the back-end code (the non-<CODE>catgets</CODE> +access routines) with their software instead of just including the +<CODE>libintl</CODE> code with their software. + +</P> +<P> +Message catalog support is however only the tip of the iceberg. +What about the data for the other locale categories? They also have +a number of deficiencies. Are we going to abandon them as well and +develop another duplicate set of routines (should <CODE>libintl</CODE> +expand beyond message catalog support)? + +</P> +<P> +As with many parts of Unix that can be improved upon, we're stuck with balancing +compatibility with the past against useful improvements and innovations for +the future. + +</P> + + +<H3><A NAME="SEC60" HREF="gettext_toc.html#TOC60">9.6.4 Temporary - Notes</A></H3> + +<P> +X/Open agreed very late on the standard form so that many +implementations differ from the final form. Both of my systems (old +Linux catgets and Ultrix-4) have a strange variation. + +</P> +<P> +OK. After incorporating the last changes I have to spend some time on +making the GNU/Linux <CODE>libc</CODE> <CODE>gettext</CODE> functions. So in the future +Solaris will not be the only system having <CODE>gettext</CODE>. + +</P> +<P><HR><P> +Go to the <A HREF="gettext_1.html">first</A>, <A HREF="gettext_8.html">previous</A>, <A HREF="gettext_10.html">next</A>, <A HREF="gettext_14.html">last</A> section, <A HREF="gettext_toc.html">table of contents</A>. +</BODY> +</HTML> |