diff options
Diffstat (limited to 'upstream/archlinux/man3/Locale::Maketext::TPJ13.3perl')
-rw-r--r-- | upstream/archlinux/man3/Locale::Maketext::TPJ13.3perl | 855 |
1 files changed, 855 insertions, 0 deletions
diff --git a/upstream/archlinux/man3/Locale::Maketext::TPJ13.3perl b/upstream/archlinux/man3/Locale::Maketext::TPJ13.3perl new file mode 100644 index 00000000..751b3478 --- /dev/null +++ b/upstream/archlinux/man3/Locale::Maketext::TPJ13.3perl @@ -0,0 +1,855 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "Locale::Maketext::TPJ13 3perl" +.TH Locale::Maketext::TPJ13 3perl 2024-02-11 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +Locale::Maketext::TPJ13 \-\- article about software localization +.SH SYNOPSIS +.IX Header "SYNOPSIS" +.Vb 1 +\& # This an article, not a module. +.Ve +.SH DESCRIPTION +.IX Header "DESCRIPTION" +The following article by Sean M. Burke and Jordan Lachler +first appeared in \fIThe Perl Journal\fR #13 +and is copyright 1999 The Perl Journal. It appears +courtesy of Jon Orwant and The Perl Journal. This document may be +distributed under the same terms as Perl itself. +.SH "Localization and Perl: gettext breaks, Maketext fixes" +.IX Header "Localization and Perl: gettext breaks, Maketext fixes" +by Sean M. Burke and Jordan Lachler +.PP +This article points out cases where gettext (a common system for +localizing software interfaces \-\- i.e., making them work in the user's +language of choice) fails because of basic differences between human +languages. This article then describes Maketext, a new system capable +of correctly treating these differences. +.SS "A Localization Horror Story: It Could Happen To You" +.IX Subsection "A Localization Horror Story: It Could Happen To You" +.RS 4 +"There are a number of languages spoken by human beings in this +world." +.Sp +\&\-\- Harald Tveit Alvestrand, in RFC 1766, "Tags for the +Identification of Languages" +.RE +.PP +Imagine that your task for the day is to localize a piece of software +\&\-\- and luckily for you, the only output the program emits is two +messages, like this: +.PP +.Vb 1 +\& I scanned 12 directories. +\& +\& Your query matched 10 files in 4 directories. +.Ve +.PP +So how hard could that be? You look at the code that +produces the first item, and it reads: +.PP +.Vb 2 +\& printf("I scanned %g directories.", +\& $directory_count); +.Ve +.PP +You think about that, and realize that it doesn't even work right for +English, as it can produce this output: +.PP +.Vb 1 +\& I scanned 1 directories. +.Ve +.PP +So you rewrite it to read: +.PP +.Vb 5 +\& printf("I scanned %g %s.", +\& $directory_count, +\& $directory_count == 1 ? +\& "directory" : "directories", +\& ); +.Ve +.PP +\&...which does the Right Thing. (In case you don't recall, "%g" is for +locale-specific number interpolation, and "%s" is for string +interpolation.) +.PP +But you still have to localize it for all the languages you're +producing this software for, so you pull Locale::gettext off of CPAN +so you can access the \f(CW\*(C`gettext\*(C'\fR C functions you've heard are standard +for localization tasks. +.PP +And you write: +.PP +.Vb 5 +\& printf(gettext("I scanned %g %s."), +\& $dir_scan_count, +\& $dir_scan_count == 1 ? +\& gettext("directory") : gettext("directories"), +\& ); +.Ve +.PP +But you then read in the gettext manual (Drepper, Miller, and Pinard 1995) +that this is not a good idea, since how a single word like "directory" +or "directories" is translated may depend on context \-\- and this is +true, since in a case language like German or Russian, you'd may need +these words with a different case ending in the first instance (where the +word is the object of a verb) than in the second instance, which you haven't even +gotten to yet (where the word is the object of a preposition, "in \f(CW%g\fR +directories") \-\- assuming these keep the same syntax when translated +into those languages. +.PP +So, on the advice of the gettext manual, you rewrite: +.PP +.Vb 4 +\& printf( $dir_scan_count == 1 ? +\& gettext("I scanned %g directory.") : +\& gettext("I scanned %g directories."), +\& $dir_scan_count ); +.Ve +.PP +So, you email your various translators (the boss decides that the +languages du jour are Chinese, Arabic, Russian, and Italian, so you +have one translator for each), asking for translations for "I scanned +\&\f(CW%g\fR directory." and "I scanned \f(CW%g\fR directories.". When they reply, +you'll put that in the lexicons for gettext to use when it localizes +your software, so that when the user is running under the "zh" +(Chinese) locale, gettext("I scanned \f(CW%g\fR directory.") will return the +appropriate Chinese text, with a "%g" in there where printf can then +interpolate \f(CW$dir_scan\fR. +.PP +Your Chinese translator emails right back \-\- he says both of these +phrases translate to the same thing in Chinese, because, in linguistic +jargon, Chinese "doesn't have number as a grammatical category" \-\- +whereas English does. That is, English has grammatical rules that +refer to "number", i.e., whether something is grammatically singular +or plural; and one of these rules is the one that forces nouns to take +a plural suffix (generally "s") when in a plural context, as they are when +they follow a number other than "one" (including, oddly enough, "zero"). +Chinese has no such rules, and so has just the one phrase where English +has two. But, no problem, you can have this one Chinese phrase appear +as the translation for the two English phrases in the "zh" gettext +lexicon for your program. +.PP +Emboldened by this, you dive into the second phrase that your software +needs to output: "Your query matched 10 files in 4 directories.". You notice +that if you want to treat phrases as indivisible, as the gettext +manual wisely advises, you need four cases now, instead of two, to +cover the permutations of singular and plural on the two items, +\&\f(CW$dir_count\fR and \f(CW$file_count\fR. So you try this: +.PP +.Vb 9 +\& printf( $file_count == 1 ? +\& ( $directory_count == 1 ? +\& gettext("Your query matched %g file in %g directory.") : +\& gettext("Your query matched %g file in %g directories.") ) : +\& ( $directory_count == 1 ? +\& gettext("Your query matched %g files in %g directory.") : +\& gettext("Your query matched %g files in %g directories.") ), +\& $file_count, $directory_count, +\& ); +.Ve +.PP +(The case of "1 file in 2 [or more] directories" could, I suppose, +occur in the case of symlinking or something of the sort.) +.PP +It occurs to you that this is not the prettiest code you've ever +written, but this seems the way to go. You mail off to the +translators asking for translations for these four cases. The +Chinese guy replies with the one phrase that these all translate to in +Chinese, and that phrase has two "%g"s in it, as it should \-\- but +there's a problem. He translates it word-for-word back: "In \f(CW%g\fR +directories contains \f(CW%g\fR files match your query." The \f(CW%g\fR +slots are in an order reverse to what they are in English. You wonder +how you'll get gettext to handle that. +.PP +But you put it aside for the moment, and optimistically hope that the +other translators won't have this problem, and that their languages +will be better behaved \-\- i.e., that they will be just like English. +.PP +But the Arabic translator is the next to write back. First off, your +code for "I scanned \f(CW%g\fR directory." or "I scanned \f(CW%g\fR directories." +assumes there's only singular or plural. But, to use linguistic +jargon again, Arabic has grammatical number, like English (but unlike +Chinese), but it's a three-term category: singular, dual, and plural. +In other words, the way you say "directory" depends on whether there's +one directory, or \fItwo\fR of them, or \fImore than two\fR of them. Your +test of \f(CW\*(C`($directory == 1)\*(C'\fR no longer does the job. And it means +that where English's grammatical category of number necessitates +only the two permutations of the first sentence based on "directory +[singular]" and "directories [plural]", Arabic has three \-\- and, +worse, in the second sentence ("Your query matched \f(CW%g\fR file in \f(CW%g\fR +directory."), where English has four, Arabic has nine. You sense +an unwelcome, exponential trend taking shape. +.PP +Your Italian translator emails you back and says that "I searched 0 +directories" (a possible English output of your program) is stilted, +and if you think that's fine English, that's your problem, but that +\&\fIjust will not do\fR in the language of Dante. He insists that where +\&\f(CW$directory_count\fR is 0, your program should produce the Italian text +for "I \fIdidn't\fR scan \fIany\fR directories.". And ditto for "I didn't +match any files in any directories", although he says the last part +about "in any directories" should probably just be left off. +.PP +You wonder how you'll get gettext to handle this; to accommodate the +ways Arabic, Chinese, and Italian deal with numbers in just these few +very simple phrases, you need to write code that will ask gettext for +different queries depending on whether the numerical values in +question are 1, 2, more than 2, or in some cases 0, and you still haven't +figured out the problem with the different word order in Chinese. +.PP +Then your Russian translator calls on the phone, to \fIpersonally\fR tell +you the bad news about how really unpleasant your life is about to +become: +.PP +Russian, like German or Latin, is an inflectional language; that is, nouns +and adjectives have to take endings that depend on their case +(i.e., nominative, accusative, genitive, etc...) \-\- which is roughly a matter of +what role they have in syntax of the sentence \-\- +as well as on the grammatical gender (i.e., masculine, feminine, neuter) +and number (i.e., singular or plural) of the noun, as well as on the +declension class of the noun. But unlike with most other inflected languages, +putting a number-phrase (like "ten" or "forty-three", or their Arabic +numeral equivalents) in front of noun in Russian can change the case and +number that noun is, and therefore the endings you have to put on it. +.PP +He elaborates: In "I scanned \f(CW%g\fR directories", you'd \fIexpect\fR +"directories" to be in the accusative case (since it is the direct +object in the sentence) and the plural number, +except where \f(CW$directory_count\fR is 1, then you'd expect the singular, of +course. Just like Latin or German. \fIBut!\fR Where \f(CW$directory_count\fR % +10 is 1 ("%" for modulo, remember), assuming \f(CW$directory\fR count is an +integer, and except where \f(CW$directory_count\fR % 100 is 11, "directories" +is forced to become grammatically singular, which means it gets the +ending for the accusative singular... You begin to visualize the code +it'd take to test for the problem so far, \fIand still work for Chinese +and Arabic and Italian\fR, and how many gettext items that'd take, but +he keeps going... But where \f(CW$directory_count\fR % 10 is 2, 3, or 4 +(except where \f(CW$directory_count\fR % 100 is 12, 13, or 14), the word for +"directories" is forced to be genitive singular \-\- which means another +ending... The room begins to spin around you, slowly at first... But +with \fIall other\fR integer values, since "directory" is an inanimate +noun, when preceded by a number and in the nominative or accusative +cases (as it is here, just your luck!), it does stay plural, but it is +forced into the genitive case \-\- yet another ending... And +you never hear him get to the part about how you're going to run into +similar (but maybe subtly different) problems with other Slavic +languages like Polish, because the floor comes up to meet you, and you +fade into unconsciousness. +.PP +The above cautionary tale relates how an attempt at localization can +lead from programmer consternation, to program obfuscation, to a need +for sedation. But careful evaluation shows that your choice of tools +merely needed further consideration. +.SS "The Linguistic View" +.IX Subsection "The Linguistic View" +.RS 4 +"It is more complicated than you think." +.Sp +\&\-\- The Eighth Networking Truth, from RFC 1925 +.RE +.PP +The field of Linguistics has expended a great deal of effort over the +past century trying to find grammatical patterns which hold across +languages; it's been a constant process +of people making generalizations that should apply to all languages, +only to find out that, all too often, these generalizations fail \-\- +sometimes failing for just a few languages, sometimes whole classes of +languages, and sometimes nearly every language in the world except +English. Broad statistical trends are evident in what the "average +language" is like as far as what its rules can look like, must look +like, and cannot look like. But the "average language" is just as +unreal a concept as the "average person" \-\- it runs up against the +fact no language (or person) is, in fact, average. The wisdom of past +experience leads us to believe that any given language can do whatever +it wants, in any order, with appeal to any kind of grammatical +categories wants \-\- case, number, tense, real or metaphoric +characteristics of the things that words refer to, arbitrary or +predictable classifications of words based on what endings or prefixes +they can take, degree or means of certainty about the truth of +statements expressed, and so on, ad infinitum. +.PP +Mercifully, most localization tasks are a matter of finding ways to +translate whole phrases, generally sentences, where the context is +relatively set, and where the only variation in content is \fIusually\fR +in a number being expressed \-\- as in the example sentences above. +Translating specific, fully-formed sentences is, in practice, fairly +foolproof \-\- which is good, because that's what's in the phrasebooks +that so many tourists rely on. Now, a given phrase (whether in a +phrasebook or in a gettext lexicon) in one language \fImight\fR have a +greater or lesser applicability than that phrase's translation into +another language \-\- for example, strictly speaking, in Arabic, the +"your" in "Your query matched..." would take a different form +depending on whether the user is male or female; so the Arabic +translation "your[feminine] query" is applicable in fewer cases than +the corresponding English phrase, which doesn't distinguish the user's +gender. (In practice, it's not feasible to have a program know the +user's gender, so the masculine "you" in Arabic is usually used, by +default.) +.PP +But in general, such surprises are rare when entire sentences are +being translated, especially when the functional context is restricted +to that of a computer interacting with a user either to convey a fact +or to prompt for a piece of information. So, for purposes of +localization, translation by phrase (generally by sentence) is both the +simplest and the least problematic. +.SS "Breaking gettext" +.IX Subsection "Breaking gettext" +.RS 4 +"It Has To Work." +.Sp +\&\-\- First Networking Truth, RFC 1925 +.RE +.PP +Consider that sentences in a tourist phrasebook are of two types: ones +like "How do I get to the marketplace?" that don't have any blanks to +fill in, and ones like "How much do these _\|_\|_ cost?", where there's +one or more blanks to fill in (and these are usually linked to a +list of words that you can put in that blank: "fish", "potatoes", +"tomatoes", etc.). The ones with no blanks are no problem, but the +fill-in-the-blank ones may not be really straightforward. If it's a +Swahili phrasebook, for example, the authors probably didn't bother to +tell you the complicated ways that the verb "cost" changes its +inflectional prefix depending on the noun you're putting in the blank. +The trader in the marketplace will still understand what you're saying if +you say "how much do these potatoes cost?" with the wrong +inflectional prefix on "cost". After all, \fIyou\fR can't speak proper Swahili, +\&\fIyou're\fR just a tourist. But while tourists can be stupid, computers +are supposed to be smart; the computer should be able to fill in the +blank, and still have the results be grammatical. +.PP +In other words, a phrasebook entry takes some values as parameters +(the things that you fill in the blank or blanks), and provides a value +based on these parameters, where the way you get that final value from +the given values can, properly speaking, involve an arbitrarily +complex series of operations. (In the case of Chinese, it'd be not at +all complex, at least in cases like the examples at the beginning of +this article; whereas in the case of Russian it'd be a rather complex +series of operations. And in some languages, the +complexity could be spread around differently: while the act of +putting a number-expression in front of a noun phrase might not be +complex by itself, it may change how you have to, for example, inflect +a verb elsewhere in the sentence. This is what in syntax is called +"long-distance dependencies".) +.PP +This talk of parameters and arbitrary complexity is just another way +to say that an entry in a phrasebook is what in a programming language +would be called a "function". Just so you don't miss it, this is the +crux of this article: \fIA phrase is a function; a phrasebook is a +bunch of functions.\fR +.PP +The reason that using gettext runs into walls (as in the above +second-person horror story) is that you're trying to use a string (or +worse, a choice among a bunch of strings) to do what you really need a +function for \-\- which is futile. Preforming (s)printf interpolation +on the strings which you get back from gettext does allow you to do \fIsome\fR +common things passably well... sometimes... sort of; but, to paraphrase +what some people say about \f(CW\*(C`csh\*(C'\fR script programming, "it fools you +into thinking you can use it for real things, but you can't, and you +don't discover this until you've already spent too much time trying, +and by then it's too late." +.SS "Replacing gettext" +.IX Subsection "Replacing gettext" +So, what needs to replace gettext is a system that supports lexicons +of functions instead of lexicons of strings. An entry in a lexicon +from such a system should \fInot\fR look like this: +.PP +.Vb 1 +\& "J\*(Aqai trouv\exE9 %g fichiers dans %g r\exE9pertoires" +.Ve +.PP +[\exE9 is e\-acute in Latin\-1. Some pod renderers would +scream if I used the actual character here. \-\- SB] +.PP +but instead like this, bearing in mind that this is just a first stab: +.PP +.Vb 8 +\& sub I_found_X1_files_in_X2_directories { +\& my( $files, $dirs ) = @_[0,1]; +\& $files = sprintf("%g %s", $files, +\& $files == 1 ? \*(Aqfichier\*(Aq : \*(Aqfichiers\*(Aq); +\& $dirs = sprintf("%g %s", $dirs, +\& $dirs == 1 ? "r\exE9pertoire" : "r\exE9pertoires"); +\& return "J\*(Aqai trouv\exE9 $files dans $dirs."; +\& } +.Ve +.PP +Now, there's no particularly obvious way to store anything but strings +in a gettext lexicon; so it looks like we just have to start over and +make something better, from scratch. I call my shot at a +gettext-replacement system "Maketext", or, in CPAN terms, +Locale::Maketext. +.PP +When designing Maketext, I chose to plan its main features in terms of +"buzzword compliance". And here are the buzzwords: +.SS "Buzzwords: Abstraction and Encapsulation" +.IX Subsection "Buzzwords: Abstraction and Encapsulation" +The complexity of the language you're trying to output a phrase in is +entirely abstracted inside (and encapsulated within) the Maketext module +for that interface. When you call: +.PP +.Vb 2 +\& print $lang\->maketext("You have [quant,_1,piece] of new mail.", +\& scalar(@messages)); +.Ve +.PP +you don't know (and in fact can't easily find out) whether this will +involve lots of figuring, as in Russian (if \f(CW$lang\fR is a handle to the +Russian module), or relatively little, as in Chinese. That kind of +abstraction and encapsulation may encourage other pleasant buzzwords +like modularization and stratification, depending on what design +decisions you make. +.SS "Buzzword: Isomorphism" +.IX Subsection "Buzzword: Isomorphism" +"Isomorphism" means "having the same structure or form"; in discussions +of program design, the word takes on the special, specific meaning that +your implementation of a solution to a problem \fIhas the same +structure\fR as, say, an informal verbal description of the solution, or +maybe of the problem itself. Isomorphism is, all things considered, +a good thing \-\- it's what problem-solving (and solution-implementing) +should look like. +.PP +What's wrong the with gettext-using code like this... +.PP +.Vb 9 +\& printf( $file_count == 1 ? +\& ( $directory_count == 1 ? +\& "Your query matched %g file in %g directory." : +\& "Your query matched %g file in %g directories." ) : +\& ( $directory_count == 1 ? +\& "Your query matched %g files in %g directory." : +\& "Your query matched %g files in %g directories." ), +\& $file_count, $directory_count, +\& ); +.Ve +.PP +is first off that it's not well abstracted \-\- these ways of testing +for grammatical number (as in the expressions like \f(CW\*(C`foo == 1 ? +singular_form : plural_form\*(C'\fR) should be abstracted to each language +module, since how you get grammatical number is language-specific. +.PP +But second off, it's not isomorphic \-\- the "solution" (i.e., the +phrasebook entries) for Chinese maps from these four English phrases to +the one Chinese phrase that fits for all of them. In other words, the +informal solution would be "The way to say what you want in Chinese is +with the one phrase 'For your question, in Y directories you would +find X files'" \-\- and so the implemented solution should be, +isomorphically, just a straightforward way to spit out that one +phrase, with numerals properly interpolated. It shouldn't have to map +from the complexity of other languages to the simplicity of this one. +.SS "Buzzword: Inheritance" +.IX Subsection "Buzzword: Inheritance" +There's a great deal of reuse possible for sharing of phrases between +modules for related dialects, or for sharing of auxiliary functions +between related languages. (By "auxiliary functions", I mean +functions that don't produce phrase-text, but which, say, return an +answer to "does this number require a plural noun after it?". Such +auxiliary functions would be used in the internal logic of functions +that actually do produce phrase-text.) +.PP +In the case of sharing phrases, consider that you have an interface +already localized for American English (probably by having been +written with that as the native locale, but that's incidental). +Localizing it for UK English should, in practical terms, be just a +matter of running it past a British person with the instructions to +indicate what few phrases would benefit from a change in spelling or +possibly minor rewording. In that case, you should be able to put in +the UK English localization module \fIonly\fR those phrases that are +UK-specific, and for all the rest, \fIinherit\fR from the American +English module. (And I expect this same situation would apply with +Brazilian and Continental Portugese, possibly with some \fIvery\fR +closely related languages like Czech and Slovak, and possibly with the +slightly different "versions" of written Mandarin Chinese, as I hear exist in +Taiwan and mainland China.) +.PP +As to sharing of auxiliary functions, consider the problem of Russian +numbers from the beginning of this article; obviously, you'd want to +write only once the hairy code that, given a numeric value, would +return some specification of which case and number a given quantified +noun should use. But suppose that you discover, while localizing an +interface for, say, Ukrainian (a Slavic language related to Russian, +spoken by several million people, many of whom would be relieved to +find that your Web site's or software's interface is available in +their language), that the rules in Ukrainian are the same as in Russian +for quantification, and probably for many other grammatical functions. +While there may well be no phrases in common between Russian and +Ukrainian, you could still choose to have the Ukrainian module inherit +from the Russian module, just for the sake of inheriting all the +various grammatical methods. Or, probably better organizationally, +you could move those functions to a module called \f(CW\*(C`_E_Slavic\*(C'\fR or +something, which Russian and Ukrainian could inherit useful functions +from, but which would (presumably) provide no lexicon. +.SS "Buzzword: Concision" +.IX Subsection "Buzzword: Concision" +Okay, concision isn't a buzzword. But it should be, so I decree that +as a new buzzword, "concision" means that simple common things should +be expressible in very few lines (or maybe even just a few characters) +of code \-\- call it a special case of "making simple things easy and +hard things possible", and see also the role it played in the +MIDI::Simple language, discussed elsewhere in this issue [TPJ#13]. +.PP +Consider our first stab at an entry in our "phrasebook of functions": +.PP +.Vb 8 +\& sub I_found_X1_files_in_X2_directories { +\& my( $files, $dirs ) = @_[0,1]; +\& $files = sprintf("%g %s", $files, +\& $files == 1 ? \*(Aqfichier\*(Aq : \*(Aqfichiers\*(Aq); +\& $dirs = sprintf("%g %s", $dirs, +\& $dirs == 1 ? "r\exE9pertoire" : "r\exE9pertoires"); +\& return "J\*(Aqai trouv\exE9 $files dans $dirs."; +\& } +.Ve +.PP +You may sense that a lexicon (to use a non-committal catch-all term for a +collection of things you know how to say, regardless of whether they're +phrases or words) consisting of functions \fIexpressed\fR as above would +make for rather long-winded and repetitive code \-\- even if you wisely +rewrote this to have quantification (as we call adding a number +expression to a noun phrase) be a function called like: +.PP +.Vb 6 +\& sub I_found_X1_files_in_X2_directories { +\& my( $files, $dirs ) = @_[0,1]; +\& $files = quant($files, "fichier"); +\& $dirs = quant($dirs, "r\exE9pertoire"); +\& return "J\*(Aqai trouv\exE9 $files dans $dirs."; +\& } +.Ve +.PP +And you may also sense that you do not want to bother your translators +with having to write Perl code \-\- you'd much rather that they spend +their \fIvery costly time\fR on just translation. And this is to say +nothing of the near impossibility of finding a commercial translator +who would know even simple Perl. +.PP +In a first-hack implementation of Maketext, each language-module's +lexicon looked like this: +.PP +.Vb 10 +\& %Lexicon = ( +\& "I found %g files in %g directories" +\& => sub { +\& my( $files, $dirs ) = @_[0,1]; +\& $files = quant($files, "fichier"); +\& $dirs = quant($dirs, "r\exE9pertoire"); +\& return "J\*(Aqai trouv\exE9 $files dans $dirs."; +\& }, +\& ... and so on with other phrase => sub mappings ... +\& ); +.Ve +.PP +but I immediately went looking for some more concise way to basically +denote the same phrase-function \-\- a way that would also serve to +concisely denote \fImost\fR phrase-functions in the lexicon for \fImost\fR +languages. After much time and even some actual thought, I decided on +this system: +.PP +* Where a value in a \f(CW%Lexicon\fR hash is a contentful string instead of +an anonymous sub (or, conceivably, a coderef), it would be interpreted +as a sort of shorthand expression of what the sub does. When accessed +for the first time in a session, it is parsed, turned into Perl code, +and then eval'd into an anonymous sub; then that sub replaces the +original string in that lexicon. (That way, the work of parsing and +evaling the shorthand form for a given phrase is done no more than +once per session.) +.PP +* Calls to \f(CW\*(C`maketext\*(C'\fR (as Maketext's main function is called) happen +thru a "language session handle", notionally very much like an IO +handle, in that you open one at the start of the session, and use it +for "sending signals" to an object in order to have it return the text +you want. +.PP +So, this: +.PP +.Vb 2 +\& $lang\->maketext("You have [quant,_1,piece] of new mail.", +\& scalar(@messages)); +.Ve +.PP +basically means this: look in the lexicon for \f(CW$lang\fR (which may inherit +from any number of other lexicons), and find the function that we +happen to associate with the string "You have [quant,_1,piece] of new +mail" (which is, and should be, a functioning "shorthand" for this +function in the native locale \-\- English in this case). If you find +such a function, call it with \f(CW$lang\fR as its first parameter (as if it +were a method), and then a copy of scalar(@messages) as its second, +and then return that value. If that function was found, but was in +string shorthand instead of being a fully specified function, parse it +and make it into a function before calling it the first time. +.PP +* The shorthand uses code in brackets to indicate method calls that +should be performed. A full explanation is not in order here, but a +few examples will suffice: +.PP +.Vb 1 +\& "You have [quant,_1,piece] of new mail." +.Ve +.PP +The above code is shorthand for, and will be interpreted as, +this: +.PP +.Vb 8 +\& sub { +\& my $handle = $_[0]; +\& my(@params) = @_; +\& return join \*(Aq\*(Aq, +\& "You have ", +\& $handle\->quant($params[1], \*(Aqpiece\*(Aq), +\& "of new mail."; +\& } +.Ve +.PP +where "quant" is the name of a method you're using to quantify the +noun "piece" with the number \f(CW$params\fR[0]. +.PP +A string with no brackety calls, like this: +.PP +.Vb 1 +\& "Your search expression was malformed." +.Ve +.PP +is somewhat of a degenerate case, and just gets turned into: +.PP +.Vb 1 +\& sub { return "Your search expression was malformed." } +.Ve +.PP +However, not everything you can write in Perl code can be written in +the above shorthand system \-\- not by a long shot. For example, consider +the Italian translator from the beginning of this article, who wanted +the Italian for "I didn't find any files" as a special case, instead +of "I found 0 files". That couldn't be specified (at least not easily +or simply) in our shorthand system, and it would have to be written +out in full, like this: +.PP +.Vb 10 +\& sub { # pretend the English strings are in Italian +\& my($handle, $files, $dirs) = @_[0,1,2]; +\& return "I didn\*(Aqt find any files" unless $files; +\& return join \*(Aq\*(Aq, +\& "I found ", +\& $handle\->quant($files, \*(Aqfile\*(Aq), +\& " in ", +\& $handle\->quant($dirs, \*(Aqdirectory\*(Aq), +\& "."; +\& } +.Ve +.PP +Next to a lexicon full of shorthand code, that sort of sticks out like a +sore thumb \-\- but this \fIis\fR a special case, after all; and at least +it's possible, if not as concise as usual. +.PP +As to how you'd implement the Russian example from the beginning of +the article, well, There's More Than One Way To Do It, but it could be +something like this (using English words for Russian, just so you know +what's going on): +.PP +.Vb 1 +\& "I [quant,_1,directory,accusative] scanned." +.Ve +.PP +This shifts the burden of complexity off to the quant method. That +method's parameters are: the numeric value it's going to use to +quantify something; the Russian word it's going to quantify; and the +parameter "accusative", which you're using to mean that this +sentence's syntax wants a noun in the accusative case there, although +that quantification method may have to overrule, for grammatical +reasons you may recall from the beginning of this article. +.PP +Now, the Russian quant method here is responsible not only for +implementing the strange logic necessary for figuring out how Russian +number-phrases impose case and number on their noun-phrases, but also +for inflecting the Russian word for "directory". How that inflection +is to be carried out is no small issue, and among the solutions I've +seen, some (like variations on a simple lookup in a hash where all +possible forms are provided for all necessary words) are +straightforward but \fIcan\fR become cumbersome when you need to inflect +more than a few dozen words; and other solutions (like using +algorithms to model the inflections, storing only root forms and +irregularities) \fIcan\fR involve more overhead than is justifiable for +all but the largest lexicons. +.PP +Mercifully, this design decision becomes crucial only in the hairiest +of inflected languages, of which Russian is by no means the \fIworst\fR case +scenario, but is worse than most. Most languages have simpler +inflection systems; for example, in English or Swahili, there are +generally no more than two possible inflected forms for a given noun +("error/errors"; "kosa/makosa"), and the +rules for producing these forms are fairly simple \-\- or at least, +simple rules can be formulated that work for most words, and you can +then treat the exceptions as just "irregular", at least relative to +your ad hoc rules. A simpler inflection system (simpler rules, fewer +forms) means that design decisions are less crucial to maintaining +sanity, whereas the same decisions could incur +overhead-versus-scalability problems in languages like Russian. It +may \fIalso\fR be likely that code (possibly in Perl, as with +Lingua::EN::Inflect, for English nouns) has already +been written for the language in question, whether simple or complex. +.PP +Moreover, a third possibility may even be simpler than anything +discussed above: "Just require that all possible (or at least +applicable) forms be provided in the call to the given language's quant +method, as in:" +.PP +.Vb 1 +\& "I found [quant,_1,file,files]." +.Ve +.PP +That way, quant just has to chose which form it needs, without having +to look up or generate anything. While possibly not optimal for +Russian, this should work well for most other languages, where +quantification is not as complicated an operation. +.SS "The Devil in the Details" +.IX Subsection "The Devil in the Details" +There's plenty more to Maketext than described above \-\- for example, +there's the details of how language tags ("en-US", "i\-pwn", "fi", +etc.) or locale IDs ("en_US") interact with actual module naming +("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there's the +details of how to record (and possibly negotiate) what character +encoding Maketext will return text in (UTF8? Latin\-1? KOI8?). There's +the interesting fact that Maketext is for localization, but nowhere +actually has a "\f(CW\*(C`use locale;\*(C'\fR" anywhere in it. For the curious, +there's the somewhat frightening details of how I actually +implement something like data inheritance so that searches across +modules' \f(CW%Lexicon\fR hashes can parallel how Perl implements method +inheritance. +.PP +And, most importantly, there's all the practical details of how to +actually go about deriving from Maketext so you can use it for your +interfaces, and the various tools and conventions for starting out and +maintaining individual language modules. +.PP +That is all covered in the documentation for Locale::Maketext and the +modules that come with it, available in CPAN. After having read this +article, which covers the why's of Maketext, the documentation, +which covers the how's of it, should be quite straightforward. +.SS "The Proof in the Pudding: Localizing Web Sites" +.IX Subsection "The Proof in the Pudding: Localizing Web Sites" +Maketext and gettext have a notable difference: gettext is in C, +accessible thru C library calls, whereas Maketext is in Perl, and +really can't work without a Perl interpreter (although I suppose +something like it could be written for C). Accidents of history (and +not necessarily lucky ones) have made C++ the most common language for +the implementation of applications like word processors, Web browsers, +and even many in-house applications like custom query systems. Current +conditions make it somewhat unlikely that the next one of any of these +kinds of applications will be written in Perl, albeit clearly more for +reasons of custom and inertia than out of consideration of what is the +right tool for the job. +.PP +However, other accidents of history have made Perl a well-accepted +language for design of server-side programs (generally in CGI form) +for Web site interfaces. Localization of static pages in Web sites is +trivial, feasible either with simple language-negotiation features in +servers like Apache, or with some kind of server-side inclusions of +language-appropriate text into layout templates. However, I think +that the localization of Perl-based search systems (or other kinds of +dynamic content) in Web sites, be they public or access-restricted, +is where Maketext will see the greatest use. +.PP +I presume that it would be only the exceptional Web site that gets +localized for English \fIand\fR Chinese \fIand\fR Italian \fIand\fR Arabic +\&\fIand\fR Russian, to recall the languages from the beginning of this +article \-\- to say nothing of German, Spanish, French, Japanese, +Finnish, and Hindi, to name a few languages that benefit from large +numbers of programmers or Web viewers or both. +.PP +However, the ever-increasing internationalization of the Web (whether +measured in terms of amount of content, of numbers of content writers +or programmers, or of size of content audiences) makes it increasingly +likely that the interface to the average Web-based dynamic content +service will be localized for two or maybe three languages. It is my +hope that Maketext will make that task as simple as possible, and will +remove previous barriers to localization for languages dissimilar to +English. +.PP +.Vb 1 +\& _\|_END_\|_ +.Ve +.PP +Sean M. Burke (sburke@cpan.org) has a Master's in linguistics +from Northwestern University; he specializes in language technology. +Jordan Lachler (lachler@unm.edu) is a PhD student in the Department of +Linguistics at the University of New Mexico; he specializes in +morphology and pedagogy of North American native languages. +.SS References +.IX Subsection "References" +Alvestrand, Harald Tveit. 1995. \fIRFC 1766: Tags for the +Identification of Languages.\fR +\&\f(CW\*(C`<http://www.ietf.org/rfc/rfc1766.txt>\*(C'\fR +[Now see RFC 3066.] +.PP +Callon, Ross, editor. 1996. \fIRFC 1925: The Twelve +Networking Truths.\fR +\&\f(CW\*(C`<http://www.ietf.org/rfc/rfc1925.txt>\*(C'\fR +.PP +Drepper, Ulrich, Peter Miller, +and François Pinard. 1995\-2001. GNU +\&\f(CW\*(C`gettext\*(C'\fR. Available in \f(CW\*(C`<ftp://prep.ai.mit.edu/pub/gnu/>\*(C'\fR, with +extensive docs in the distribution tarball. [Since +I wrote this article in 1998, I now see that the +gettext docs are now trying more to come to terms with +plurality. Whether useful conclusions have come from it +is another question altogether. \-\- SMB, May 2001] +.PP +Forbes, Nevill. 1964. \fIRussian Grammar.\fR Third Edition, revised +by J. C. Dumbreck. Oxford University Press. |