Diffstat (limited to 'upstream/mageia-cauldron/man1/enc2xs.1')
-rw-r--r-- | upstream/mageia-cauldron/man1/enc2xs.1 | 351 |
1 files changed, 351 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/enc2xs.1 b/upstream/mageia-cauldron/man1/enc2xs.1
new file mode 100644
index 00000000..8bac403c
--- /dev/null
+++ b/upstream/mageia-cauldron/man1/enc2xs.1
@@ -0,0 +1,351 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "ENC2XS 1"
+.TH ENC2XS 1 2023-12-15 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+enc2xs \-\- Perl Encode Module Generator
+.SH SYNOPSIS
+.IX Header "SYNOPSIS"
+.Vb 3
+\& enc2xs \-[options]
+\& enc2xs \-M ModName mapfiles...
+\& enc2xs \-C
+.Ve
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+\&\fIenc2xs\fR builds a Perl extension for use by Encode from either
+Unicode Character Mapping files (.ucm) or Tcl Encoding Files (.enc).
+Besides being used internally during the build process of the Encode
+module, you can use \fIenc2xs\fR to add your own encoding to perl.
+No knowledge of XS is necessary.
+.SH "Quick Guide"
+.IX Header "Quick Guide"
+If you want to know as little about Perl as possible but need to
+add a new encoding, just read this chapter and forget the rest.
+.IP 0. 4
+.IX Item "0."
+Have a .ucm file ready. You can get it from somewhere or you can write
+your own from scratch or you can grab one from the Encode distribution
+and customize it. For the UCM format, see the next Chapter. In the
+example below, I'll call my theoretical encoding myascii, defined
+in \fImy.ucm\fR. \f(CW\*(C`$\*(C'\fR is a shell prompt.
+.Sp
+.Vb 2
+\& $ ls \-F
+\& my.ucm
+.Ve
+.IP 1. 4
+.IX Item "1."
+Issue a command as follows;
+.Sp
+.Vb 5
+\& $ enc2xs \-M My my.ucm
+\& generating Makefile.PL
+\& generating My.pm
+\& generating README
+\& generating Changes
+.Ve
+.Sp
+Now take a look at your current directory. It should look like this.
+.Sp
+.Vb 2
+\& $ ls \-F
+\& Makefile.PL My.pm my.ucm t/
+.Ve
+.Sp
+The following files were created.
+.Sp
+.Vb 3
+\& Makefile.PL \- MakeMaker script
+\& My.pm \- Encode submodule
+\& t/My.t \- test file
+.Ve
+.RS 4
+.IP 1.1. 4
+.IX Item "1.1."
+If you want *.ucm installed together with the modules, do as follows;
+.Sp
+.Vb 3
+\& $ mkdir Encode
+\& $ mv *.ucm Encode
+\& $ enc2xs \-M My Encode/*ucm
+.Ve
+.RE
+.RS 4
+.RE
+.IP 2. 4
+.IX Item "2."
+Edit the files generated. You don't have to if you have no time AND no
+intention to give it to someone else. But it is a good idea to edit
+the pod and to add more tests.
+.IP 3. 4
+.IX Item "3."
+Now issue a command all Perl Mongers love:
+.Sp
+.Vb 2
+\& $ perl Makefile.PL
+\& Writing Makefile for Encode::My
+.Ve
+.IP 4. 4
+.IX Item "4."
+Now all you have to do is make.
+.Sp
+.Vb 12
+\& $ make
+\& cp My.pm blib/lib/Encode/My.pm
+\& /usr/local/bin/perl /usr/local/bin/enc2xs \-Q \-O \e
+\& \-o encode_t.c \-f encode_t.fnm
+\& Reading myascii (myascii)
+\& Writing compiled form
+\& 128 bytes in string tables
+\& 384 bytes (75%) saved spotting duplicates
+\& 1 bytes (0.775%) saved using substrings
+\& ....
+\& chmod 644 blib/arch/auto/Encode/My/My.bs
+\& $
+.Ve
+.Sp
+The time it takes varies depending on how fast your machine is and
+how large your encoding is. Unless you are working on something big
+like euc-tw, it won't take too long.
+.IP 5. 4
+.IX Item "5."
+You can "make install" already but you should test first.
+.Sp
+.Vb 8
+\& $ make test
+\& PERL_DL_NONLAZY=1 /usr/local/bin/perl \-Iblib/arch \-Iblib/lib \e
+\& \-e \*(Aquse Test::Harness qw(&runtests $verbose); \e
+\& $verbose=0; runtests @ARGV;\*(Aq t/*.t
+\& t/My....ok
+\& All tests successful.
+\& Files=1, Tests=2, 0 wallclock secs
+\& ( 0.09 cusr + 0.01 csys = 0.09 CPU)
+.Ve
+.IP 6. 4
+.IX Item "6."
+If you are content with the test result, just "make install"
+.IP 7. 4
+.IX Item "7."
+If you want to add your encoding to Encode's demand-loading list
+(so you don't have to "use Encode::YourEncoding"), run
+.Sp
+.Vb 1
+\& enc2xs \-C
+.Ve
+.Sp
+to update Encode::ConfigLocal, a module that controls local settings.
+After that, "use Encode;" is enough to load your encodings on demand.
+.SH "The Unicode Character Map"
+.IX Header "The Unicode Character Map"
+Encode uses the Unicode Character Map (UCM) format for source character
+mappings. This format is used by IBM's ICU package and was adopted
+by Nick Ing-Simmons for use with the Encode module. Since UCM is
+more flexible than Tcl's Encoding Map and far more user-friendly,
+this is the recommended format for Encode now.
+.PP
+A UCM file looks like this.
+.PP
+.Vb 10
+\& #
+\& # Comments
+\& #
+\& <code_set_name> "US\-ascii" # Required
+\& <code_set_alias> "ascii" # Optional
+\& <mb_cur_min> 1 # Required; usually 1
+\& <mb_cur_max> 1 # Max. # of bytes/char
+\& <subchar> \ex3F # Substitution char
+\& #
+\& CHARMAP
+\& <U0000> \ex00 |0 # <control>
+\& <U0001> \ex01 |0 # <control>
+\& <U0002> \ex02 |0 # <control>
+\& ....
+\& <U007C> \ex7C |0 # VERTICAL LINE
+\& <U007D> \ex7D |0 # RIGHT CURLY BRACKET
+\& <U007E> \ex7E |0 # TILDE
+\& <U007F> \ex7F |0 # <control>
+\& END CHARMAP
+.Ve
+.IP \(bu 4
+Anything that follows \f(CW\*(C`#\*(C'\fR is treated as a comment.
+.IP \(bu 4
+The header section continues until a line containing the word
+CHARMAP. Each line in it has the form \fI<keyword> value\fR, one
+pair per line. Strings used as values must be quoted. Barewords are
+treated as numbers. \fI\exXX\fR represents a byte.
+.Sp
+Most of the keywords are self-explanatory. \fIsubchar\fR means
+substitution character, not subcharacter. When you encode a Unicode
+sequence to this encoding but no matching character is found, the byte
+sequence defined here will be used. For most cases, the value here is
+\&\ex3F; in ASCII, this is a question mark.
+.IP \(bu 4
+CHARMAP starts the character map section. Each line has a form as
+follows:
+.Sp
+.Vb 5
+\& <UXXXX> \exXX.. |0 # comment
+\&   ^       ^      ^
+\&   |       |      +\- Fallback flag
+\&   |       +\-\-\-\-\-\-\-\- Encoded byte sequence
+\&   +\-\-\-\-\-\-\-\-\-\-\-\-\-\- Unicode Character ID in hex
+.Ve
+.Sp
+The format is roughly the same as a header section except for the
+fallback flag: | followed by 0..3. The meaning of the possible
+values is as follows:
+.RS 4
+.IP |0 4
+.IX Item "|0"
+Round trip safe. A character decoded to Unicode encodes back to the
+same byte sequence. Most characters have this flag.
+.IP |1 4
+.IX Item "|1"
+Fallback for Unicode \-> encoding. When seen, enc2xs adds this
+character for the encode map only.
+.IP |2 4
+.IX Item "|2"
+Skip sub-char mapping should there be no code point.
+.IP |3 4
+.IX Item "|3"
+Fallback for encoding \-> Unicode. When seen, enc2xs adds this
+character for the decode map only.
+.RE
+.RS 4
+.RE
+.IP \(bu 4
+And finally, END CHARMAP ends the section.
+.PP
+When you are manually creating a UCM file, you should copy ascii.ucm
+or an existing encoding which is close to yours, rather than write
+your own from scratch.
+.PP
+When you do so, make sure you leave at least \fBU0000\fR to \fBU0020\fR as
+is, unless your environment is EBCDIC.
+.PP
+\&\fBCAVEAT\fR: not all features in UCM are implemented. For example,
+icu:state is not used. Because of that, you need to write a perl
+module if you want to support algorithmic encodings, notably
+the ISO\-2022 series. Such modules include Encode::JP::2022_JP,
+Encode::KR::2022_KR, and Encode::TW::HZ.
+.SS "Coping with duplicate mappings"
+.IX Subsection "Coping with duplicate mappings"
+When you create a map, you SHOULD make your mappings round-trip safe.
+That is, \f(CWencode(\*(Aqyour\-encoding\*(Aq, decode(\*(Aqyour\-encoding\*(Aq, $data)) eq
+$data\fR holds for all characters that are marked as \f(CW\*(C`|0\*(C'\fR. Here is
+how to make sure:
+.IP \(bu 4
+Sort your map in Unicode order.
+.IP \(bu 4
+When you have a duplicate entry, mark either one with '|1' or '|3'.
+.IP \(bu 4
+And make sure the '|1' or '|3' entry FOLLOWS the '|0' entry.
+.PP
+Here is an example from big5\-eten.
+.PP
+.Vb 2
+\& <U2550> \exF9\exF9 |0
+\& <U2550> \exA2\exA4 |3
+.Ve
+.PP
+Internally, the Encoding \-> Unicode and Unicode \-> Encoding maps look like
+this:
+.PP
+.Vb 4
+\& E to U               U to E
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& \exF9\exF9 => U2550    U2550 => \exF9\exF9
+\& \exA2\exA4 => U2550
+.Ve
+.PP
+So it is round-trip safe for \exF9\exF9. But if the two lines above are
+swapped, here is what happens.
+.PP
+.Vb 4
+\& E to U               U to E
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& \exA2\exA4 => U2550    U2550 => \exF9\exF9
+\& (\exF9\exF9 => U2550 is now overwritten!)
+.Ve
+.PP
+The Encode package comes with \fIucmlint\fR, a crude but sufficient
+utility to check the integrity of a UCM file. Check under the
+Encode/bin directory for this.
+.PP
+When in doubt, you can use \fIucmsort\fR, yet another utility under
+Encode/bin directory.
+.SH Bookmarks
+.IX Header "Bookmarks"
+.IP \(bu 4
+ICU Home Page
+<http://www.icu\-project.org/>
+.IP \(bu 4
+ICU Character Mapping Tables
+<http://site.icu\-project.org/charts/charset>
+.IP \(bu 4
+ICU:Conversion Data
+<http://www.icu\-project.org/userguide/conversion\-data.html>
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+Encode,
+perlmod,
+perlpod
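Note on the fallback flags described in the man page above: the following is a minimal Python sketch of how the |0, |1 and |3 flags decide which map a CHARMAP entry lands in, as the page describes it (|0 feeds both maps, |1 the encode map only, |3 the decode map only). It is not enc2xs code. The function name build_maps and the regular expression are hypothetical, the input uses the rendered form \xNN rather than the troff source \exNN, and the sketch does not try to reproduce the overwrite behaviour that the page reports when a fallback entry precedes its |0 entry.

```python
import re

# Hypothetical pattern for one CHARMAP line, e.g.: <U2550> \xF9\xF9 |0 # comment
LINE_RE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])"
)


def build_maps(charmap_lines):
    """Return (decode_map, encode_map) for the given CHARMAP lines."""
    decode_map = {}  # encoded bytes -> Unicode code point ("E to U")
    encode_map = {}  # Unicode code point -> encoded bytes ("U to E")
    for line in charmap_lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # comments, blank lines, anything this sketch ignores
        codepoint = int(m.group(1), 16)
        hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
        raw = bytes(int(h, 16) for h in hex_pairs)
        flag = m.group(3)
        if flag in ("0", "3"):  # round-trip entries and decode-only fallbacks
            decode_map[raw] = codepoint
        if flag in ("0", "1"):  # round-trip entries and encode-only fallbacks
            encode_map[codepoint] = raw
    return decode_map, encode_map


# The big5-eten pair quoted in the man page.
decode_map, encode_map = build_maps([
    r"<U2550> \xF9\xF9 |0",
    r"<U2550> \xA2\xA4 |3",
])
print(decode_map)
print(encode_map)
```

With the big5-eten pair, both byte sequences end up in the decode map and only \xF9\xF9 ends up in the encode map, which matches the first "E to U / U to E" table in the page.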