Diffstat (limited to 'upstream/mageia-cauldron/man1/perlebcdic.1')
-rw-r--r-- | upstream/mageia-cauldron/man1/perlebcdic.1 | 1998 |
1 files changed, 1998 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlebcdic.1 b/upstream/mageia-cauldron/man1/perlebcdic.1 new file mode 100644 index 00000000..b23c5fd5 --- /dev/null +++ b/upstream/mageia-cauldron/man1/perlebcdic.1 @@ -0,0 +1,1998 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLEBCDIC 1" +.TH PERLEBCDIC 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perlebcdic \- Considerations for running Perl on EBCDIC platforms +.SH DESCRIPTION +.IX Header "DESCRIPTION" +An exploration of some of the issues facing Perl programmers +on EBCDIC based computers. 
+.PP +Portions of this document that are still incomplete are marked with XXX. +.PP +Early Perl versions worked on some EBCDIC machines, but the last known +version that ran on EBCDIC was v5.8.7, until v5.22, when the Perl core +again works on z/OS. Theoretically, it could work on OS/400 or Siemens' +BS2000 (or their successors), but this is untested. In v5.22 and 5.24, +not all +the modules found on CPAN but shipped with core Perl work on z/OS. +.PP +If you want to use Perl on a non\-z/OS EBCDIC machine, please let us know +at <https://github.com/Perl/perl5/issues>. +.PP +Writing Perl on an EBCDIC platform is really no different than writing +on an "ASCII" one, but with different underlying numbers, as we'll see +shortly. You'll have to know something about those "ASCII" platforms +because the documentation is biased and will frequently use example +numbers that don't apply to EBCDIC. There are also very few CPAN +modules that are written for EBCDIC and which don't work on ASCII; +instead the vast majority of CPAN modules are written for ASCII, and +some may happen to work on EBCDIC, while a few have been designed to +portably work on both. +.PP +If your code just uses the 52 letters A\-Z and a\-z, plus SPACE, the +digits 0\-9, and the punctuation characters that Perl uses, plus a few +controls that are denoted by escape sequences like \f(CW\*(C`\en\*(C'\fR and \f(CW\*(C`\et\*(C'\fR, then +there's nothing special about using Perl, and your code may very well +work on an ASCII machine without change. +.PP +But if you write code that uses \f(CW\*(C`\e005\*(C'\fR to mean a TAB or \f(CW\*(C`\exC1\*(C'\fR to mean +an "A", or \f(CW\*(C`\exDF\*(C'\fR to mean a "ÿ" (small \f(CW"y"\fR with a diaeresis), +then your code may well work on your EBCDIC platform, but not on an +ASCII one. That's fine to do if no one will ever want to run your code +on an ASCII platform; but the bias in this document will be towards writing +code portable between EBCDIC and ASCII systems. 
Again, if every +character you care about is easily enterable from your keyboard, you +don't have to know anything about ASCII, but many keyboards don't easily +allow you to directly enter, say, the character \f(CW\*(C`\exDF\*(C'\fR, so you have to +specify it indirectly, such as by using the \f(CW"\exDF"\fR escape sequence. +In those cases it's easiest to know something about the ASCII/Unicode +character sets. If you know that the small "ÿ" is \f(CW\*(C`U+00FF\*(C'\fR, then +you can instead specify it as \f(CW"\eN{U+FF}"\fR, and have the computer +automatically translate it to \f(CW\*(C`\exDF\*(C'\fR on your platform, and leave it as +\&\f(CW\*(C`\exFF\*(C'\fR on ASCII ones. Or you could specify it by name, \f(CW\*(C`\eN{LATIN +SMALL LETTER Y WITH DIAERESIS\*(C'\fR and not have to know the numbers. +Either way works, but both require familiarity with Unicode. +.SH "COMMON CHARACTER CODE SETS" +.IX Header "COMMON CHARACTER CODE SETS" +.SS ASCII +.IX Subsection "ASCII" +The American Standard Code for Information Interchange (ASCII or +US-ASCII) is a set of +integers running from 0 to 127 (decimal) that have standardized +interpretations by the computers which use ASCII. For example, 65 means +the letter "A". +The range 0..127 can be covered by setting various bits in a 7\-bit binary +digit, hence the set is sometimes referred to as "7\-bit ASCII". +ASCII was described by the American National Standards Institute +document ANSI X3.4\-1986. It was also described by ISO 646:1991 +(with localization for currency symbols). The full ASCII set is +given in the table below as the first 128 elements. +Languages that +can be written adequately with the characters in ASCII include +English, Hawaiian, Indonesian, Swahili and some Native American +languages. +.PP +Most non-EBCDIC character sets are supersets of ASCII. That is the +integers 0\-127 mean what ASCII says they mean. But integers 128 and +above are specific to the character set. 
+.PP +Many of these fit entirely into 8 bits, using ASCII as 0\-127, while +specifying what 128\-255 mean, and not using anything above 255. +Thus, these are single-byte (or octet if you prefer) character sets. +One important one (since Unicode is a superset of it) is the ISO 8859\-1 +character set. +.SS "ISO 8859" +.IX Subsection "ISO 8859" +The ISO 8859\-\fR\f(CB$n\fR\f(BI\fR\fI\fR are a collection of character code sets from the +International Organization for Standardization (ISO), each of which adds +characters to the ASCII set that are typically found in various +languages, many of which are based on the Roman, or Latin, alphabet. +Most are for European languages, but there are also ones for Arabic, +Greek, Hebrew, and Thai. There are good references on the web about +all these. +.SS "Latin 1 (ISO 8859\-1)" +.IX Subsection "Latin 1 (ISO 8859-1)" +A particular 8\-bit extension to ASCII that includes grave and acute +accented Latin characters. Languages that can employ ISO 8859\-1 +include all the languages covered by ASCII as well as Afrikaans, +Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian, +Portuguese, Spanish, and Swedish. Dutch is covered albeit without +the ij ligature. French is covered too but without the oe ligature. +German can use ISO 8859\-1 but must do so without German-style +quotation marks. This set is based on Western European extensions +to ASCII and is commonly encountered in world wide web work. +In IBM character code set identification terminology, ISO 8859\-1 is +also known as CCSID 819 (or sometimes 0819 or even 00819). +.SS EBCDIC +.IX Subsection "EBCDIC" +The Extended Binary Coded Decimal Interchange Code refers to a +large collection of single\- and multi-byte coded character sets that are +quite different from ASCII and ISO 8859\-1, and are all slightly +different from each other; they typically run on host computers. 
The +EBCDIC encodings derive from 8\-bit byte extensions of Hollerith punched +card encodings, which long predate ASCII. The layout on the +cards was such that high bits were set for the upper and lower case +alphabetic +characters \f(CW\*(C`[a\-z]\*(C'\fR and \f(CW\*(C`[A\-Z]\*(C'\fR, but there were gaps within each Latin +alphabet range, visible in the table below. These gaps can +cause complications. +.PP +Some IBM EBCDIC character sets may be known by character code set +identification numbers (CCSID numbers) or code page numbers. +.PP +Perl can be compiled on platforms that run any of three commonly used EBCDIC +character sets, listed below. +.PP +\fIThe 13 variant characters\fR +.IX Subsection "The 13 variant characters" +.PP +Among IBM EBCDIC character code sets there are 13 characters that +are often mapped to different integer values. Those characters +are known as the 13 "variant" characters and are: +.PP +.Vb 1 +\& \e [ ] { } ^ ~ ! # | $ @ \` +.Ve +.PP +When Perl is compiled for a platform, it looks at all of these characters to +guess which EBCDIC character set the platform uses, and adapts itself +accordingly to that platform. If the platform uses a character set that is not +one of the three Perl knows about, Perl will either fail to compile, or +mistakenly and silently choose one of the three. +.PP +The Line Feed (LF) character is actually a 14th variant character, and +Perl checks for that as well. +.PP +\fIEBCDIC code sets recognized by Perl\fR +.IX Subsection "EBCDIC code sets recognized by Perl" +.IP \fB0037\fR 4 +.IX Item "0037" +Character code set ID 0037 is a mapping of the ASCII plus Latin\-1 +characters (i.e. ISO 8859\-1) to an EBCDIC set. 0037 is used +in North American English locales on the OS/400 operating system +that runs on AS/400 computers. CCSID 0037 differs from ISO 8859\-1 +in 236 places; in other words they agree on only 20 code point values. 
+.IP \fB1047\fR 4 +.IX Item "1047" +Character code set ID 1047 is also a mapping of the ASCII plus +Latin\-1 characters (i.e. ISO 8859\-1) to an EBCDIC set. 1047 is +used under Unix System Services for OS/390 or z/OS, and OpenEdition +for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places, +and from ISO 8859\-1 in 236. +.IP \fBPOSIX-BC\fR 4 +.IX Item "POSIX-BC" +The EBCDIC code page in use on Siemens' BS2000 system is distinct from +1047 and 0037. It is identified below as the POSIX-BC set. +Like 0037 and 1047, it is the same as ISO 8859\-1 in 20 code point +values. +.SS "Unicode code points versus EBCDIC code points" +.IX Subsection "Unicode code points versus EBCDIC code points" +In Unicode terminology a \fIcode point\fR is the number assigned to a +character: for example, in EBCDIC the character "A" is usually assigned +the number 193. In Unicode, the character "A" is assigned the number 65. +All the code points in ASCII and Latin\-1 (ISO 8859\-1) have the same +meaning in Unicode. All three of the recognized EBCDIC code sets have +256 code points, and in each code set, all 256 code points are mapped to +equivalent Latin1 code points. Obviously, "A" will map to "A", "B" => +"B", "%" => "%", etc., for all printable characters in Latin1 and these +code pages. +.PP +It also turns out that EBCDIC has nearly precise equivalents for the +ASCII/Latin1 C0 controls and the DELETE control. (The C0 controls are +those whose ASCII code points are 0..0x1F; things like TAB, ACK, BEL, +etc.) A mapping is set up between these ASCII/EBCDIC controls. There +isn't such a precise mapping between the C1 controls on ASCII platforms +and the remaining EBCDIC controls. What has been done is to map these +controls, mostly arbitrarily, to some otherwise unmatched character in +the other character set. Most of these are very very rarely used +nowadays in EBCDIC anyway, and their names have been dropped, without +much complaint. 
For example the EO (Eight Ones) EBCDIC control +(consisting of eight one bits = 0xFF) is mapped to the C1 APC control +(0x9F), and you can't use the name "EO". +.PP +The EBCDIC controls provide three possible line terminator characters, +CR (0x0D), LF (0x25), and NL (0x15). On ASCII platforms, the symbols +"NL" and "LF" refer to the same character, but in strict EBCDIC +terminology they are different ones. The EBCDIC NL is mapped to the C1 +control called "NEL" ("Next Line"; here's a case where the mapping makes +quite a bit of sense, and hence isn't just arbitrary). On some EBCDIC +platforms, this NL or NEL is the typical line terminator. This is true +of z/OS and BS2000. In these platforms, the C compilers will swap the +LF and NEL code points, so that \f(CW"\en"\fR is 0x15, and refers to NL. Perl +does that too; you can see it in the code chart below. +This makes things generally "just work" without you even having to be +aware that there is a swap. +.SS "Unicode and UTF" +.IX Subsection "Unicode and UTF" +UTF stands for "Unicode Transformation Format". +UTF\-8 is an encoding of Unicode into a sequence of 8\-bit byte chunks, based on +ASCII and Latin\-1. +The length of a sequence required to represent a Unicode code point +depends on the ordinal number of that code point, +with larger numbers requiring more bytes. +UTF-EBCDIC is like UTF\-8, but based on EBCDIC. +They are enough alike that often, casual usage will conflate the two +terms, and use "UTF\-8" to mean both the UTF\-8 found on ASCII platforms, +and the UTF-EBCDIC found on EBCDIC ones. +.PP +You may see the term "invariant" character or code point. +This simply means that the character has the same numeric +value and representation when encoded in UTF\-8 (or UTF-EBCDIC) as when +not. (Note that this is a very different concept from "The 13 variant +characters" mentioned above. Careful prose will use the term "UTF\-8 +invariant" instead of just "invariant", but most often you'll see just +"invariant".) 
For example, the ordinal value of "A" is 193 in most +EBCDIC code pages, and also is 193 when encoded in UTF-EBCDIC. All +UTF\-8 (or UTF-EBCDIC) variant code points occupy at least two bytes when +encoded in UTF\-8 (or UTF-EBCDIC); by definition, the UTF\-8 (or +UTF-EBCDIC) invariant code points are exactly one byte whether encoded +in UTF\-8 (or UTF-EBCDIC), or not. (By now you see why people typically +just say "UTF\-8" when they also mean "UTF-EBCDIC". For the rest of this +document, we'll mostly be casual about it too.) +In ASCII UTF\-8, the code points corresponding to the lowest 128 +ordinal numbers (0 \- 127: the ASCII characters) are invariant. +In UTF-EBCDIC, there are 160 invariant characters. +(If you care, the EBCDIC invariants are those characters +which have ASCII equivalents, plus those that correspond to +the C1 controls (128 \- 159 on ASCII platforms).) +.PP +A string encoded in UTF-EBCDIC may be longer (very rarely shorter) than +one encoded in UTF\-8. Perl extends both UTF\-8 and UTF-EBCDIC so that +they can encode code points above the Unicode maximum of U+10FFFF. Both +extensions are constructed to allow encoding of any code point that fits +in a 64\-bit word. +.PP +UTF-EBCDIC is defined by +Unicode Technical Report #16 <https://www.unicode.org/reports/tr16> +(often referred to as just TR16). +It is defined based on CCSID 1047, not allowing for the differences for +other code pages. This allows for easy interchange of text between +computers running different code pages, but makes it unusable, without +adaptation, for Perl on those other code pages. +.PP +The reason for this unusability is that a fundamental assumption of Perl +is that the characters it cares about for parsing and lexical analysis +are the same whether or not the text is in UTF\-8. For example, Perl +expects the character \f(CW"["\fR to have the same representation, no matter +if the string containing it (or program text) is UTF\-8 encoded or not. 
+To ensure this, Perl adapts UTF-EBCDIC to the particular code page so +that all characters it expects to be UTF\-8 invariant are in fact UTF\-8 +invariant. This means that text generated on a computer running one +version of Perl's UTF-EBCDIC has to be translated to be intelligible to +a computer running another. +.PP +TR16 implies a method to extend UTF-EBCDIC to encode points up through +\&\f(CW\*(C`2\ **\ 31\ \-\ 1\*(C'\fR. Perl uses this method for code points up through +\&\f(CW\*(C`2\ **\ 30\ \-\ 1\*(C'\fR, but uses an incompatible method for larger ones, to +enable it to handle much larger code points than otherwise. +.SS "Using Encode" +.IX Subsection "Using Encode" +Starting from Perl 5.8 you can use the standard module Encode +to translate from EBCDIC to Latin\-1 code points. +Encode knows about more EBCDIC character sets than Perl can currently +be compiled to run on. +.PP +.Vb 1 +\& use Encode \*(Aqfrom_to\*(Aq; +\& +\& my %ebcdic = ( 176 => \*(Aqcp37\*(Aq, 95 => \*(Aqcp1047\*(Aq, 106 => \*(Aqposix\-bc\*(Aq ); +\& +\& # $a is in EBCDIC code points +\& from_to($a, $ebcdic{ord \*(Aq^\*(Aq}, \*(Aqlatin1\*(Aq); +\& # $a is ISO 8859\-1 code points +.Ve +.PP +and from Latin\-1 code points to EBCDIC code points +.PP +.Vb 1 +\& use Encode \*(Aqfrom_to\*(Aq; +\& +\& my %ebcdic = ( 176 => \*(Aqcp37\*(Aq, 95 => \*(Aqcp1047\*(Aq, 106 => \*(Aqposix\-bc\*(Aq ); +\& +\& # $a is ISO 8859\-1 code points +\& from_to($a, \*(Aqlatin1\*(Aq, $ebcdic{ord \*(Aq^\*(Aq}); +\& # $a is in EBCDIC code points +.Ve +.PP +For doing I/O it is suggested that you use the autotranslating features +of PerlIO, see perluniintro. +.PP +Since version 5.8 Perl uses the PerlIO I/O library. This enables +you to use different encodings per IO channel. 
For example you may use +.PP +.Vb 9 +\& use Encode; +\& open($f, ">:encoding(ascii)", "test.ascii"); +\& print $f "Hello World!\en"; +\& open($f, ">:encoding(cp37)", "test.ebcdic"); +\& print $f "Hello World!\en"; +\& open($f, ">:encoding(latin1)", "test.latin1"); +\& print $f "Hello World!\en"; +\& open($f, ">:encoding(utf8)", "test.utf8"); +\& print $f "Hello World!\en"; +.Ve +.PP +to get four files containing "Hello World!\en" in ASCII, CP 0037 EBCDIC, +ISO 8859\-1 (Latin\-1) (in this example identical to ASCII since only ASCII +characters were printed), and +UTF-EBCDIC (in this example identical to normal EBCDIC since only characters +that don't differ between EBCDIC and UTF-EBCDIC were printed). See the +documentation of Encode::PerlIO for details. +.PP +As the PerlIO layer uses raw IO (bytes) internally, all this totally +ignores things like the type of your filesystem (ASCII or EBCDIC). +.SH "SINGLE OCTET TABLES" +.IX Header "SINGLE OCTET TABLES" +The following tables list the ASCII and Latin 1 ordered sets including +the subsets: C0 controls (0..31), ASCII graphics (32..7e), delete (7f), +C1 controls (80..9f), and Latin\-1 (a.k.a. ISO 8859\-1) (a0..ff). In the +table names of the Latin 1 +extensions to ASCII have been labelled with character names roughly +corresponding to \fIThe Unicode Standard, Version 6.1\fR albeit with +substitutions such as \f(CW\*(C`s/LATIN//\*(C'\fR and \f(CW\*(C`s/VULGAR//\*(C'\fR in all cases; +\&\f(CW\*(C`s/CAPITAL\ LETTER//\*(C'\fR in some cases; and +\&\f(CW\*(C`s/SMALL\ LETTER\ ([A\-Z])/\el$1/\*(C'\fR in some other +cases. Controls are listed using their Unicode 6.2 abbreviations. +The differences between the 0037 and 1047 sets are +flagged with \f(CW\*(C`**\*(C'\fR. The differences between the 1047 and POSIX-BC sets +are flagged with \f(CW\*(C`##.\*(C'\fR All \f(CWord()\fR numbers listed are decimal. 
If you +would rather see this table listing octal values, then run the table +(that is, the pod source text of this document, since this recipe may not +work with a pod2_other_format translation) through: +.IP "recipe 0" 4 +.IX Item "recipe 0" +.PP +.Vb 3 +\& perl \-ne \*(Aqif(/(.{29})(\ed+)\es+(\ed+)\es+(\ed+)\es+(\ed+)/)\*(Aq \e +\& \-e \*(Aq{printf("%s%\-5.03o%\-5.03o%\-5.03o%.03o\en",$1,$2,$3,$4,$5)}\*(Aq \e +\& perlebcdic.pod +.Ve +.PP +If you want to retain the UTF-x code points then in script form you +might want to write: +.IP "recipe 1" 4 +.IX Item "recipe 1" +.PP +.Vb 10 +\& open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!"; +\& while (<FH>) { +\& if (/(.{29})(\ed+)\es+(\ed+)\es+(\ed+)\es+(\ed+)\es+(\ed+)\e.?(\ed*) +\& \es+(\ed+)\e.?(\ed*)/x) +\& { +\& if ($7 ne \*(Aq\*(Aq && $9 ne \*(Aq\*(Aq) { +\& printf( +\& "%s%\-5.03o%\-5.03o%\-5.03o%\-5.03o%\-3o.%\-5o%\-3o.%.03o\en", +\& $1,$2,$3,$4,$5,$6,$7,$8,$9); +\& } +\& elsif ($7 ne \*(Aq\*(Aq) { +\& printf("%s%\-5.03o%\-5.03o%\-5.03o%\-5.03o%\-3o.%\-5o%.03o\en", +\& $1,$2,$3,$4,$5,$6,$7,$8); +\& } +\& else { +\& printf("%s%\-5.03o%\-5.03o%\-5.03o%\-5.03o%\-5.03o%.03o\en", +\& $1,$2,$3,$4,$5,$6,$8); +\& } +\& } +\& } +.Ve +.PP +If you would rather see this table listing hexadecimal values then +run the table through: +.IP "recipe 2" 4 +.IX Item "recipe 2" +.PP +.Vb 3 +\& perl \-ne \*(Aqif(/(.{29})(\ed+)\es+(\ed+)\es+(\ed+)\es+(\ed+)/)\*(Aq \e +\& \-e \*(Aq{printf("%s%\-5.02X%\-5.02X%\-5.02X%.02X\en",$1,$2,$3,$4,$5)}\*(Aq \e +\& perlebcdic.pod +.Ve +.PP +Or, in order to retain the UTF-x code points in hexadecimal: +.IP "recipe 3" 4 +.IX Item "recipe 3" +.PP +.Vb 10 +\& open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!"; +\& while (<FH>) { +\& if (/(.{29})(\ed+)\es+(\ed+)\es+(\ed+)\es+(\ed+)\es+(\ed+)\e.?(\ed*) +\& \es+(\ed+)\e.?(\ed*)/x) +\& { +\& if ($7 ne \*(Aq\*(Aq && $9 ne \*(Aq\*(Aq) { +\& printf( +\& "%s%\-5.02X%\-5.02X%\-5.02X%\-5.02X%\-2X.%\-6.02X%02X.%02X\en", +\& 
$1,$2,$3,$4,$5,$6,$7,$8,$9); +\& } +\& elsif ($7 ne \*(Aq\*(Aq) { +\& printf("%s%\-5.02X%\-5.02X%\-5.02X%\-5.02X%\-2X.%\-6.02X%02X\en", +\& $1,$2,$3,$4,$5,$6,$7,$8); +\& } +\& else { +\& printf("%s%\-5.02X%\-5.02X%\-5.02X%\-5.02X%\-5.02X%02X\en", +\& $1,$2,$3,$4,$5,$6,$8); +\& } +\& } +\& } +\& +\& +\& ISO +\& 8859\-1 POS\- CCSID +\& CCSID CCSID CCSID IX\- 1047 +\& chr 0819 0037 1047 BC UTF\-8 UTF\-EBCDIC +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& <NUL> 0 0 0 0 0 0 +\& <SOH> 1 1 1 1 1 1 +\& <STX> 2 2 2 2 2 2 +\& <ETX> 3 3 3 3 3 3 +\& <EOT> 4 55 55 55 4 55 +\& <ENQ> 5 45 45 45 5 45 +\& <ACK> 6 46 46 46 6 46 +\& <BEL> 7 47 47 47 7 47 +\& <BS> 8 22 22 22 8 22 +\& <HT> 9 5 5 5 9 5 +\& <LF> 10 37 21 21 10 21 ** +\& <VT> 11 11 11 11 11 11 +\& <FF> 12 12 12 12 12 12 +\& <CR> 13 13 13 13 13 13 +\& <SO> 14 14 14 14 14 14 +\& <SI> 15 15 15 15 15 15 +\& <DLE> 16 16 16 16 16 16 +\& <DC1> 17 17 17 17 17 17 +\& <DC2> 18 18 18 18 18 18 +\& <DC3> 19 19 19 19 19 19 +\& <DC4> 20 60 60 60 20 60 +\& <NAK> 21 61 61 61 21 61 +\& <SYN> 22 50 50 50 22 50 +\& <ETB> 23 38 38 38 23 38 +\& <CAN> 24 24 24 24 24 24 +\& <EOM> 25 25 25 25 25 25 +\& <SUB> 26 63 63 63 26 63 +\& <ESC> 27 39 39 39 27 39 +\& <FS> 28 28 28 28 28 28 +\& <GS> 29 29 29 29 29 29 +\& <RS> 30 30 30 30 30 30 +\& <US> 31 31 31 31 31 31 +\& <SPACE> 32 64 64 64 32 64 +\& ! 33 90 90 90 33 90 +\& " 34 127 127 127 34 127 +\& # 35 123 123 123 35 123 +\& $ 36 91 91 91 36 91 +\& % 37 108 108 108 37 108 +\& & 38 80 80 80 38 80 +\& \*(Aq 39 125 125 125 39 125 +\& ( 40 77 77 77 40 77 +\& ) 41 93 93 93 41 93 +\& * 42 92 92 92 42 92 +\& + 43 78 78 78 43 78 +\& , 44 107 107 107 44 107 +\& \- 45 96 96 96 45 96 +\& . 
46 75 75 75 46 75 +\& / 47 97 97 97 47 97 +\& 0 48 240 240 240 48 240 +\& 1 49 241 241 241 49 241 +\& 2 50 242 242 242 50 242 +\& 3 51 243 243 243 51 243 +\& 4 52 244 244 244 52 244 +\& 5 53 245 245 245 53 245 +\& 6 54 246 246 246 54 246 +\& 7 55 247 247 247 55 247 +\& 8 56 248 248 248 56 248 +\& 9 57 249 249 249 57 249 +\& : 58 122 122 122 58 122 +\& ; 59 94 94 94 59 94 +\& < 60 76 76 76 60 76 +\& = 61 126 126 126 61 126 +\& > 62 110 110 110 62 110 +\& ? 63 111 111 111 63 111 +\& @ 64 124 124 124 64 124 +\& A 65 193 193 193 65 193 +\& B 66 194 194 194 66 194 +\& C 67 195 195 195 67 195 +\& D 68 196 196 196 68 196 +\& E 69 197 197 197 69 197 +\& F 70 198 198 198 70 198 +\& G 71 199 199 199 71 199 +\& H 72 200 200 200 72 200 +\& I 73 201 201 201 73 201 +\& J 74 209 209 209 74 209 +\& K 75 210 210 210 75 210 +\& L 76 211 211 211 76 211 +\& M 77 212 212 212 77 212 +\& N 78 213 213 213 78 213 +\& O 79 214 214 214 79 214 +\& P 80 215 215 215 80 215 +\& Q 81 216 216 216 81 216 +\& R 82 217 217 217 82 217 +\& S 83 226 226 226 83 226 +\& T 84 227 227 227 84 227 +\& U 85 228 228 228 85 228 +\& V 86 229 229 229 86 229 +\& W 87 230 230 230 87 230 +\& X 88 231 231 231 88 231 +\& Y 89 232 232 232 89 232 +\& Z 90 233 233 233 90 233 +\& [ 91 186 173 187 91 173 ** ## +\& \e 92 224 224 188 92 224 ## +\& ] 93 187 189 189 93 189 ** +\& ^ 94 176 95 106 94 95 ** ## +\& _ 95 109 109 109 95 109 +\& \` 96 121 121 74 96 121 ## +\& a 97 129 129 129 97 129 +\& b 98 130 130 130 98 130 +\& c 99 131 131 131 99 131 +\& d 100 132 132 132 100 132 +\& e 101 133 133 133 101 133 +\& f 102 134 134 134 102 134 +\& g 103 135 135 135 103 135 +\& h 104 136 136 136 104 136 +\& i 105 137 137 137 105 137 +\& j 106 145 145 145 106 145 +\& k 107 146 146 146 107 146 +\& l 108 147 147 147 108 147 +\& m 109 148 148 148 109 148 +\& n 110 149 149 149 110 149 +\& o 111 150 150 150 111 150 +\& p 112 151 151 151 112 151 +\& q 113 152 152 152 113 152 +\& r 114 153 153 153 114 153 +\& s 115 162 162 162 115 162 +\& t 116 
163 163 163 116 163 +\& u 117 164 164 164 117 164 +\& v 118 165 165 165 118 165 +\& w 119 166 166 166 119 166 +\& x 120 167 167 167 120 167 +\& y 121 168 168 168 121 168 +\& z 122 169 169 169 122 169 +\& { 123 192 192 251 123 192 ## +\& | 124 79 79 79 124 79 +\& } 125 208 208 253 125 208 ## +\& ~ 126 161 161 255 126 161 ## +\& <DEL> 127 7 7 7 127 7 +\& <PAD> 128 32 32 32 194.128 32 +\& <HOP> 129 33 33 33 194.129 33 +\& <BPH> 130 34 34 34 194.130 34 +\& <NBH> 131 35 35 35 194.131 35 +\& <IND> 132 36 36 36 194.132 36 +\& <NEL> 133 21 37 37 194.133 37 ** +\& <SSA> 134 6 6 6 194.134 6 +\& <ESA> 135 23 23 23 194.135 23 +\& <HTS> 136 40 40 40 194.136 40 +\& <HTJ> 137 41 41 41 194.137 41 +\& <VTS> 138 42 42 42 194.138 42 +\& <PLD> 139 43 43 43 194.139 43 +\& <PLU> 140 44 44 44 194.140 44 +\& <RI> 141 9 9 9 194.141 9 +\& <SS2> 142 10 10 10 194.142 10 +\& <SS3> 143 27 27 27 194.143 27 +\& <DCS> 144 48 48 48 194.144 48 +\& <PU1> 145 49 49 49 194.145 49 +\& <PU2> 146 26 26 26 194.146 26 +\& <STS> 147 51 51 51 194.147 51 +\& <CCH> 148 52 52 52 194.148 52 +\& <MW> 149 53 53 53 194.149 53 +\& <SPA> 150 54 54 54 194.150 54 +\& <EPA> 151 8 8 8 194.151 8 +\& <SOS> 152 56 56 56 194.152 56 +\& <SGC> 153 57 57 57 194.153 57 +\& <SCI> 154 58 58 58 194.154 58 +\& <CSI> 155 59 59 59 194.155 59 +\& <ST> 156 4 4 4 194.156 4 +\& <OSC> 157 20 20 20 194.157 20 +\& <PM> 158 62 62 62 194.158 62 +\& <APC> 159 255 255 95 194.159 255 ## +\& <NON\-BREAKING SPACE> 160 65 65 65 194.160 128.65 +\& <INVERTED "!" 
> 161 170 170 170 194.161 128.66 +\& <CENT SIGN> 162 74 74 176 194.162 128.67 ## +\& <POUND SIGN> 163 177 177 177 194.163 128.68 +\& <CURRENCY SIGN> 164 159 159 159 194.164 128.69 +\& <YEN SIGN> 165 178 178 178 194.165 128.70 +\& <BROKEN BAR> 166 106 106 208 194.166 128.71 ## +\& <SECTION SIGN> 167 181 181 181 194.167 128.72 +\& <DIAERESIS> 168 189 187 121 194.168 128.73 ** ## +\& <COPYRIGHT SIGN> 169 180 180 180 194.169 128.74 +\& <FEMININE ORDINAL> 170 154 154 154 194.170 128.81 +\& <LEFT POINTING GUILLEMET> 171 138 138 138 194.171 128.82 +\& <NOT SIGN> 172 95 176 186 194.172 128.83 ** ## +\& <SOFT HYPHEN> 173 202 202 202 194.173 128.84 +\& <REGISTERED TRADE MARK> 174 175 175 175 194.174 128.85 +\& <MACRON> 175 188 188 161 194.175 128.86 ## +\& <DEGREE SIGN> 176 144 144 144 194.176 128.87 +\& <PLUS\-OR\-MINUS SIGN> 177 143 143 143 194.177 128.88 +\& <SUPERSCRIPT TWO> 178 234 234 234 194.178 128.89 +\& <SUPERSCRIPT THREE> 179 250 250 250 194.179 128.98 +\& <ACUTE ACCENT> 180 190 190 190 194.180 128.99 +\& <MICRO SIGN> 181 160 160 160 194.181 128.100 +\& <PARAGRAPH SIGN> 182 182 182 182 194.182 128.101 +\& <MIDDLE DOT> 183 179 179 179 194.183 128.102 +\& <CEDILLA> 184 157 157 157 194.184 128.103 +\& <SUPERSCRIPT ONE> 185 218 218 218 194.185 128.104 +\& <MASC. 
ORDINAL INDICATOR> 186 155 155 155 194.186 128.105 +\& <RIGHT POINTING GUILLEMET> 187 139 139 139 194.187 128.106 +\& <FRACTION ONE QUARTER> 188 183 183 183 194.188 128.112 +\& <FRACTION ONE HALF> 189 184 184 184 194.189 128.113 +\& <FRACTION THREE QUARTERS> 190 185 185 185 194.190 128.114 +\& <INVERTED QUESTION MARK> 191 171 171 171 194.191 128.115 +\& <A WITH GRAVE> 192 100 100 100 195.128 138.65 +\& <A WITH ACUTE> 193 101 101 101 195.129 138.66 +\& <A WITH CIRCUMFLEX> 194 98 98 98 195.130 138.67 +\& <A WITH TILDE> 195 102 102 102 195.131 138.68 +\& <A WITH DIAERESIS> 196 99 99 99 195.132 138.69 +\& <A WITH RING ABOVE> 197 103 103 103 195.133 138.70 +\& <CAPITAL LIGATURE AE> 198 158 158 158 195.134 138.71 +\& <C WITH CEDILLA> 199 104 104 104 195.135 138.72 +\& <E WITH GRAVE> 200 116 116 116 195.136 138.73 +\& <E WITH ACUTE> 201 113 113 113 195.137 138.74 +\& <E WITH CIRCUMFLEX> 202 114 114 114 195.138 138.81 +\& <E WITH DIAERESIS> 203 115 115 115 195.139 138.82 +\& <I WITH GRAVE> 204 120 120 120 195.140 138.83 +\& <I WITH ACUTE> 205 117 117 117 195.141 138.84 +\& <I WITH CIRCUMFLEX> 206 118 118 118 195.142 138.85 +\& <I WITH DIAERESIS> 207 119 119 119 195.143 138.86 +\& <CAPITAL LETTER ETH> 208 172 172 172 195.144 138.87 +\& <N WITH TILDE> 209 105 105 105 195.145 138.88 +\& <O WITH GRAVE> 210 237 237 237 195.146 138.89 +\& <O WITH ACUTE> 211 238 238 238 195.147 138.98 +\& <O WITH CIRCUMFLEX> 212 235 235 235 195.148 138.99 +\& <O WITH TILDE> 213 239 239 239 195.149 138.100 +\& <O WITH DIAERESIS> 214 236 236 236 195.150 138.101 +\& <MULTIPLICATION SIGN> 215 191 191 191 195.151 138.102 +\& <O WITH STROKE> 216 128 128 128 195.152 138.103 +\& <U WITH GRAVE> 217 253 253 224 195.153 138.104 ## +\& <U WITH ACUTE> 218 254 254 254 195.154 138.105 +\& <U WITH CIRCUMFLEX> 219 251 251 221 195.155 138.106 ## +\& <U WITH DIAERESIS> 220 252 252 252 195.156 138.112 +\& <Y WITH ACUTE> 221 173 186 173 195.157 138.113 ** ## +\& <CAPITAL LETTER THORN> 222 174 174 174 195.158 138.114 
+\& <SMALL LETTER SHARP S> 223 89 89 89 195.159 138.115 +\& <a WITH GRAVE> 224 68 68 68 195.160 139.65 +\& <a WITH ACUTE> 225 69 69 69 195.161 139.66 +\& <a WITH CIRCUMFLEX> 226 66 66 66 195.162 139.67 +\& <a WITH TILDE> 227 70 70 70 195.163 139.68 +\& <a WITH DIAERESIS> 228 67 67 67 195.164 139.69 +\& <a WITH RING ABOVE> 229 71 71 71 195.165 139.70 +\& <SMALL LIGATURE ae> 230 156 156 156 195.166 139.71 +\& <c WITH CEDILLA> 231 72 72 72 195.167 139.72 +\& <e WITH GRAVE> 232 84 84 84 195.168 139.73 +\& <e WITH ACUTE> 233 81 81 81 195.169 139.74 +\& <e WITH CIRCUMFLEX> 234 82 82 82 195.170 139.81 +\& <e WITH DIAERESIS> 235 83 83 83 195.171 139.82 +\& <i WITH GRAVE> 236 88 88 88 195.172 139.83 +\& <i WITH ACUTE> 237 85 85 85 195.173 139.84 +\& <i WITH CIRCUMFLEX> 238 86 86 86 195.174 139.85 +\& <i WITH DIAERESIS> 239 87 87 87 195.175 139.86 +\& <SMALL LETTER eth> 240 140 140 140 195.176 139.87 +\& <n WITH TILDE> 241 73 73 73 195.177 139.88 +\& <o WITH GRAVE> 242 205 205 205 195.178 139.89 +\& <o WITH ACUTE> 243 206 206 206 195.179 139.98 +\& <o WITH CIRCUMFLEX> 244 203 203 203 195.180 139.99 +\& <o WITH TILDE> 245 207 207 207 195.181 139.100 +\& <o WITH DIAERESIS> 246 204 204 204 195.182 139.101 +\& <DIVISION SIGN> 247 225 225 225 195.183 139.102 +\& <o WITH STROKE> 248 112 112 112 195.184 139.103 +\& <u WITH GRAVE> 249 221 221 192 195.185 139.104 ## +\& <u WITH ACUTE> 250 222 222 222 195.186 139.105 +\& <u WITH CIRCUMFLEX> 251 219 219 219 195.187 139.106 +\& <u WITH DIAERESIS> 252 220 220 220 195.188 139.112 +\& <y WITH ACUTE> 253 141 141 141 195.189 139.113 +\& <SMALL LETTER thorn> 254 142 142 142 195.190 139.114 +\& <y WITH DIAERESIS> 255 223 223 223 195.191 139.115 +.Ve +.PP +If you would rather see the above table in CCSID 0037 order rather than +ASCII + Latin\-1 order then run the table through: +.IP "recipe 4" 4 +.IX Item "recipe 4" +.PP +.Vb 6 +\& perl \e +\& \-ne \*(Aqif(/.{29}\ed{1,3}\es{2,4}\ed{1,3}\es{2,4}\ed{1,3}\es{2,4}\ed{1,3}/)\*(Aq\e +\& \-e 
\*(Aq{push(@l,$_)}\*(Aq \e +\& \-e \*(AqEND{print map{$_\->[0]}\*(Aq \e +\& \-e \*(Aq sort{$a\->[1] <=> $b\->[1]}\*(Aq \e +\& \-e \*(Aq map{[$_,substr($_,34,3)]}@l;}\*(Aq perlebcdic.pod +.Ve +.PP +If you would rather see it in CCSID 1047 order then change the number +34 in the last line to 39, like this: +.IP "recipe 5" 4 +.IX Item "recipe 5" +.PP +.Vb 6 +\& perl \e +\& \-ne \*(Aqif(/.{29}\ed{1,3}\es{2,4}\ed{1,3}\es{2,4}\ed{1,3}\es{2,4}\ed{1,3}/)\*(Aq\e +\& \-e \*(Aq{push(@l,$_)}\*(Aq \e +\& \-e \*(AqEND{print map{$_\->[0]}\*(Aq \e +\& \-e \*(Aq sort{$a\->[1] <=> $b\->[1]}\*(Aq \e +\& \-e \*(Aq map{[$_,substr($_,39,3)]}@l;}\*(Aq perlebcdic.pod +.Ve +.PP +If you would rather see it in POSIX-BC order then change the number +34 in the last line to 44, like this: +.IP "recipe 6" 4 +.IX Item "recipe 6" +.PP +.Vb 6 +\& perl \e +\& \-ne \*(Aqif(/.{29}\ed{1,3}\es{2,4}\ed{1,3}\es{2,4}\ed{1,3}\es{2,4}\ed{1,3}/)\*(Aq\e +\& \-e \*(Aq{push(@l,$_)}\*(Aq \e +\& \-e \*(AqEND{print map{$_\->[0]}\*(Aq \e +\& \-e \*(Aq sort{$a\->[1] <=> $b\->[1]}\*(Aq \e +\& \-e \*(Aq map{[$_,substr($_,44,3)]}@l;}\*(Aq perlebcdic.pod +.Ve +.SS "Table in hex, sorted in 1047 order" +.IX Subsection "Table in hex, sorted in 1047 order" +Since this document was first written, the convention has become more +and more to use hexadecimal notation for code points. To do this with +the recipes and to also sort is a multi-step process, so here, for +convenience, is the table from above, re-sorted to be in Code Page 1047 +order, and using hex notation. 
+.PP +.Vb 10 +\& ISO +\& 8859\-1 POS\- CCSID +\& CCSID CCSID CCSID IX\- 1047 +\& chr 0819 0037 1047 BC UTF\-8 UTF\-EBCDIC +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& <NUL> 00 00 00 00 00 00 +\& <SOH> 01 01 01 01 01 01 +\& <STX> 02 02 02 02 02 02 +\& <ETX> 03 03 03 03 03 03 +\& <ST> 9C 04 04 04 C2.9C 04 +\& <HT> 09 05 05 05 09 05 +\& <SSA> 86 06 06 06 C2.86 06 +\& <DEL> 7F 07 07 07 7F 07 +\& <EPA> 97 08 08 08 C2.97 08 +\& <RI> 8D 09 09 09 C2.8D 09 +\& <SS2> 8E 0A 0A 0A C2.8E 0A +\& <VT> 0B 0B 0B 0B 0B 0B +\& <FF> 0C 0C 0C 0C 0C 0C +\& <CR> 0D 0D 0D 0D 0D 0D +\& <SO> 0E 0E 0E 0E 0E 0E +\& <SI> 0F 0F 0F 0F 0F 0F +\& <DLE> 10 10 10 10 10 10 +\& <DC1> 11 11 11 11 11 11 +\& <DC2> 12 12 12 12 12 12 +\& <DC3> 13 13 13 13 13 13 +\& <OSC> 9D 14 14 14 C2.9D 14 +\& <LF> 0A 25 15 15 0A 15 ** +\& <BS> 08 16 16 16 08 16 +\& <ESA> 87 17 17 17 C2.87 17 +\& <CAN> 18 18 18 18 18 18 +\& <EOM> 19 19 19 19 19 19 +\& <PU2> 92 1A 1A 1A C2.92 1A +\& <SS3> 8F 1B 1B 1B C2.8F 1B +\& <FS> 1C 1C 1C 1C 1C 1C +\& <GS> 1D 1D 1D 1D 1D 1D +\& <RS> 1E 1E 1E 1E 1E 1E +\& <US> 1F 1F 1F 1F 1F 1F +\& <PAD> 80 20 20 20 C2.80 20 +\& <HOP> 81 21 21 21 C2.81 21 +\& <BPH> 82 22 22 22 C2.82 22 +\& <NBH> 83 23 23 23 C2.83 23 +\& <IND> 84 24 24 24 C2.84 24 +\& <NEL> 85 15 25 25 C2.85 25 ** +\& <ETB> 17 26 26 26 17 26 +\& <ESC> 1B 27 27 27 1B 27 +\& <HTS> 88 28 28 28 C2.88 28 +\& <HTJ> 89 29 29 29 C2.89 29 +\& <VTS> 8A 2A 2A 2A C2.8A 2A +\& <PLD> 8B 2B 2B 2B C2.8B 2B +\& <PLU> 8C 2C 2C 2C C2.8C 2C +\& <ENQ> 05 2D 2D 2D 05 2D +\& <ACK> 06 2E 2E 2E 06 2E +\& <BEL> 07 2F 2F 2F 07 2F +\& <DCS> 90 30 30 30 C2.90 30 +\& <PU1> 91 31 31 31 C2.91 31 +\& <SYN> 16 32 32 32 16 32 +\& <STS> 93 33 33 33 C2.93 33 +\& <CCH> 94 34 34 34 C2.94 34 +\& <MW> 95 35 35 35 C2.95 35 +\& <SPA> 96 36 36 36 C2.96 36 +\& <EOT> 04 37 37 37 04 37 +\& <SOS> 98 38 38 38 C2.98 38 +\& <SGC> 99 39 39 39 C2.99 39 +\& <SCI> 9A 3A 3A 3A C2.9A 3A +\& 
<CSI> 9B 3B 3B 3B C2.9B 3B +\& <DC4> 14 3C 3C 3C 14 3C +\& <NAK> 15 3D 3D 3D 15 3D +\& <PM> 9E 3E 3E 3E C2.9E 3E +\& <SUB> 1A 3F 3F 3F 1A 3F +\& <SPACE> 20 40 40 40 20 40 +\& <NON\-BREAKING SPACE> A0 41 41 41 C2.A0 80.41 +\& <a WITH CIRCUMFLEX> E2 42 42 42 C3.A2 8B.43 +\& <a WITH DIAERESIS> E4 43 43 43 C3.A4 8B.45 +\& <a WITH GRAVE> E0 44 44 44 C3.A0 8B.41 +\& <a WITH ACUTE> E1 45 45 45 C3.A1 8B.42 +\& <a WITH TILDE> E3 46 46 46 C3.A3 8B.44 +\& <a WITH RING ABOVE> E5 47 47 47 C3.A5 8B.46 +\& <c WITH CEDILLA> E7 48 48 48 C3.A7 8B.48 +\& <n WITH TILDE> F1 49 49 49 C3.B1 8B.58 +\& <CENT SIGN> A2 4A 4A B0 C2.A2 80.43 ## +\& . 2E 4B 4B 4B 2E 4B +\& < 3C 4C 4C 4C 3C 4C +\& ( 28 4D 4D 4D 28 4D +\& + 2B 4E 4E 4E 2B 4E +\& | 7C 4F 4F 4F 7C 4F +\& & 26 50 50 50 26 50 +\& <e WITH ACUTE> E9 51 51 51 C3.A9 8B.4A +\& <e WITH CIRCUMFLEX> EA 52 52 52 C3.AA 8B.51 +\& <e WITH DIAERESIS> EB 53 53 53 C3.AB 8B.52 +\& <e WITH GRAVE> E8 54 54 54 C3.A8 8B.49 +\& <i WITH ACUTE> ED 55 55 55 C3.AD 8B.54 +\& <i WITH CIRCUMFLEX> EE 56 56 56 C3.AE 8B.55 +\& <i WITH DIAERESIS> EF 57 57 57 C3.AF 8B.56 +\& <i WITH GRAVE> EC 58 58 58 C3.AC 8B.53 +\& <SMALL LETTER SHARP S> DF 59 59 59 C3.9F 8A.73 +\& ! 21 5A 5A 5A 21 5A +\& $ 24 5B 5B 5B 24 5B +\& * 2A 5C 5C 5C 2A 5C +\& ) 29 5D 5D 5D 29 5D +\& ; 3B 5E 5E 5E 3B 5E +\& ^ 5E B0 5F 6A 5E 5F ** ## +\& \- 2D 60 60 60 2D 60 +\& / 2F 61 61 61 2F 61 +\& <A WITH CIRCUMFLEX> C2 62 62 62 C3.82 8A.43 +\& <A WITH DIAERESIS> C4 63 63 63 C3.84 8A.45 +\& <A WITH GRAVE> C0 64 64 64 C3.80 8A.41 +\& <A WITH ACUTE> C1 65 65 65 C3.81 8A.42 +\& <A WITH TILDE> C3 66 66 66 C3.83 8A.44 +\& <A WITH RING ABOVE> C5 67 67 67 C3.85 8A.46 +\& <C WITH CEDILLA> C7 68 68 68 C3.87 8A.48 +\& <N WITH TILDE> D1 69 69 69 C3.91 8A.58 +\& <BROKEN BAR> A6 6A 6A D0 C2.A6 80.47 ## +\& , 2C 6B 6B 6B 2C 6B +\& % 25 6C 6C 6C 25 6C +\& _ 5F 6D 6D 6D 5F 6D +\& > 3E 6E 6E 6E 3E 6E +\& ? 
3F 6F 6F 6F 3F 6F +\& <o WITH STROKE> F8 70 70 70 C3.B8 8B.67 +\& <E WITH ACUTE> C9 71 71 71 C3.89 8A.4A +\& <E WITH CIRCUMFLEX> CA 72 72 72 C3.8A 8A.51 +\& <E WITH DIAERESIS> CB 73 73 73 C3.8B 8A.52 +\& <E WITH GRAVE> C8 74 74 74 C3.88 8A.49 +\& <I WITH ACUTE> CD 75 75 75 C3.8D 8A.54 +\& <I WITH CIRCUMFLEX> CE 76 76 76 C3.8E 8A.55 +\& <I WITH DIAERESIS> CF 77 77 77 C3.8F 8A.56 +\& <I WITH GRAVE> CC 78 78 78 C3.8C 8A.53 +\& \` 60 79 79 4A 60 79 ## +\& : 3A 7A 7A 7A 3A 7A +\& # 23 7B 7B 7B 23 7B +\& @ 40 7C 7C 7C 40 7C +\& \*(Aq 27 7D 7D 7D 27 7D +\& = 3D 7E 7E 7E 3D 7E +\& " 22 7F 7F 7F 22 7F +\& <O WITH STROKE> D8 80 80 80 C3.98 8A.67 +\& a 61 81 81 81 61 81 +\& b 62 82 82 82 62 82 +\& c 63 83 83 83 63 83 +\& d 64 84 84 84 64 84 +\& e 65 85 85 85 65 85 +\& f 66 86 86 86 66 86 +\& g 67 87 87 87 67 87 +\& h 68 88 88 88 68 88 +\& i 69 89 89 89 69 89 +\& <LEFT POINTING GUILLEMET> AB 8A 8A 8A C2.AB 80.52 +\& <RIGHT POINTING GUILLEMET> BB 8B 8B 8B C2.BB 80.6A +\& <SMALL LETTER eth> F0 8C 8C 8C C3.B0 8B.57 +\& <y WITH ACUTE> FD 8D 8D 8D C3.BD 8B.71 +\& <SMALL LETTER thorn> FE 8E 8E 8E C3.BE 8B.72 +\& <PLUS\-OR\-MINUS SIGN> B1 8F 8F 8F C2.B1 80.58 +\& <DEGREE SIGN> B0 90 90 90 C2.B0 80.57 +\& j 6A 91 91 91 6A 91 +\& k 6B 92 92 92 6B 92 +\& l 6C 93 93 93 6C 93 +\& m 6D 94 94 94 6D 94 +\& n 6E 95 95 95 6E 95 +\& o 6F 96 96 96 6F 96 +\& p 70 97 97 97 70 97 +\& q 71 98 98 98 71 98 +\& r 72 99 99 99 72 99 +\& <FEMININE ORDINAL> AA 9A 9A 9A C2.AA 80.51 +\& <MASC. ORDINAL INDICATOR> BA 9B 9B 9B C2.BA 80.69 +\& <SMALL LIGATURE ae> E6 9C 9C 9C C3.A6 8B.47 +\& <CEDILLA> B8 9D 9D 9D C2.B8 80.67 +\& <CAPITAL LIGATURE AE> C6 9E 9E 9E C3.86 8A.47 +\& <CURRENCY SIGN> A4 9F 9F 9F C2.A4 80.45 +\& <MICRO SIGN> B5 A0 A0 A0 C2.B5 80.64 +\& ~ 7E A1 A1 FF 7E A1 ## +\& s 73 A2 A2 A2 73 A2 +\& t 74 A3 A3 A3 74 A3 +\& u 75 A4 A4 A4 75 A4 +\& v 76 A5 A5 A5 76 A5 +\& w 77 A6 A6 A6 77 A6 +\& x 78 A7 A7 A7 78 A7 +\& y 79 A8 A8 A8 79 A8 +\& z 7A A9 A9 A9 7A A9 +\& <INVERTED "!" 
> A1 AA AA AA C2.A1 80.42 +\& <INVERTED QUESTION MARK> BF AB AB AB C2.BF 80.73 +\& <CAPITAL LETTER ETH> D0 AC AC AC C3.90 8A.57 +\& [ 5B BA AD BB 5B AD ** ## +\& <CAPITAL LETTER THORN> DE AE AE AE C3.9E 8A.72 +\& <REGISTERED TRADE MARK> AE AF AF AF C2.AE 80.55 +\& <NOT SIGN> AC 5F B0 BA C2.AC 80.53 ** ## +\& <POUND SIGN> A3 B1 B1 B1 C2.A3 80.44 +\& <YEN SIGN> A5 B2 B2 B2 C2.A5 80.46 +\& <MIDDLE DOT> B7 B3 B3 B3 C2.B7 80.66 +\& <COPYRIGHT SIGN> A9 B4 B4 B4 C2.A9 80.4A +\& <SECTION SIGN> A7 B5 B5 B5 C2.A7 80.48 +\& <PARAGRAPH SIGN> B6 B6 B6 B6 C2.B6 80.65 +\& <FRACTION ONE QUARTER> BC B7 B7 B7 C2.BC 80.70 +\& <FRACTION ONE HALF> BD B8 B8 B8 C2.BD 80.71 +\& <FRACTION THREE QUARTERS> BE B9 B9 B9 C2.BE 80.72 +\& <Y WITH ACUTE> DD AD BA AD C3.9D 8A.71 ** ## +\& <DIAERESIS> A8 BD BB 79 C2.A8 80.49 ** ## +\& <MACRON> AF BC BC A1 C2.AF 80.56 ## +\& ] 5D BB BD BD 5D BD ** +\& <ACUTE ACCENT> B4 BE BE BE C2.B4 80.63 +\& <MULTIPLICATION SIGN> D7 BF BF BF C3.97 8A.66 +\& { 7B C0 C0 FB 7B C0 ## +\& A 41 C1 C1 C1 41 C1 +\& B 42 C2 C2 C2 42 C2 +\& C 43 C3 C3 C3 43 C3 +\& D 44 C4 C4 C4 44 C4 +\& E 45 C5 C5 C5 45 C5 +\& F 46 C6 C6 C6 46 C6 +\& G 47 C7 C7 C7 47 C7 +\& H 48 C8 C8 C8 48 C8 +\& I 49 C9 C9 C9 49 C9 +\& <SOFT HYPHEN> AD CA CA CA C2.AD 80.54 +\& <o WITH CIRCUMFLEX> F4 CB CB CB C3.B4 8B.63 +\& <o WITH DIAERESIS> F6 CC CC CC C3.B6 8B.65 +\& <o WITH GRAVE> F2 CD CD CD C3.B2 8B.59 +\& <o WITH ACUTE> F3 CE CE CE C3.B3 8B.62 +\& <o WITH TILDE> F5 CF CF CF C3.B5 8B.64 +\& } 7D D0 D0 FD 7D D0 ## +\& J 4A D1 D1 D1 4A D1 +\& K 4B D2 D2 D2 4B D2 +\& L 4C D3 D3 D3 4C D3 +\& M 4D D4 D4 D4 4D D4 +\& N 4E D5 D5 D5 4E D5 +\& O 4F D6 D6 D6 4F D6 +\& P 50 D7 D7 D7 50 D7 +\& Q 51 D8 D8 D8 51 D8 +\& R 52 D9 D9 D9 52 D9 +\& <SUPERSCRIPT ONE> B9 DA DA DA C2.B9 80.68 +\& <u WITH CIRCUMFLEX> FB DB DB DB C3.BB 8B.6A +\& <u WITH DIAERESIS> FC DC DC DC C3.BC 8B.70 +\& <u WITH GRAVE> F9 DD DD C0 C3.B9 8B.68 ## +\& <u WITH ACUTE> FA DE DE DE C3.BA 8B.69 +\& <y WITH DIAERESIS> FF DF DF DF C3.BF 8B.73 
+\& \e 5C E0 E0 BC 5C E0 ## +\& <DIVISION SIGN> F7 E1 E1 E1 C3.B7 8B.66 +\& S 53 E2 E2 E2 53 E2 +\& T 54 E3 E3 E3 54 E3 +\& U 55 E4 E4 E4 55 E4 +\& V 56 E5 E5 E5 56 E5 +\& W 57 E6 E6 E6 57 E6 +\& X 58 E7 E7 E7 58 E7 +\& Y 59 E8 E8 E8 59 E8 +\& Z 5A E9 E9 E9 5A E9 +\& <SUPERSCRIPT TWO> B2 EA EA EA C2.B2 80.59 +\& <O WITH CIRCUMFLEX> D4 EB EB EB C3.94 8A.63 +\& <O WITH DIAERESIS> D6 EC EC EC C3.96 8A.65 +\& <O WITH GRAVE> D2 ED ED ED C3.92 8A.59 +\& <O WITH ACUTE> D3 EE EE EE C3.93 8A.62 +\& <O WITH TILDE> D5 EF EF EF C3.95 8A.64 +\& 0 30 F0 F0 F0 30 F0 +\& 1 31 F1 F1 F1 31 F1 +\& 2 32 F2 F2 F2 32 F2 +\& 3 33 F3 F3 F3 33 F3 +\& 4 34 F4 F4 F4 34 F4 +\& 5 35 F5 F5 F5 35 F5 +\& 6 36 F6 F6 F6 36 F6 +\& 7 37 F7 F7 F7 37 F7 +\& 8 38 F8 F8 F8 38 F8 +\& 9 39 F9 F9 F9 39 F9 +\& <SUPERSCRIPT THREE> B3 FA FA FA C2.B3 80.62 +\& <U WITH CIRCUMFLEX> DB FB FB DD C3.9B 8A.6A ## +\& <U WITH DIAERESIS> DC FC FC FC C3.9C 8A.70 +\& <U WITH GRAVE> D9 FD FD E0 C3.99 8A.68 ## +\& <U WITH ACUTE> DA FE FE FE C3.9A 8A.69 +\& <APC> 9F FF FF 5F C2.9F FF ## +.Ve +.SH "IDENTIFYING CHARACTER CODE SETS" +.IX Header "IDENTIFYING CHARACTER CODE SETS" +It is possible to determine which character set you are operating under. +But first you need to be really really sure you need to do this. Your +code will be simpler and probably just as portable if you don't have +to test the character set and do different things, depending. There are +actually only very few circumstances where it's not easy to write +straight-line code portable to all character sets. See +"Unicode and EBCDIC" in perluniintro for how to portably specify +characters. +.PP +But there are some cases where you may want to know which character set +you are running under. One possible example is doing +sorting in inner loops where performance is critical. +.PP +To determine if you are running under ASCII or EBCDIC, you can use the +return value of \f(CWord()\fR or \f(CWchr()\fR to test one or more character +values. 
For example: +.PP +.Vb 4 +\& $is_ascii = "A" eq chr(65); +\& $is_ebcdic = "A" eq chr(193); +\& $is_ascii = ord("A") == 65; +\& $is_ebcdic = ord("A") == 193; +.Ve +.PP +There's even less need to distinguish between EBCDIC code pages, but to +do so try looking at one or more of the characters that differ between +them. +.PP +.Vb 4 +\& $is_ascii = ord(\*(Aq[\*(Aq) == 91; +\& $is_ebcdic_37 = ord(\*(Aq[\*(Aq) == 186; +\& $is_ebcdic_1047 = ord(\*(Aq[\*(Aq) == 173; +\& $is_ebcdic_POSIX_BC = ord(\*(Aq[\*(Aq) == 187; +.Ve +.PP +However, it would be unwise to write tests such as: +.PP +.Vb 2 +\& $is_ascii = "\er" ne chr(13); # WRONG +\& $is_ascii = "\en" ne chr(10); # ILL ADVISED +.Ve +.PP +Obviously the first of these will fail to distinguish most ASCII +platforms from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC +platform since \f(CW\*(C`"\er"\ eq\ chr(13)\*(C'\fR under all of those coded character +sets. But note too that because \f(CW"\en"\fR is \f(CWchr(13)\fR and \f(CW"\er"\fR is +\&\f(CWchr(10)\fR on old Macintosh (which is an ASCII platform) the second +\&\f(CW$is_ascii\fR test will lead to trouble there. +.PP +To determine whether or not perl was built under an EBCDIC +code page you can use the Config module like so: +.PP +.Vb 2 +\& use Config; +\& $is_ebcdic = $Config{\*(Aqebcdic\*(Aq} eq \*(Aqdefine\*(Aq; +.Ve +.SH CONVERSIONS +.IX Header "CONVERSIONS" +.ie n .SS "utf8::unicode_to_native() and utf8::native_to_unicode()" +.el .SS "\f(CWutf8::unicode_to_native()\fP and \f(CWutf8::native_to_unicode()\fP" +.IX Subsection "utf8::unicode_to_native() and utf8::native_to_unicode()" +These functions take an input numeric code point in one encoding and +return what its equivalent value is in the other. +.PP +See utf8. 
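A minimal sketch of how the two functions pair up, assuming only that they are available as built-ins (no module needs to be loaded); the variable names are illustrative. On an ASCII platform both functions return their argument unchanged, so the native value below is 65. According to the tables above it would be 193 under CCSID 0037, 1047, or POSIX-BC.

```perl
use strict;
use warnings;

# Unicode code point for LATIN CAPITAL LETTER A.
my $unicode_cp = 0x41;

# Map it to the platform's native code point, then to a character.
my $native_cp = utf8::unicode_to_native($unicode_cp);
my $char      = chr($native_cp);            # "A" on every platform

# Going the other way recovers the Unicode value.
my $round_trip = utf8::native_to_unicode(ord($char));

printf "char=%s native=%d unicode=%d\n", $char, $native_cp, $round_trip;
```

Writing constants as Unicode values and converting at the edges keeps the rest of the program free of code-page tests.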
+.SS tr/// +.IX Subsection "tr///" +In order to convert a string of characters from one character set to +another a simple list of numbers, such as in the right columns in the +above table, along with Perl's \f(CW\*(C`tr///\*(C'\fR operator is all that is needed. +The data in the table are in ASCII/Latin1 order, hence the EBCDIC columns +provide easy-to-use ASCII/Latin1 to EBCDIC operations that are also easily +reversed. +.PP +For example, to convert ASCII/Latin1 to code page 037 take the output of the +second numbers column from the output of recipe 2 (modified to add +\&\f(CW"\e"\fR characters), and use it in \f(CW\*(C`tr///\*(C'\fR like so: +.PP +.Vb 10 +\& $cp_037 = +\& \*(Aq\ex00\ex01\ex02\ex03\ex37\ex2D\ex2E\ex2F\ex16\ex05\ex25\ex0B\ex0C\ex0D\ex0E\ex0F\*(Aq . +\& \*(Aq\ex10\ex11\ex12\ex13\ex3C\ex3D\ex32\ex26\ex18\ex19\ex3F\ex27\ex1C\ex1D\ex1E\ex1F\*(Aq . +\& \*(Aq\ex40\ex5A\ex7F\ex7B\ex5B\ex6C\ex50\ex7D\ex4D\ex5D\ex5C\ex4E\ex6B\ex60\ex4B\ex61\*(Aq . +\& \*(Aq\exF0\exF1\exF2\exF3\exF4\exF5\exF6\exF7\exF8\exF9\ex7A\ex5E\ex4C\ex7E\ex6E\ex6F\*(Aq . +\& \*(Aq\ex7C\exC1\exC2\exC3\exC4\exC5\exC6\exC7\exC8\exC9\exD1\exD2\exD3\exD4\exD5\exD6\*(Aq . +\& \*(Aq\exD7\exD8\exD9\exE2\exE3\exE4\exE5\exE6\exE7\exE8\exE9\exBA\exE0\exBB\exB0\ex6D\*(Aq . +\& \*(Aq\ex79\ex81\ex82\ex83\ex84\ex85\ex86\ex87\ex88\ex89\ex91\ex92\ex93\ex94\ex95\ex96\*(Aq . +\& \*(Aq\ex97\ex98\ex99\exA2\exA3\exA4\exA5\exA6\exA7\exA8\exA9\exC0\ex4F\exD0\exA1\ex07\*(Aq . +\& \*(Aq\ex20\ex21\ex22\ex23\ex24\ex15\ex06\ex17\ex28\ex29\ex2A\ex2B\ex2C\ex09\ex0A\ex1B\*(Aq . +\& \*(Aq\ex30\ex31\ex1A\ex33\ex34\ex35\ex36\ex08\ex38\ex39\ex3A\ex3B\ex04\ex14\ex3E\exFF\*(Aq . +\& \*(Aq\ex41\exAA\ex4A\exB1\ex9F\exB2\ex6A\exB5\exBD\exB4\ex9A\ex8A\ex5F\exCA\exAF\exBC\*(Aq . +\& \*(Aq\ex90\ex8F\exEA\exFA\exBE\exA0\exB6\exB3\ex9D\exDA\ex9B\ex8B\exB7\exB8\exB9\exAB\*(Aq . +\& \*(Aq\ex64\ex65\ex62\ex66\ex63\ex67\ex9E\ex68\ex74\ex71\ex72\ex73\ex78\ex75\ex76\ex77\*(Aq . 
+\& \*(Aq\exAC\ex69\exED\exEE\exEB\exEF\exEC\exBF\ex80\exFD\exFE\exFB\exFC\exAD\exAE\ex59\*(Aq . +\& \*(Aq\ex44\ex45\ex42\ex46\ex43\ex47\ex9C\ex48\ex54\ex51\ex52\ex53\ex58\ex55\ex56\ex57\*(Aq . +\& \*(Aq\ex8C\ex49\exCD\exCE\exCB\exCF\exCC\exE1\ex70\exDD\exDE\exDB\exDC\ex8D\ex8E\exDF\*(Aq; +\& +\& my $ebcdic_string = $ascii_string; +\& eval \*(Aq$ebcdic_string =~ tr/\e000\-\e377/\*(Aq . $cp_037 . \*(Aq/\*(Aq; +.Ve +.PP +To convert from EBCDIC 037 to ASCII just reverse the order of the tr/// +arguments like so: +.PP +.Vb 2 +\& my $ascii_string = $ebcdic_string; +\& eval \*(Aq$ascii_string =~ tr/\*(Aq . $cp_037 . \*(Aq/\e000\-\e377/\*(Aq; +.Ve +.PP +Similarly one could take the output of the third numbers column from recipe 2 +to obtain a \f(CW$cp_1047\fR table. The fourth numbers column of the output from +recipe 2 could provide a \f(CW$cp_posix_bc\fR table suitable for transcoding as +well. +.PP +If you wanted to see the inverse tables, you would first have to sort on the +desired numbers column as in recipes 4, 5 or 6, then take the output of the +first numbers column. +.SS iconv +.IX Subsection "iconv" +XPG operability often implies the presence of an \fIiconv\fR utility +available from the shell or from the C library. Consult your system's +documentation for information on iconv. +.PP +On OS/390 or z/OS see the \fBiconv\fR\|(1) manpage. One way to invoke the \f(CW\*(C`iconv\*(C'\fR +shell utility from within perl would be to: +.PP +.Vb 2 +\& # OS/390 or z/OS example +\& $ascii_data = \`echo \*(Aq$ebcdic_data\*(Aq| iconv \-f IBM\-1047 \-t ISO8859\-1\` +.Ve +.PP +or the inverse map: +.PP +.Vb 2 +\& # OS/390 or z/OS example +\& $ebcdic_data = \`echo \*(Aq$ascii_data\*(Aq| iconv \-f ISO8859\-1 \-t IBM\-1047\` +.Ve +.PP +For other Perl-based conversion options see the \f(CW\*(C`Convert::*\*(C'\fR modules on CPAN. +.SS "C RTL" +.IX Subsection "C RTL" +The OS/390 and z/OS C run-time libraries provide \f(CW_atoe()\fR and \f(CW_etoa()\fR functions. 
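Another Perl-level option, offered as a sketch and not taken from the text above: the core Encode module provides EBCDIC encodings named cp37, cp1047, and posix-bc. The example assumes it is run on an ASCII platform; behaviour on an EBCDIC build of perl is not covered here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "Hello";

# Characters -> CCSID 1047 bytes.
my $ebcdic_bytes = encode("cp1047", $text);

# Show the bytes in hex; compare with the CCSID 1047 column of the
# hex table earlier in this document.
my $hex = join " ", map { sprintf "%02X", ord($_) } split //, $ebcdic_bytes;
print "$hex\n";

# CCSID 1047 bytes -> characters.
my $round_trip = decode("cp1047", $ebcdic_bytes);
print "$round_trip\n";
```

Unlike the tr/// tables, this approach needs no hand-maintained mapping string, at the cost of loading a module.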
+.SH "OPERATOR DIFFERENCES" +.IX Header "OPERATOR DIFFERENCES" +The \f(CW\*(C`..\*(C'\fR range operator treats certain character ranges with +care on EBCDIC platforms. For example, the following array +will have twenty-six elements on either an EBCDIC platform +or an ASCII platform: +.PP +.Vb 1 +\& @alphabet = (\*(AqA\*(Aq..\*(AqZ\*(Aq); # $#alphabet == 25 +.Ve +.PP +The bitwise operators such as & ^ | may return different results +when operating on string or character data in a Perl program running +on an EBCDIC platform than when run on an ASCII platform. Here is +an example adapted from the one in perlop: +.PP +.Vb 5 +\& # EBCDIC\-based examples +\& print "j p \en" ^ " a h"; # prints "JAPH\en" +\& print "JA" | " ph\en"; # prints "japh\en" +\& print "JAPH\enJunk" & "\e277\e277\e277\e277\e277"; # prints "japh\en"; +\& print \*(Aqp N$\*(Aq ^ " E<H\en"; # prints "Perl\en"; +.Ve +.PP +An interesting property of the 32 C0 control characters +in the ASCII table is that they can "literally" be constructed +as control characters in Perl, e.g. (\f(CW\*(C`chr(0)\*(C'\fR eq \f(CW\*(C`\ec@\*(C'\fR), +(\f(CW\*(C`chr(1)\*(C'\fR eq \f(CW\*(C`\ecA\*(C'\fR), and so on. Perl on EBCDIC platforms has been +ported to take \f(CW\*(C`\ec@\*(C'\fR to \f(CWchr(0)\fR and \f(CW\*(C`\ecA\*(C'\fR to \f(CWchr(1)\fR, etc. as well, but the +characters that result depend on which code page you are +using. The table below uses the standard acronyms for the controls. +The POSIX-BC and 1047 sets are +identical throughout this range and differ from the 0037 set at only +one spot (21 decimal). Note that the line terminator character +may be generated by \f(CW\*(C`\ecJ\*(C'\fR on ASCII platforms but by \f(CW\*(C`\ecU\*(C'\fR on 1047 or POSIX-BC +platforms and cannot be generated as a \f(CW"\ec.letter."\fR control character on +0037 platforms. Note also that \f(CW\*(C`\ec\e\*(C'\fR cannot be the final element in a string +or regex, as it will absorb the terminator. 
But \f(CW\*(C`\ec\e\fR\f(CIX\fR\f(CW\*(C'\fR is a \f(CW\*(C`FILE +SEPARATOR\*(C'\fR concatenated with \fIX\fR for all \fIX\fR. +The outlier \f(CW\*(C`\ec?\*(C'\fR on ASCII, which yields a non\-C0 control \f(CW\*(C`DEL\*(C'\fR, +yields the outlier control \f(CW\*(C`APC\*(C'\fR on EBCDIC, the one that isn't in the +block of contiguous controls. A subtlety here is that +\&\f(CW\*(C`\ec?\*(C'\fR on ASCII platforms is an ASCII character, while it isn't +equivalent to any ASCII character on EBCDIC platforms. +.PP +.Vb 10 +\& chr ord 8859\-1 0037 1047 && POSIX\-BC +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& \ec@ 0 <NUL> <NUL> <NUL> +\& \ecA 1 <SOH> <SOH> <SOH> +\& \ecB 2 <STX> <STX> <STX> +\& \ecC 3 <ETX> <ETX> <ETX> +\& \ecD 4 <EOT> <ST> <ST> +\& \ecE 5 <ENQ> <HT> <HT> +\& \ecF 6 <ACK> <SSA> <SSA> +\& \ecG 7 <BEL> <DEL> <DEL> +\& \ecH 8 <BS> <EPA> <EPA> +\& \ecI 9 <HT> <RI> <RI> +\& \ecJ 10 <LF> <SS2> <SS2> +\& \ecK 11 <VT> <VT> <VT> +\& \ecL 12 <FF> <FF> <FF> +\& \ecM 13 <CR> <CR> <CR> +\& \ecN 14 <SO> <SO> <SO> +\& \ecO 15 <SI> <SI> <SI> +\& \ecP 16 <DLE> <DLE> <DLE> +\& \ecQ 17 <DC1> <DC1> <DC1> +\& \ecR 18 <DC2> <DC2> <DC2> +\& \ecS 19 <DC3> <DC3> <DC3> +\& \ecT 20 <DC4> <OSC> <OSC> +\& \ecU 21 <NAK> <NEL> <LF> ** +\& \ecV 22 <SYN> <BS> <BS> +\& \ecW 23 <ETB> <ESA> <ESA> +\& \ecX 24 <CAN> <CAN> <CAN> +\& \ecY 25 <EOM> <EOM> <EOM> +\& \ecZ 26 <SUB> <PU2> <PU2> +\& \ec[ 27 <ESC> <SS3> <SS3> +\& \ec\eX 28 <FS>X <FS>X <FS>X +\& \ec] 29 <GS> <GS> <GS> +\& \ec^ 30 <RS> <RS> <RS> +\& \ec_ 31 <US> <US> <US> +\& \ec? * <DEL> <APC> <APC> +.Ve +.PP +\&\f(CW\*(C`*\*(C'\fR Note: \f(CW\*(C`\ec?\*(C'\fR maps to ordinal 127 (\f(CW\*(C`DEL\*(C'\fR) on ASCII platforms, but +since ordinal 127 is not a control character on EBCDIC machines, +\&\f(CW\*(C`\ec?\*(C'\fR instead maps on them to \f(CW\*(C`APC\*(C'\fR, which is 255 in 0037 and 1047, +and 95 in POSIX-BC. 
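The mapping in the table can be probed directly. The sketch below uses only built-ins, and its variable names are illustrative. It shows that \cA always has ordinal 1, and it uses the \N{U+...} notation to name LF and DEL independently of the code page.

```perl
use strict;
use warnings;

# "\cA" has ordinal 1 on every platform; the table shows it is SOH
# in all three columns.
my $ctrl_a_ord = ord("\cA");

# Name controls by Unicode code point to stay code-page independent.
my $lf  = "\N{U+0A}";    # LINE FEED
my $del = "\N{U+7F}";    # DELETE

# True on ASCII platforms; per the table, "\c?" is APC on EBCDIC.
my $ctrl_qmark_is_del = ("\c?" eq $del) ? 1 : 0;

printf "ord(\\cA)=%d ord(LF)=%d \\c? is DEL: %d\n",
    $ctrl_a_ord, ord($lf), $ctrl_qmark_is_del;
```

Code that needs a specific control, such as a line feed for a network protocol, is safer naming it by Unicode code point than by \c notation.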
+.SH "FUNCTION DIFFERENCES" +.IX Header "FUNCTION DIFFERENCES" +.ie n .IP chr() 8 +.el .IP \f(CWchr()\fR 8 +.IX Item "chr()" +\&\f(CWchr()\fR must be given an EBCDIC code number argument to yield a desired +character return value on an EBCDIC platform. For example: +.Sp +.Vb 1 +\& $CAPITAL_LETTER_A = chr(193); +.Ve +.ie n .IP ord() 8 +.el .IP \f(CWord()\fR 8 +.IX Item "ord()" +\&\f(CWord()\fR will return EBCDIC code number values on an EBCDIC platform. +For example: +.Sp +.Vb 1 +\& $the_number_193 = ord("A"); +.Ve +.ie n .IP pack() 8 +.el .IP \f(CWpack()\fR 8 +.IX Item "pack()" +The \f(CW"c"\fR and \f(CW"C"\fR templates for \f(CWpack()\fR are dependent upon character set +encoding. Examples of usage on EBCDIC include: +.Sp +.Vb 4 +\& $foo = pack("CCCC",193,194,195,196); +\& # $foo eq "ABCD" +\& $foo = pack("C4",193,194,195,196); +\& # same thing +\& +\& $foo = pack("ccxxcc",193,194,195,196); +\& # $foo eq "AB\e0\e0CD" +.Ve +.Sp +The \f(CW"U"\fR template has been ported to mean "Unicode" on all platforms so +that +.Sp +.Vb 1 +\& pack("U", 65) eq \*(AqA\*(Aq +.Ve +.Sp +is true on all platforms. If you want native code points for the low +256, use the \f(CW"W"\fR template. This means that the equivalences +.Sp +.Vb 2 +\& pack("W", ord($character)) eq $character +\& unpack("W", $character) == ord $character +.Ve +.Sp +will hold. +.ie n .IP print() 8 +.el .IP \f(CWprint()\fR 8 +.IX Item "print()" +One must be careful with scalars and strings that are passed to +print that contain ASCII encodings. One common place +for this to occur is in the output of the MIME type header for +CGI script writing. For example, many Perl programming guides +recommend something similar to: +.Sp +.Vb 2 +\& print "Content\-type:\ettext/html\e015\e012\e015\e012"; +\& # this may be wrong on EBCDIC +.Ve +.Sp +You can instead write +.Sp +.Vb 1 +\& print "Content\-type:\ettext/html\er\en\er\en"; # OK for DGW et al +.Ve +.Sp +and have it work portably. 
+.Sp +That is because the translation from EBCDIC to ASCII is done +by the web server in this case. Consult your web server's documentation for +further details. +.ie n .IP printf() 8 +.el .IP \f(CWprintf()\fR 8 +.IX Item "printf()" +The formats that can convert characters to numbers and vice versa +will be different from their ASCII counterparts when executed +on an EBCDIC platform. Examples include: +.Sp +.Vb 1 +\& printf("%c%c%c",193,194,195); # prints ABC +.Ve +.ie n .IP sort() 8 +.el .IP \f(CWsort()\fR 8 +.IX Item "sort()" +EBCDIC sort results may differ from ASCII sort results especially for +mixed case strings. This is discussed in more detail below. +.ie n .IP sprintf() 8 +.el .IP \f(CWsprintf()\fR 8 +.IX Item "sprintf()" +See the discussion of \f(CW"printf()"\fR above. An example of the use +of sprintf would be: +.Sp +.Vb 1 +\& $CAPITAL_LETTER_A = sprintf("%c",193); +.Ve +.ie n .IP unpack() 8 +.el .IP \f(CWunpack()\fR 8 +.IX Item "unpack()" +See the discussion of \f(CW"pack()"\fR above. +.PP +Note that it is possible to write portable code for these by specifying +things in Unicode numbers, and using a conversion function: +.PP +.Vb 3 +\& printf("%c",utf8::unicode_to_native(65)); # prints A on all +\& # platforms +\& print utf8::native_to_unicode(ord("A")); # Likewise, prints 65 +.Ve +.PP +See "Unicode and EBCDIC" in perluniintro and "CONVERSIONS" +for other options. +.SH "REGULAR EXPRESSION DIFFERENCES" +.IX Header "REGULAR EXPRESSION DIFFERENCES" +You can write your regular expressions just like someone on an ASCII +platform would do. But keep in mind that using octal or hex notation to +specify a particular code point will give you the character that the +EBCDIC code page natively maps to it. (This is also true of all +double-quoted strings.) If you want to write portably, just use the +\&\f(CW\*(C`\eN{U+...}\*(C'\fR notation everywhere where you would have used \f(CW\*(C`\ex{...}\*(C'\fR, +and don't use octal notation at all. 
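As a short illustration of the advice above (a sketch; the variable names are illustrative): \ex41 names native code point 0x41, which is "A" on ASCII but, per the hex table earlier, NON-BREAKING SPACE under CCSID 0037 and 1047, whereas \eN{U+41} names "A" everywhere.

```perl
use strict;
use warnings;

my $portable = qr/\N{U+41}/;   # Unicode code point: always "A"
my $native   = qr/\x41/;       # native code point 0x41: "A" only on ASCII

my $portable_ok = ("A" =~ $portable) ? 1 : 0;   # 1 on every platform
my $native_ok   = ("A" =~ $native)   ? 1 : 0;   # 1 on ASCII, 0 on EBCDIC

# The same holds for double-quoted strings.
my $same_char = ("\N{U+41}" eq chr(utf8::unicode_to_native(0x41))) ? 1 : 0;

print "portable=$portable_ok native=$native_ok same=$same_char\n";
```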
+.PP +Starting in Perl v5.22, this applies to ranges in bracketed character +classes. If you say, for example, \f(CW\*(C`qr/[\eN{U+20}\-\eN{U+7F}]/\*(C'\fR, it means +the characters \f(CW\*(C`\eN{U+20}\*(C'\fR, \f(CW\*(C`\eN{U+21}\*(C'\fR, ..., \f(CW\*(C`\eN{U+7F}\*(C'\fR. This range +is all the printable characters that the ASCII character set contains. +.PP +Prior to v5.22, you couldn't specify any ranges portably, except +(starting in Perl v5.5.3) all subsets of the \f(CW\*(C`[A\-Z]\*(C'\fR and \f(CW\*(C`[a\-z]\*(C'\fR +ranges are specially coded to not pick up gap characters. For example, +characters such as "ô" (\f(CW\*(C`o WITH CIRCUMFLEX\*(C'\fR) that lie between +"I" and "J" would not be matched by the regular expression range +\&\f(CW\*(C`/[H\-K]/\*(C'\fR. But if either of the range end points is explicitly numeric +(and neither is specified by \f(CW\*(C`\eN{U+...}\*(C'\fR), the gap characters are +matched: +.PP +.Vb 1 +\& /[\ex89\-\ex91]/ +.Ve +.PP +will match \f(CW\*(C`\ex8e\*(C'\fR, even though \f(CW\*(C`\ex89\*(C'\fR is "i" and \f(CW\*(C`\ex91 \*(C'\fR is "j", +and \f(CW\*(C`\ex8e\*(C'\fR is a gap character, from the alphabetic viewpoint. +.PP +Another construct to be wary of is the inappropriate use of hex (unless +you use \f(CW\*(C`\eN{U+...}\*(C'\fR) or +octal constants in regular expressions. Consider the following +set of subs: +.PP +.Vb 4 +\& sub is_c0 { +\& my $char = substr(shift,0,1); +\& $char =~ /[\e000\-\e037]/; +\& } +\& +\& sub is_print_ascii { +\& my $char = substr(shift,0,1); +\& $char =~ /[\e040\-\e176]/; +\& } +\& +\& sub is_delete { +\& my $char = substr(shift,0,1); +\& $char eq "\e177"; +\& } +\& +\& sub is_c1 { +\& my $char = substr(shift,0,1); +\& $char =~ /[\e200\-\e237]/; +\& } +\& +\& sub is_latin_1 { # But not ASCII; not C1 +\& my $char = substr(shift,0,1); +\& $char =~ /[\e240\-\e377]/; +\& } +.Ve +.PP +These are valid only on ASCII platforms. 
Starting in Perl v5.22, simply +changing the octal constants to equivalent \f(CW\*(C`\eN{U+...}\*(C'\fR values makes +them portable: +.PP +.Vb 4 +\& sub is_c0 { +\& my $char = substr(shift,0,1); +\& $char =~ /[\eN{U+00}\-\eN{U+1F}]/; +\& } +\& +\& sub is_print_ascii { +\& my $char = substr(shift,0,1); +\& $char =~ /[\eN{U+20}\-\eN{U+7E}]/; +\& } +\& +\& sub is_delete { +\& my $char = substr(shift,0,1); +\& $char eq "\eN{U+7F}"; +\& } +\& +\& sub is_c1 { +\& my $char = substr(shift,0,1); +\& $char =~ /[\eN{U+80}\-\eN{U+9F}]/; +\& } +\& +\& sub is_latin_1 { # But not ASCII; not C1 +\& my $char = substr(shift,0,1); +\& $char =~ /[\eN{U+A0}\-\eN{U+FF}]/; +\& } +.Ve +.PP +And here are some alternative portable ways to write them: +.PP +.Vb 3 +\& sub Is_c0 { +\& my $char = substr(shift,0,1); +\& return $char =~ /[[:cntrl:]]/a && ! Is_delete($char); +\& +\& # Alternatively: +\& # return $char =~ /[[:cntrl:]]/ +\& # && $char =~ /[[:ascii:]]/ +\& # && ! Is_delete($char); +\& } +\& +\& sub Is_print_ascii { +\& my $char = substr(shift,0,1); +\& +\& return $char =~ /[[:print:]]/a; +\& +\& # Alternatively: +\& # return $char =~ /[[:print:]]/ && $char =~ /[[:ascii:]]/; +\& +\& # Or +\& # return $char +\& # =~ /[ !"\e#\e$%&\*(Aq()*+,\e\-.\e/0\-9:;<=>?\e@A\-Z[\e\e\e]^_\`a\-z{|}~]/; +\& } +\& +\& sub Is_delete { +\& my $char = substr(shift,0,1); +\& return utf8::native_to_unicode(ord $char) == 0x7F; +\& } +\& +\& sub Is_c1 { +\& use feature \*(Aqunicode_strings\*(Aq; +\& my $char = substr(shift,0,1); +\& return $char =~ /[[:cntrl:]]/ && $char !~ /[[:ascii:]]/; +\& } +\& +\& sub Is_latin_1 { # But not ASCII; not C1 +\& use feature \*(Aqunicode_strings\*(Aq; +\& my $char = substr(shift,0,1); +\& return ord($char) < 256 +\& && $char !~ /[[:ascii:]]/ +\& && $char !~ /[[:cntrl:]]/; +\& } +.Ve +.PP +Another way to write \f(CWIs_latin_1()\fR would be +to use the characters in the range explicitly: +.PP +.Vb 5 +\& sub Is_latin_1 { +\& my $char = substr(shift,0,1); +\& $char =~ /[\ 
¡¢£¤¥¦§¨©ª«¬\%®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ] +\& [ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/x; +\& } +.Ve +.PP +That form may run into trouble in network transit (due to the +presence of 8-bit characters) or on non-ISO-Latin character sets, but +it does allow \f(CW\*(C`Is_c1\*(C'\fR to be rewritten so it works on Perls that don't +have \f(CW\*(Aqunicode_strings\*(Aq\fR (earlier than v5.14): +.PP +.Vb 6 +\& sub Is_c1 { # C1 controls only +\& my $char = substr(shift,0,1); +\& return ord($char) < 256 +\& && $char !~ /[[:ascii:]]/ +\& && ! Is_latin_1($char); +\& } +.Ve +.SH SOCKETS +.IX Header "SOCKETS" +Most socket programming assumes ASCII character encodings in network +byte order. Exceptions can include CGI script writing under a +host web server where the server may take care of translation for you. +Most host web servers convert EBCDIC data to ISO\-8859\-1 or Unicode on +output. +.SH SORTING +.IX Header "SORTING" +One big difference between ASCII-based character sets and EBCDIC ones +is the relative positions of the characters when sorted in native +order. Of most concern are the upper\- and lowercase letters, the +digits, and the underscore (\f(CW"_"\fR). On ASCII platforms the native sort +order has the digits come before the uppercase letters which come before +the underscore which comes before the lowercase letters. On EBCDIC, the +underscore comes first, then the lowercase letters, then the uppercase +ones, and the digits last. If sorted on an ASCII-based platform, the +two-letter abbreviation for a physician comes before the two-letter +abbreviation for drive; that is: +.PP +.Vb 2 +\& @sorted = sort(qw(Dr. dr.)); # @sorted holds (\*(AqDr.\*(Aq,\*(Aqdr.\*(Aq) on ASCII, +\& # but (\*(Aqdr.\*(Aq,\*(AqDr.\*(Aq) on EBCDIC +.Ve +.PP +The property of lowercase before uppercase letters in EBCDIC is +even carried to the Latin 1 EBCDIC pages such as 0037 and 1047. 
+An example would be that "Ë" (\f(CW\*(C`E WITH DIAERESIS\*(C'\fR, 203) comes +before "ë" (\f(CW\*(C`e WITH DIAERESIS\*(C'\fR, 235) on an ASCII platform, but +the latter (83) comes before the former (115) on an EBCDIC platform. +(Astute readers will note that the uppercase version of "ß" +\&\f(CW\*(C`SMALL LETTER SHARP S\*(C'\fR is simply "SS" and that the upper case versions +of "ÿ" (small \f(CW\*(C`y WITH DIAERESIS\*(C'\fR) and "µ" (\f(CW\*(C`MICRO SIGN\*(C'\fR) +are not in the 0..255 range but are in Unicode, in a Unicode enabled +Perl). +.PP +The sort order will cause differences between results obtained on +ASCII platforms versus EBCDIC platforms. What follows are some suggestions +on how to deal with these differences. +.SS "Ignore ASCII vs. EBCDIC sort differences." +.IX Subsection "Ignore ASCII vs. EBCDIC sort differences." +This is the least computationally expensive strategy. It may require +some user education. +.SS "Use a sort helper function" +.IX Subsection "Use a sort helper function" +This is completely general, but the most computationally expensive +strategy. Choose one or the other character set and transform to that +for every sort comparison. Here's a complete example that transforms +to ASCII sort order: +.PP +.Vb 2 +\& sub native_to_uni($) { +\& my $string = shift; +\& +\& # Saves time on an ASCII platform +\& return $string if ord \*(AqA\*(Aq == 65; +\& +\& my $output = ""; +\& for my $i (0 .. 
length($string) \- 1) { +\& $output +\& .= chr(utf8::native_to_unicode(ord(substr($string, $i, 1)))); +\& } +\& +\& # Preserve utf8ness of input onto the output, even if it didn\*(Aqt need +\& # to be utf8 +\& utf8::upgrade($output) if utf8::is_utf8($string); +\& +\& return $output; +\& } +\& +\& sub ascii_order { # Sort helper +\& return native_to_uni($a) cmp native_to_uni($b); +\& } +\& +\& sort ascii_order @list; +.Ve +.SS "MONO CASE then sort data (for non-digits, non-underscore)" +.IX Subsection "MONO CASE then sort data (for non-digits, non-underscore)" +If you don't care about where digits and underscore sort to, you can do +something like this: +.PP +.Vb 3 +\& sub case_insensitive_order { # Sort helper +\& return lc($a) cmp lc($b); +\& } +\& +\& sort case_insensitive_order @list; +.Ve +.PP +If performance is an issue, and you don't care if the output is in the +same case as the input, use \f(CW\*(C`tr///\*(C'\fR to transform to the case most +employed within the data. If the data are primarily UPPERCASE +non\-Latin1, then apply \f(CW\*(C`tr/a\-z/A\-Z/\*(C'\fR, and then \f(CWsort()\fR. If the +data are primarily lowercase non\-Latin1, then apply \f(CW\*(C`tr/A\-Z/a\-z/\*(C'\fR +before sorting. If the data are primarily UPPERCASE and include Latin\-1 +characters then apply: +.PP +.Vb 3 +\& tr/a\-z/A\-Z/; +\& tr/àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ/ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ/; +\& s/ß/SS/g; +.Ve +.PP +then \f(CWsort()\fR. If you have a choice, it's better to lowercase things +to avoid the problems of the two Latin\-1 characters whose uppercase is +outside Latin\-1: "ÿ" (small \f(CW\*(C`y WITH DIAERESIS\*(C'\fR) and "µ" +(\f(CW\*(C`MICRO SIGN\*(C'\fR). If you do need to uppercase, you can; with a +Unicode-enabled Perl, do: +.PP +.Vb 2 +\& tr/ÿ/\ex{178}/; +\& tr/µ/\ex{39C}/; +.Ve +.SS "Perform sorting on one type of platform only." +.IX Subsection "Perform sorting on one type of platform only." +This strategy can employ a network connection. 
As such +it would be computationally expensive. +.SH "TRANSFORMATION FORMATS" +.IX Header "TRANSFORMATION FORMATS" +There are a variety of ways of transforming data with an intra-character-set +mapping that serve a variety of purposes. Sorting was discussed in the +previous section and a few of the other more popular mapping techniques are +discussed next. +.SS "URL decoding and encoding" +.IX Subsection "URL decoding and encoding" +Note that some URLs have hexadecimal ASCII code points in them in an +attempt to overcome character or protocol limitation issues. For example, +the tilde character is not on every keyboard, hence a URL of the form: +.PP +.Vb 1 +\& http://www.pvhp.com/~pvhp/ +.Ve +.PP +may also be expressed as either of: +.PP +.Vb 1 +\& http://www.pvhp.com/%7Epvhp/ +\& +\& http://www.pvhp.com/%7epvhp/ +.Ve +.PP +where 7E is the hexadecimal ASCII code point for "~". Here is an example +of decoding such a URL in any EBCDIC code page: +.PP +.Vb 3 +\& $url = \*(Aqhttp://www.pvhp.com/%7Epvhp/\*(Aq; +\& $url =~ s/%([0\-9a\-fA\-F]{2})/ +\& pack("C",utf8::unicode_to_native(hex($1)))/xge; +.Ve +.PP +Conversely, here is a partial solution for the task of encoding such +a URL in any EBCDIC code page: +.PP +.Vb 5 +\& $url = \*(Aqhttp://www.pvhp.com/~pvhp/\*(Aq; +\& # The following regular expression does not address the +\& # mappings for: (\*(Aq.\*(Aq => \*(Aq%2E\*(Aq, \*(Aq/\*(Aq => \*(Aq%2F\*(Aq, \*(Aq:\*(Aq => \*(Aq%3A\*(Aq) +\& $url =~ s/([\et "#%&\e(\e),;<=>\e?\e@\e[\e\e\e]^\`{|}~])/ +\& sprintf("%%%02X",utf8::native_to_unicode(ord($1)))/xge; +.Ve +.PP +where a more complete solution would split the URL into components +and apply a full s/// substitution only to the appropriate parts. +.SS "uu encoding and decoding" +.IX Subsection "uu encoding and decoding" +The \f(CW\*(C`u\*(C'\fR template to \f(CWpack()\fR or \f(CWunpack()\fR will render EBCDIC data in +EBCDIC characters equivalent to their ASCII counterparts. 
For example, +the following will print "Yes indeed\en" on either an ASCII or EBCDIC +computer: +.PP +.Vb 10 +\& $all_byte_chrs = \*(Aq\*(Aq; +\& for (0..255) { $all_byte_chrs .= chr($_); } +\& $uuencode_byte_chrs = pack(\*(Aqu\*(Aq, $all_byte_chrs); +\& ($uu = <<\*(AqENDOFHEREDOC\*(Aq) =~ s/^\es*//gm; +\& M\`\`$"\`P0%!@<("0H+#\`T.#Q\`1$A,4%187&!D:&QP=\*(AqA\e@(2(C)"4F)R@I*BLL +\& M+2XO,#$R,S0U\-C<X.3H[/#T^/T!!0D\-$149\*(Aq2$E*2TQ\-3D]045)35%565UA9 +\& M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G\-T=79W>\*(AqEZ>WQ]?G^\`@8*#A(6& +\& MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S +\& MM+6VM[BYNKN\eO;Z_P,\*(Aq"P\e3%QL?(R<K+S,W.S]#1TM/4U=;7V\-G:V]S=WM_@ +\& ?X>+CY.7FY^CIZNOL[>[O\e/\*(AqR\e_3U]O?X^?K[_/W^_P\`\` +\& ENDOFHEREDOC +\& if ($uuencode_byte_chrs eq $uu) { +\& print "Yes "; +\& } +\& $uudecode_byte_chrs = unpack(\*(Aqu\*(Aq, $uuencode_byte_chrs); +\& if ($uudecode_byte_chrs eq $all_byte_chrs) { +\& print "indeed\en"; +\& } +.Ve +.PP +Here is a very spartan uudecoder that will work on EBCDIC: +.PP +.Vb 10 +\& #!/usr/local/bin/perl +\& $_ = <> until ($mode,$file) = /^begin\es*(\ed*)\es*(\eS*)/; +\& open(OUT, "> $file") if $file ne ""; +\& while(<>) { +\& last if /^end/; +\& next if /[a\-z]/; +\& next unless int((((utf8::native_to_unicode(ord()) \- 32 ) & 077) +\& + 2) / 3) +\& == int(length() / 4); +\& print OUT unpack("u", $_); +\& } +\& close(OUT); +\& chmod oct($mode), $file; +.Ve +.SS "Quoted-Printable encoding and decoding" +.IX Subsection "Quoted-Printable encoding and decoding" +On ASCII-encoded platforms it is possible to strip characters outside of +the printable set using: +.PP +.Vb 3 +\& # This QP encoder works on ASCII only +\& $qp_string =~ s/([=\ex00\-\ex1F\ex80\-\exFF])/ +\& sprintf("=%02X",ord($1))/xge; +.Ve +.PP +Starting in Perl v5.22, this is trivially changeable to work portably on +both ASCII and EBCDIC platforms. 
+.PP +.Vb 3 +\& # This QP encoder works on both ASCII and EBCDIC +\& $qp_string =~ s/([=\eN{U+00}\-\eN{U+1F}\eN{U+80}\-\eN{U+FF}])/ +\& sprintf("=%02X",utf8::native_to_unicode(ord($1)))/xge; +.Ve +.PP +For earlier Perls, a QP encoder that works on both ASCII and EBCDIC +platforms would look somewhat like the following: +.PP +.Vb 4 +\& $delete = chr utf8::unicode_to_native(0x7F); +\& $qp_string =~ +\& s/(=|[^[:print:]$delete])/ +\& sprintf("=%02X",utf8::native_to_unicode(ord($1)))/xage; +.Ve +.PP +(although in production code the substitutions might be done +in the EBCDIC branch with the function call and separately in the +ASCII branch without the expense of the identity map; in Perl v5.22, the +identity map is optimized out so there is no expense, but the +alternative above is simpler and is also available in v5.22). +.PP +Such QP strings can be decoded with: +.PP +.Vb 3 +\& # This QP decoder is limited to ASCII only +\& $string =~ s/=([[:xdigit:]][[:xdigit:]])/chr hex $1/ge; +\& $string =~ s/=[\en\er]+$//; +.Ve +.PP +Whereas a QP decoder that works on both ASCII and EBCDIC platforms +would look somewhat like the following: +.PP +.Vb 3 +\& $string =~ s/=([[:xdigit:]][[:xdigit:]])/ +\& chr utf8::unicode_to_native(hex $1)/xge; +\& $string =~ s/=[\en\er]+$//; +.Ve +.SS "Caesarean ciphers" +.IX Subsection "Caesarean ciphers" +The practice of shifting an alphabet one or more characters for encipherment +dates back thousands of years and was explicitly detailed by Gaius Julius +Caesar in his \fBGallic Wars\fR text. A single alphabet shift is sometimes +referred to as a rotation and the shift amount is given as a number \f(CW$n\fR after +the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps +on the 26\-letter English version of the Latin alphabet. Rot13 has the +interesting property that alternate subsequent invocations are identity maps +(thus rot13 is its own non-trivial inverse in the group of 26 alphabet +rotations). 
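The self-inverse property is easy to check. The following is a minimal sketch rather than one of the original recipes: the helper name rot13 is hypothetical, and the sketch assumes that tr/// ranges such as A\-M and n\-z cover only letters on EBCDIC as well as on ASCII platforms.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original text): apply rot13 once
# and return the rotated copy, leaving the argument untouched.
sub rot13 {
    my ($text) = @_;
    (my $rotated = $text) =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $rotated;
}

my $plain  = "Hello, World!";
my $cipher = rot13($plain);     # first application enciphers
my $again  = rot13($cipher);    # second application deciphers

print "$cipher\n";
print(($again eq $plain ? "round trip ok" : "round trip failed"), "\n");
```

Only the 52 letters are rotated; digits, punctuation, and whitespace pass through unchanged, so the same call serves as both encoder and decoder.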
Hence the following is a rot13 encoder and decoder that will +work on ASCII and EBCDIC platforms: +.PP +.Vb 1 +\& #!/usr/local/bin/perl +\& +\& while(<>){ +\& tr/n\-za\-mN\-ZA\-M/a\-zA\-Z/; +\& print; +\& } +.Ve +.PP +In one-liner form: +.PP +.Vb 1 +\& perl \-ne \*(Aqtr/n\-za\-mN\-ZA\-M/a\-zA\-Z/;print\*(Aq +.Ve +.SH "Hashing order and checksums" +.IX Header "Hashing order and checksums" +Perl deliberately randomizes hash order for security purposes on both +ASCII and EBCDIC platforms. +.PP +A checksum of a file in EBCDIC will differ from the checksum of the +same file translated into ASCII. +.SH "I18N AND L10N" +.IX Header "I18N AND L10N" +Internationalization (I18N) and localization (L10N) are supported at least +in principle even on EBCDIC platforms. The details are system-dependent +and discussed under the "OS ISSUES" section below. +.SH "MULTI-OCTET CHARACTER SETS" +.IX Header "MULTI-OCTET CHARACTER SETS" +Perl works with UTF-EBCDIC, a multi-byte encoding. In Perls earlier +than v5.22, there may be various bugs in this regard. +.PP +Legacy multi-byte EBCDIC code pages XXX. +.SH "OS ISSUES" +.IX Header "OS ISSUES" +There may be a few system-dependent issues +of concern to EBCDIC Perl programmers. +.SS OS/400 +.IX Subsection "OS/400" +.IP PASE 8 +.IX Item "PASE" +The PASE environment is a runtime environment for OS/400 that can run +executables built for PowerPC AIX in OS/400; see perlos400. PASE +is ASCII-based, not EBCDIC-based like the ILE. +.IP "IFS access" 8 +.IX Item "IFS access" +XXX. +.SS "OS/390, z/OS" +.IX Subsection "OS/390, z/OS" +Perl runs under Unix System Services (USS). +.ie n .IP """sigaction""" 8 +.el .IP \f(CWsigaction\fR 8 +.IX Item "sigaction" +Using \f(CW\*(C`SA_SIGINFO\*(C'\fR can cause segmentation faults. +.ie n .IP """chcp""" 8 +.el .IP \f(CWchcp\fR 8 +.IX Item "chcp" +\&\fBchcp\fR is supported as a shell utility for displaying and changing +one's code page. See also \fBchcp\fR\|(1). 
+.IP "dataset access" 8 +.IX Item "dataset access" +For sequential data set access try: +.Sp +.Vb 1 +\& my @ds_records = \`cat //DSNAME\`; +.Ve +.Sp +or: +.Sp +.Vb 1 +\& my @ds_records = \`cat //\*(AqHLQ.DSNAME\*(Aq\`; +.Ve +.Sp +See also the OS390::Stdio module on CPAN. +.ie n .IP """iconv""" 8 +.el .IP \f(CWiconv\fR 8 +.IX Item "iconv" +\&\fBiconv\fR is supported as both a shell utility and a C RTL routine. +See also the \fBiconv\fR\|(1) and \fBiconv\fR\|(3) manual pages. +.IP locales 8 +.IX Item "locales" +Locales are supported. There may be glitches when a locale is another +EBCDIC code page which has some of the +code-page variant characters in other +positions. +.Sp +There aren't currently any real UTF\-8 locales, even though some locale +names contain the string "UTF\-8". +.Sp +See perllocale for information on locales. The L10N files +are in \fI/usr/nls/locale\fR. \f(CW$Config{d_setlocale}\fR is \f(CW\*(Aqdefine\*(Aq\fR on +OS/390 or z/OS. +.SS POSIX-BC? +.IX Subsection "POSIX-BC?" +XXX. +.SH BUGS +.IX Header "BUGS" +.IP \(bu 4 +Not all shells will allow multiple \f(CW\*(C`\-e\*(C'\fR string arguments to perl to +be concatenated together properly, as recipes +0, 2, 4, 5, and 6 in this document might +seem to imply. +.IP \(bu 4 +There are a significant number of test failures in the CPAN modules +shipped with Perl v5.22 and v5.24. These are only in modules not primarily +maintained by Perl 5 Porters. Some of these are failures in the tests +only: the tests don't realize that it is proper to get different results on +EBCDIC platforms. And some of the failures are real bugs. If you +compile and do a \f(CW\*(C`make test\*(C'\fR on Perl, all tests in the \f(CW\*(C`/cpan\*(C'\fR +directory are skipped. +.Sp +Encode partially works. +.IP \(bu 4 +In earlier Perl versions, when byte and character data were +concatenated, the new string was sometimes created by +decoding the byte strings as \fIISO 8859\-1 (Latin\-1)\fR, even if the +old Unicode string used EBCDIC. 
+.SH "SEE ALSO" +.IX Header "SEE ALSO" +perllocale, perlfunc, perlunicode, utf8. +.SH REFERENCES +.IX Header "REFERENCES" +<http://std.dkuug.dk/i18n/charmaps> +.PP +<https://www.unicode.org/> +.PP +<https://www.unicode.org/reports/tr16/> +.PP +<https://www.sr\-ix.com/Archive/CharCodeHist/index.html> +\&\fBASCII: American Standard Code for Information Infiltration\fR Tom Jennings, +September 1999. +.PP +\&\fBThe Unicode Standard, Version 3.0\fR The Unicode Consortium, Lisa Moore ed., +ISBN 0\-201\-61633\-5, Addison Wesley Developers Press, February 2000. +.PP +\&\fBCDRA: IBM \- Character Data Representation Architecture \- +Reference and Registry\fR, IBM SC09\-2190\-00, December 1996. +.PP +"Demystifying Character Sets", Andrea Vine, Multilingual Computing +& Technology, \fB#26 Vol. 10 Issue 4\fR, August/September 1999; +ISSN 1523\-0309; Multilingual Computing Inc. Sandpoint ID, USA. +.PP +\&\fBCodes, Ciphers, and Other Cryptic and Clandestine Communication\fR +Fred B. Wrixon, ISBN 1\-57912\-040\-7, Black Dog & Leventhal Publishers, +1998. +.PP +<http://www.bobbemer.com/P\-BIT.HTM> +\&\fBIBM \- EBCDIC and the P\-bit; The biggest Computer Goof Ever\fR Robert Bemer. +.SH HISTORY +.IX Header "HISTORY" +15 April 2001: added UTF\-8 and UTF-EBCDIC to main table, pvhp. +.SH AUTHOR +.IX Header "AUTHOR" +Peter Prymmer pvhp@best.com wrote this in 1999 and 2000 +with CCSID 0819 and 0037 help from Chris Leach and +André Pirard A.Pirard@ulg.ac.be as well as POSIX-BC +help from Thomas Dorner Thomas.Dorner@start.de. +Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and +Joe Smith. Trademarks, registered trademarks, service marks and +registered service marks used in this document are the property of +their respective owners. +.PP +Now maintained by Perl5 Porters.