Diffstat (limited to 'upstream/debian-unstable/man3/Unicode::Collate.3perl')
-rw-r--r-- | upstream/debian-unstable/man3/Unicode::Collate.3perl | 1192 |
1 files changed, 1192 insertions, 0 deletions
diff --git a/upstream/debian-unstable/man3/Unicode::Collate.3perl b/upstream/debian-unstable/man3/Unicode::Collate.3perl new file mode 100644 index 00000000..c507ba3c --- /dev/null +++ b/upstream/debian-unstable/man3/Unicode::Collate.3perl @@ -0,0 +1,1192 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "Unicode::Collate 3perl" +.TH Unicode::Collate 3perl 2024-01-12 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. 
+.if n .ad l +.nh +.SH NAME +Unicode::Collate \- Unicode Collation Algorithm +.SH SYNOPSIS +.IX Header "SYNOPSIS" +.Vb 1 +\& use Unicode::Collate; +\& +\& #construct +\& $Collator = Unicode::Collate\->new(%tailoring); +\& +\& #sort +\& @sorted = $Collator\->sort(@not_sorted); +\& +\& #compare +\& $result = $Collator\->cmp($a, $b); # returns 1, 0, or \-1. +.Ve +.PP +\&\fBNote:\fR Strings in \f(CW@not_sorted\fR, \f(CW$a\fR and \f(CW$b\fR are interpreted +according to Perl's Unicode support. See perlunicode, +perluniintro, perlunitut, perlunifaq, utf8. +Otherwise you can use \f(CW\*(C`preprocess\*(C'\fR or should decode them before. +.SH DESCRIPTION +.IX Header "DESCRIPTION" +This module is an implementation of Unicode Technical Standard #10 +(a.k.a. UTS #10) \- Unicode Collation Algorithm (a.k.a. UCA). +.SS "Constructor and Tailoring" +.IX Subsection "Constructor and Tailoring" +The \f(CW\*(C`new\*(C'\fR method returns a collator object. If \fBnew()\fR is called +with no parameters, the collator should do the default collation. 
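As a minimal sketch of the default collation described above, the following builds a collator with no tailoring, sorts a short list, and compares two strings. The word list is hypothetical sample data chosen only for illustration; the methods used (`new`, `sort`, `cmp`) are the ones shown in the SYNOPSIS.

```perl
use strict;
use warnings;
use Unicode::Collate;

# No parameters: the collator applies the default collation.
my $Collator = Unicode::Collate->new();

# Hypothetical sample data, used only for this example.
my @not_sorted = ("banana", "Apple", "cherry", "apple");
my @sorted     = $Collator->sort(@not_sorted);

# cmp() returns 1, 0, or -1, in the manner of Perl's built-in cmp.
my $result = $Collator->cmp("apple", "Apple");

print join(", ", @sorted), "\n";
print "cmp(apple, Apple) = $result\n";
```

With the default settings lowercase sorts before uppercase, so "apple" should come out ahead of "Apple"; pass `upper_before_lower => 1` to reverse that.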
+.PP +.Vb 10 +\& $Collator = Unicode::Collate\->new( +\& UCA_Version => $UCA_Version, +\& alternate => $alternate, # alias for \*(Aqvariable\*(Aq +\& backwards => $levelNumber, # or \e@levelNumbers +\& entry => $element, +\& hangul_terminator => $term_primary_weight, +\& highestFFFF => $bool, +\& identical => $bool, +\& ignoreName => qr/$ignoreName/, +\& ignoreChar => qr/$ignoreChar/, +\& ignore_level2 => $bool, +\& katakana_before_hiragana => $bool, +\& level => $collationLevel, +\& long_contraction => $bool, +\& minimalFFFE => $bool, +\& normalization => $normalization_form, +\& overrideCJK => \e&overrideCJK, +\& overrideHangul => \e&overrideHangul, +\& preprocess => \e&preprocess, +\& rearrange => \e@charList, +\& rewrite => \e&rewrite, +\& suppress => \e@charList, +\& table => $filename, +\& undefName => qr/$undefName/, +\& undefChar => qr/$undefChar/, +\& upper_before_lower => $bool, +\& variable => $variable, +\& ); +.Ve +.IP UCA_Version 4 +.IX Item "UCA_Version" +If the revision (previously "tracking version") number of UCA is given, +behavior of that revision is emulated on collating. +If omitted, the return value of \f(CWUCA_Version()\fR is used. +.Sp +The following revisions are supported. The default is 43. +.Sp +.Vb 10 +\& UCA Unicode Standard DUCET (@version) +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& 8 3.1 3.0.1 (3.0.1d9) +\& 9 3.1 with Corrigendum 3 3.1.1 +\& 11 4.0.0 +\& 14 4.1.0 +\& 16 5.0.0 +\& 18 5.1.0 +\& 20 5.2.0 +\& 22 6.0.0 +\& 24 6.1.0 +\& 26 6.2.0 +\& 28 6.3.0 +\& 30 7.0.0 +\& 32 8.0.0 +\& 34 9.0.0 +\& 36 10.0.0 +\& 38 11.0.0 +\& 40 12.0.0 +\& 41 12.1.0 +\& 43 13.0.0 +.Ve +.Sp +* See below for \f(CW\*(C`long_contraction\*(C'\fR with \f(CW\*(C`UCA_Version\*(C'\fR 22 and 24. +.Sp +* Noncharacters (e.g. U+FFFF) are not ignored, and can be overridden +since \f(CW\*(C`UCA_Version\*(C'\fR 22. 
+.Sp +* Out-of-range codepoints (greater than U+10FFFF) are not ignored, +and can be overridden since \f(CW\*(C`UCA_Version\*(C'\fR 22. +.Sp +* Fully ignorable characters were ignored, and would not interrupt +contractions with \f(CW\*(C`UCA_Version\*(C'\fR 9 and 11. +.Sp +* Treatment of ignorables after variables and some behaviors +were changed at \f(CW\*(C`UCA_Version\*(C'\fR 9. +.Sp +* Characters regarded as CJK unified ideographs (cf. \f(CW\*(C`overrideCJK\*(C'\fR) +depend on \f(CW\*(C`UCA_Version\*(C'\fR. +.Sp +* Many hangul jamo are assigned at \f(CW\*(C`UCA_Version\*(C'\fR 20, that will affect +\&\f(CW\*(C`hangul_terminator\*(C'\fR. +.IP alternate 4 +.IX Item "alternate" +\&\-\- see 3.2.2 Alternate Weighting, version 8 of UTS #10 +.Sp +For backward compatibility, \f(CW\*(C`alternate\*(C'\fR (old name) can be used +as an alias for \f(CW\*(C`variable\*(C'\fR. +.IP backwards 4 +.IX Item "backwards" +\&\-\- see 3.4 Backward Accents, UTS #10. +.Sp +.Vb 1 +\& backwards => $levelNumber or \e@levelNumbers +.Ve +.Sp +Weights in reverse order; ex. level 2 (diacritic ordering) in French. +If omitted (or \f(CW$levelNumber\fR is \f(CW\*(C`undef\*(C'\fR or \f(CW\*(C`\e@levelNumbers\*(C'\fR is \f(CW\*(C`[]\*(C'\fR), +forwards at all the levels. +.IP entry 4 +.IX Item "entry" +\&\-\- see 5 Tailoring; 9.1 Allkeys File Format, UTS #10. +.Sp +If the same character (or a sequence of characters) exists +in the collation element table through \f(CW\*(C`table\*(C'\fR, +mapping to collation elements is overridden. +If it does not exist, the mapping is defined additionally. 
+.Sp +.Vb 12 +\& entry => <<\*(AqENTRY\*(Aq, # for DUCET v4.0.0 (allkeys\-4.0.0.txt) +\&0063 0068 ; [.0E6A.0020.0002.0063] # ch +\&0043 0068 ; [.0E6A.0020.0007.0043] # Ch +\&0043 0048 ; [.0E6A.0020.0008.0043] # CH +\&006C 006C ; [.0F4C.0020.0002.006C] # ll +\&004C 006C ; [.0F4C.0020.0007.004C] # Ll +\&004C 004C ; [.0F4C.0020.0008.004C] # LL +\&00F1 ; [.0F7B.0020.0002.00F1] # n\-tilde +\&006E 0303 ; [.0F7B.0020.0002.00F1] # n\-tilde +\&00D1 ; [.0F7B.0020.0008.00D1] # N\-tilde +\&004E 0303 ; [.0F7B.0020.0008.00D1] # N\-tilde +\&ENTRY +\& +\& entry => <<\*(AqENTRY\*(Aq, # for DUCET v4.0.0 (allkeys\-4.0.0.txt) +\&00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e> +\&00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E> +\&ENTRY +.Ve +.Sp +\&\fBNOTE:\fR The code point in the UCA file format (before \f(CW\*(Aq;\*(Aq\fR) +\&\fBmust\fR be a Unicode code point (defined as hexadecimal), +but not a native code point. +So \f(CW0063\fR must always denote \f(CW\*(C`U+0063\*(C'\fR, +but not a character of \f(CW"\ex63"\fR. +.Sp +Weighting may vary depending on collation element table. +So ensure the weights defined in \f(CW\*(C`entry\*(C'\fR will be consistent with +those in the collation element table loaded via \f(CW\*(C`table\*(C'\fR. +.Sp +In DUCET v4.0.0, primary weight of \f(CW\*(C`C\*(C'\fR is \f(CW0E60\fR +and that of \f(CW\*(C`D\*(C'\fR is \f(CW\*(C`0E6D\*(C'\fR. So setting primary weight of \f(CW\*(C`CH\*(C'\fR to \f(CW\*(C`0E6A\*(C'\fR +(as a value between \f(CW0E60\fR and \f(CW\*(C`0E6D\*(C'\fR) +makes ordering as \f(CW\*(C`C < CH < D\*(C'\fR. +Exactly speaking DUCET already has some characters between \f(CW\*(C`C\*(C'\fR and \f(CW\*(C`D\*(C'\fR: +\&\f(CW\*(C`small capital C\*(C'\fR (\f(CW\*(C`U+1D04\*(C'\fR) with primary weight \f(CW0E64\fR, +\&\f(CW\*(C`c\-hook/C\-hook\*(C'\fR (\f(CW\*(C`U+0188/U+0187\*(C'\fR) with \f(CW0E65\fR, +and \f(CW\*(C`c\-curl\*(C'\fR (\f(CW\*(C`U+0255\*(C'\fR) with \f(CW0E69\fR. 
+Then primary weight \f(CW\*(C`0E6A\*(C'\fR for \f(CW\*(C`CH\*(C'\fR makes \f(CW\*(C`CH\*(C'\fR +ordered between \f(CW\*(C`c\-curl\*(C'\fR and \f(CW\*(C`D\*(C'\fR. +.IP hangul_terminator 4 +.IX Item "hangul_terminator" +\&\-\- see 7.1.4 Trailing Weights, UTS #10. +.Sp +If a true value is given (non-zero but should be positive), +it will be added as a terminator primary weight to the end of +every standard Hangul syllable. Secondary and any higher weights +for terminator are set to zero. +If the value is false or \f(CW\*(C`hangul_terminator\*(C'\fR key does not exist, +insertion of terminator weights will not be performed. +.Sp +Boundaries of Hangul syllables are determined +according to conjoining Jamo behavior in \fIthe Unicode Standard\fR +and \fIHangulSyllableType.txt\fR. +.Sp +\&\fBImplementation Note:\fR +(1) For expansion mapping (Unicode character mapped +to a sequence of collation elements), a terminator will not be added +between collation elements, even if Hangul syllable boundary exists there. +Addition of terminator is restricted to the next position +to the last collation element. +.Sp +(2) Non-conjoining Hangul letters +(Compatibility Jamo, halfwidth Jamo, and enclosed letters) are not +automatically terminated with a terminator primary weight. +These characters may need terminator included in a collation element +table beforehand. +.IP highestFFFF 4 +.IX Item "highestFFFF" +\&\-\- see 2.4 Tailored noncharacter weights, UTS #35 (LDML) Part 5: Collation. +.Sp +If the parameter is made true, \f(CW\*(C`U+FFFF\*(C'\fR has a highest primary weight. +When a boolean of \f(CW\*(C`$coll\->ge($str, "abc")\*(C'\fR and +\&\f(CW\*(C`$coll\->le($str, "abc\ex{FFFF}")\*(C'\fR is true, it is expected that \f(CW$str\fR +begins with \f(CW"abc"\fR, or another primary equivalent. +\&\f(CW$str\fR may be \f(CW"abcd"\fR, \f(CW"abc012"\fR, but should not include \f(CW\*(C`U+FFFF\*(C'\fR +such as \f(CW"abc\ex{FFFF}xyz"\fR. 
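The range test described for `highestFFFF` can be sketched as follows. This is only an illustration under the assumption that the default table is in use; the helper name `in_abc_range` and the candidate strings are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Collate;

# U+FFFF gets the highest primary weight, so "abc\x{FFFF}" acts as an
# upper bound for every string that begins with "abc".
my $coll = Unicode::Collate->new(highestFFFF => 1);

# Hypothetical helper: true when $str collates between "abc" and
# "abc\x{FFFF}" inclusive.
sub in_abc_range {
    my ($str) = @_;
    return $coll->ge($str, "abc") && $coll->le($str, "abc\x{FFFF}");
}

# Hypothetical candidates: three inside the range, two outside it.
my @hits = grep { in_abc_range($_) } ("abc", "abcd", "abc012", "abd", "ab");

print join(", ", @hits), "\n";
```

The advantage over comparing against "abd" is that the caller does not need to know which letter follows "c" in the collation being used.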
+.Sp +\&\f(CW\*(C`$coll\->le($str, "abc\ex{FFFF}")\*(C'\fR works like \f(CW\*(C`$coll\->lt($str, "abd")\*(C'\fR +almost, but the latter has a problem that you should know which letter is +next to \f(CW\*(C`c\*(C'\fR. For a certain language where \f(CW\*(C`ch\*(C'\fR as the next letter, +\&\f(CW"abch"\fR is greater than \f(CW"abc\ex{FFFF}"\fR, but less than \f(CW"abd"\fR. +.Sp +Note: +This is equivalent to \f(CW\*(C`(entry => \*(AqFFFF ; [.FFFE.0020.0005.FFFF]\*(Aq)\*(C'\fR. +Any other character than \f(CW\*(C`U+FFFF\*(C'\fR can be tailored by \f(CW\*(C`entry\*(C'\fR. +.IP identical 4 +.IX Item "identical" +\&\-\- see A.3 Deterministic Comparison, UTS #10. +.Sp +By default, strings whose weights are equal should be equal, +even though their code points are not equal. +Completely ignorable characters are ignored. +.Sp +If the parameter is made true, a final, tie-breaking level is used. +If no difference of weights is found after the comparison through +all the level specified by \f(CW\*(C`level\*(C'\fR, the comparison with code points +will be performed. +For the tie-breaking comparison, the sort key has code points +of the original string appended. +Completely ignorable characters are not ignored. +.Sp +If \f(CW\*(C`preprocess\*(C'\fR and/or \f(CW\*(C`normalization\*(C'\fR is applied, the code points +of the string after them (in NFD by default) are used. +.IP ignoreChar 4 +.IX Item "ignoreChar" +.PD 0 +.IP ignoreName 4 +.IX Item "ignoreName" +.PD +\&\-\- see 3.6 Variable Weighting, UTS #10. +.Sp +Makes the entry in the table completely ignorable; +i.e. as if the weights were zero at all level. +.Sp +Through \f(CW\*(C`ignoreChar\*(C'\fR, any character matching \f(CW\*(C`qr/$ignoreChar/\*(C'\fR +will be ignored. Through \f(CW\*(C`ignoreName\*(C'\fR, any character whose name +(given in the \f(CW\*(C`table\*(C'\fR file as a comment) matches \f(CW\*(C`qr/$ignoreName/\*(C'\fR +will be ignored. +.Sp +E.g. 
when 'a' and 'e' are ignorable, +\&'element' is equal to 'lament' (or 'lmnt'). +.IP ignore_level2 4 +.IX Item "ignore_level2" +\&\-\- see 5.1 Parametric Tailoring, UTS #10. +.Sp +By default, case-sensitive comparison (that is level 3 difference) +won't ignore accents (that is level 2 difference). +.Sp +If the parameter is made true, accents (and other primary ignorable +characters) are ignored, even though cases are taken into account. +.Sp +\&\fBNOTE\fR: \f(CW\*(C`level\*(C'\fR should be 3 or greater. +.IP katakana_before_hiragana 4 +.IX Item "katakana_before_hiragana" +\&\-\- see 7.2 Tertiary Weight Table, UTS #10. +.Sp +By default, hiragana is before katakana. +If the parameter is made true, this is reversed. +.Sp +\&\fBNOTE\fR: This parameter simplemindedly assumes that any hiragana/katakana +distinctions must occur in level 3, and their weights at level 3 must be +same as those mentioned in 7.3.1, UTS #10. +If you define your collation elements which violate this requirement, +this parameter does not work validly. +.IP level 4 +.IX Item "level" +\&\-\- see 4.3 Form Sort Key, UTS #10. +.Sp +Set the maximum level. +Any higher levels than the specified one are ignored. +.Sp +.Vb 4 +\& Level 1: alphabetic ordering +\& Level 2: diacritic ordering +\& Level 3: case ordering +\& Level 4: tie\-breaking (e.g. in the case when variable is \*(Aqshifted\*(Aq) +\& +\& ex.level => 2, +.Ve +.Sp +If omitted, the maximum is the 4th. +.Sp +\&\fBNOTE:\fR The DUCET includes weights over 0xFFFF at the 4th level. +But this module only uses weights within 0xFFFF. +When \f(CW\*(C`variable\*(C'\fR is 'blanked' or 'non\-ignorable' (other than 'shifted' +and 'shift\-trimmed'), the level 4 may be unreliable. +.Sp +See also \f(CW\*(C`identical\*(C'\fR. +.IP long_contraction 4 +.IX Item "long_contraction" +\&\-\- see 3.8.2 Well-Formedness of the DUCET, 4.2 Produce Array, UTS #10. 
+.Sp +If the parameter is made true, for a contraction with three or more +characters (here nicknamed "long contraction"), initial substrings +will be handled. +For example, a contraction ABC, where A is a starter, and B and C +are non-starters (characters with a non-zero combining character class), +will be detected even if AB is not itself a contraction. +.Sp +\&\fBDefault:\fR Usually false. +If \f(CW\*(C`UCA_Version\*(C'\fR is 22 or 24, and the value of \f(CW\*(C`long_contraction\*(C'\fR +is not specified in \f(CWnew()\fR, a true value is set implicitly. +This is a workaround to pass Conformance Tests for Unicode 6.0.0 and 6.1.0. +.Sp +\&\f(CWchange()\fR handles \f(CW\*(C`long_contraction\*(C'\fR explicitly only. +If \f(CW\*(C`long_contraction\*(C'\fR is not specified in \f(CWchange()\fR, even though +\&\f(CW\*(C`UCA_Version\*(C'\fR is changed, \f(CW\*(C`long_contraction\*(C'\fR will not be changed. +.Sp +\&\fBLimitation:\fR Scanning non-starters is one-way (no backtracking). +If AB is found but ABC is not found, other long contractions where +the first character is A and the second is not B may not be found. +.Sp +Under \f(CW\*(C`(normalization => undef)\*(C'\fR, the detection step for discontiguous +contractions will be skipped. +.Sp +\&\fBNote:\fR The following contractions in DUCET are not considered +in steps S2.1.1 to S2.1.3, where they are discontiguous. +.Sp +.Vb 2 +\& 0FB2 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC RR) +\& 0FB3 0F71 0F80 (TIBETAN VOWEL SIGN VOCALIC LL) +.Ve +.Sp +For example \f(CW\*(C`TIBETAN VOWEL SIGN VOCALIC RR\*(C'\fR with \f(CW\*(C`COMBINING TILDE OVERLAY\*(C'\fR +(\f(CW\*(C`U+0344\*(C'\fR) is \f(CW\*(C`0FB2 0344 0F71 0F80\*(C'\fR in NFD. +In this case \f(CW\*(C`0FB2 0F80\*(C'\fR (\f(CW\*(C`TIBETAN VOWEL SIGN VOCALIC R\*(C'\fR) is detected, +instead of \f(CW\*(C`0FB2 0F71 0F80\*(C'\fR. 
+Inserted \f(CW0344\fR makes \f(CW\*(C`0FB2 0F71 0F80\*(C'\fR discontiguous and lack of +contraction \f(CW\*(C`0FB2 0F71\*(C'\fR prohibits \f(CW\*(C`0FB2 0F71 0F80\*(C'\fR from being detected. +.IP minimalFFFE 4 +.IX Item "minimalFFFE" +\&\-\- see 1.1.1 U+FFFE, UTS #35 (LDML) Part 5: Collation. +.Sp +If the parameter is made true, \f(CW\*(C`U+FFFE\*(C'\fR has a minimal primary weight. +The comparison between \f(CW"$a1\ex{FFFE}$a2"\fR and \f(CW"$b1\ex{FFFE}$b2"\fR +first compares \f(CW$a1\fR and \f(CW$b1\fR at level 1, and +then \f(CW$a2\fR and \f(CW$b2\fR at level 1, as followed. +.Sp +.Vb 12 +\& "ab\ex{FFFE}a" +\& "Ab\ex{FFFE}a" +\& "ab\ex{FFFE}c" +\& "Ab\ex{FFFE}c" +\& "ab\ex{FFFE}xyz" +\& "abc\ex{FFFE}def" +\& "abc\ex{FFFE}xYz" +\& "aBc\ex{FFFE}xyz" +\& "abcX\ex{FFFE}def" +\& "abcx\ex{FFFE}xyz" +\& "b\ex{FFFE}aaa" +\& "bbb\ex{FFFE}a" +.Ve +.Sp +Note: +This is equivalent to \f(CW\*(C`(entry => \*(AqFFFE ; [.0001.0020.0005.FFFE]\*(Aq)\*(C'\fR. +Any other character than \f(CW\*(C`U+FFFE\*(C'\fR can be tailored by \f(CW\*(C`entry\*(C'\fR. +.IP normalization 4 +.IX Item "normalization" +\&\-\- see 4.1 Normalize, UTS #10. +.Sp +If specified, strings are normalized before preparation of sort keys +(the normalization is executed after preprocess). +.Sp +A form name \f(CWUnicode::Normalize::normalize()\fR accepts will be applied +as \f(CW$normalization_form\fR. +Acceptable names include \f(CW\*(AqNFD\*(Aq\fR, \f(CW\*(AqNFC\*(Aq\fR, \f(CW\*(AqNFKD\*(Aq\fR, and \f(CW\*(AqNFKC\*(Aq\fR. +See \f(CWUnicode::Normalize::normalize()\fR for detail. +If omitted, \f(CW\*(AqNFD\*(Aq\fR is used. +.Sp +\&\f(CW\*(C`normalization\*(C'\fR is performed after \f(CW\*(C`preprocess\*(C'\fR (if defined). +.Sp +Furthermore, special values, \f(CW\*(C`undef\*(C'\fR and \f(CW"prenormalized"\fR, can be used, +though they are not concerned with \f(CWUnicode::Normalize::normalize()\fR. 
+.Sp +If \f(CW\*(C`undef\*(C'\fR (not a string \f(CW"undef"\fR) is passed explicitly +as the value for this key, +any normalization is not carried out (this may make tailoring easier +if any normalization is not desired). Under \f(CW\*(C`(normalization => undef)\*(C'\fR, +only contiguous contractions are resolved; +e.g. even if \f(CW\*(C`A\-ring\*(C'\fR (and \f(CW\*(C`A\-ring\-cedilla\*(C'\fR) is ordered after \f(CW\*(C`Z\*(C'\fR, +\&\f(CW\*(C`A\-cedilla\-ring\*(C'\fR would be primary equal to \f(CW\*(C`A\*(C'\fR. +In this point, +\&\f(CW\*(C`(normalization => undef, preprocess => sub { NFD(shift) })\*(C'\fR +\&\fBis not\fR equivalent to \f(CW\*(C`(normalization => \*(AqNFD\*(Aq)\*(C'\fR. +.Sp +In the case of \f(CW\*(C`(normalization => "prenormalized")\*(C'\fR, +any normalization is not performed, but +discontiguous contractions with combining characters are performed. +Therefore +\&\f(CW\*(C`(normalization => \*(Aqprenormalized\*(Aq, preprocess => sub { NFD(shift) })\*(C'\fR +\&\fBis\fR equivalent to \f(CW\*(C`(normalization => \*(AqNFD\*(Aq)\*(C'\fR. +If source strings are finely prenormalized, +\&\f(CW\*(C`(normalization => \*(Aqprenormalized\*(Aq)\*(C'\fR may save time for normalization. +.Sp +Except \f(CW\*(C`(normalization => undef)\*(C'\fR, +\&\fBUnicode::Normalize\fR is required (see also \fBCAVEAT\fR). +.IP overrideCJK 4 +.IX Item "overrideCJK" +\&\-\- see 7.1 Derived Collation Elements, UTS #10. +.Sp +By default, CJK unified ideographs are ordered in Unicode codepoint +order, but those in the CJK Unified Ideographs block are less than +those in the CJK Unified Ideographs Extension A etc. +.Sp +.Vb 10 +\& In the CJK Unified Ideographs block: +\& U+4E00..U+9FA5 if UCA_Version is 8, 9 or 11. +\& U+4E00..U+9FBB if UCA_Version is 14 or 16. +\& U+4E00..U+9FC3 if UCA_Version is 18. +\& U+4E00..U+9FCB if UCA_Version is 20 or 22. +\& U+4E00..U+9FCC if UCA_Version is 24 to 30. +\& U+4E00..U+9FD5 if UCA_Version is 32 or 34. +\& U+4E00..U+9FEA if UCA_Version is 36. 
+\& U+4E00..U+9FEF if UCA_Version is 38, 40 or 41. +\& U+4E00..U+9FFC if UCA_Version is 43. +\& +\& In the CJK Unified Ideographs Extension blocks: +\& Ext.A (U+3400..U+4DB5) if UCA_Version is 8 to 41. +\& Ext.A (U+3400..U+4DBF) if UCA_Version is 43. +\& Ext.B (U+20000..U+2A6D6) if UCA_Version is 8 to 41. +\& Ext.B (U+20000..U+2A6DD) if UCA_Version is 43. +\& Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or later. +\& Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or later. +\& Ext.E (U+2B820..U+2CEA1) if UCA_Version is 32 or later. +\& Ext.F (U+2CEB0..U+2EBE0) if UCA_Version is 36 or later. +\& Ext.G (U+30000..U+3134A) if UCA_Version is 43. +.Ve +.Sp +Through \f(CW\*(C`overrideCJK\*(C'\fR, ordering of CJK unified ideographs (including +extensions) can be overridden. +.Sp +ex. CJK unified ideographs in the JIS code point order. +.Sp +.Vb 7 +\& overrideCJK => sub { +\& my $u = shift; # get a Unicode codepoint +\& my $b = pack(\*(Aqn\*(Aq, $u); # to UTF\-16BE +\& my $s = your_unicode_to_sjis_converter($b); # convert +\& my $n = unpack(\*(Aqn\*(Aq, $s); # convert sjis to short +\& [ $n, 0x20, 0x2, $u ]; # return the collation element +\& }, +.Ve +.Sp +The return value may be an arrayref of 1st to 4th weights as shown +above. The return value may be an integer as the primary weight +as shown below. If \f(CW\*(C`undef\*(C'\fR is returned, the default derived +collation element will be used. +.Sp +.Vb 7 +\& overrideCJK => sub { +\& my $u = shift; # get a Unicode codepoint +\& my $b = pack(\*(Aqn\*(Aq, $u); # to UTF\-16BE +\& my $s = your_unicode_to_sjis_converter($b); # convert +\& my $n = unpack(\*(Aqn\*(Aq, $s); # convert sjis to short +\& return $n; # return the primary weight +\& }, +.Ve +.Sp +The return value may be a list containing zero or more of +an arrayref, an integer, or \f(CW\*(C`undef\*(C'\fR. +.Sp +ex. ignores all CJK unified ideographs. 
+.Sp +.Vb 1 +\& overrideCJK => sub {()}, # CODEREF returning empty list +\& +\& # where \->eq("Pe\ex{4E00}rl", "Perl") is true +\& # as U+4E00 is a CJK unified ideograph and to be ignorable. +.Ve +.Sp +If a false value (including \f(CW\*(C`undef\*(C'\fR) is passed, \f(CW\*(C`overrideCJK\*(C'\fR +has no effect. +\&\f(CW\*(C`$Collator\->change(overrideCJK => 0)\*(C'\fR resets the old one. +.Sp +But assignment of weight for CJK unified ideographs +in \f(CW\*(C`table\*(C'\fR or \f(CW\*(C`entry\*(C'\fR is still valid. +If \f(CW\*(C`undef\*(C'\fR is passed explicitly as the value for this key, +weights for CJK unified ideographs are treated as undefined. +However when \f(CW\*(C`UCA_Version\*(C'\fR > 8, \f(CW\*(C`(overrideCJK => undef)\*(C'\fR +has no special meaning. +.Sp +\&\fBNote:\fR In addition to them, 12 CJK compatibility ideographs (\f(CW\*(C`U+FA0E\*(C'\fR, +\&\f(CW\*(C`U+FA0F\*(C'\fR, \f(CW\*(C`U+FA11\*(C'\fR, \f(CW\*(C`U+FA13\*(C'\fR, \f(CW\*(C`U+FA14\*(C'\fR, \f(CW\*(C`U+FA1F\*(C'\fR, \f(CW\*(C`U+FA21\*(C'\fR, \f(CW\*(C`U+FA23\*(C'\fR, +\&\f(CW\*(C`U+FA24\*(C'\fR, \f(CW\*(C`U+FA27\*(C'\fR, \f(CW\*(C`U+FA28\*(C'\fR, \f(CW\*(C`U+FA29\*(C'\fR) are also treated as CJK unified +ideographs. But they can't be overridden via \f(CW\*(C`overrideCJK\*(C'\fR when you use +DUCET, as the table includes weights for them. \f(CW\*(C`table\*(C'\fR or \f(CW\*(C`entry\*(C'\fR has +priority over \f(CW\*(C`overrideCJK\*(C'\fR. +.IP overrideHangul 4 +.IX Item "overrideHangul" +\&\-\- see 7.1 Derived Collation Elements, UTS #10. +.Sp +By default, Hangul syllables are decomposed into Hangul Jamo, +even if \f(CW\*(C`(normalization => undef)\*(C'\fR. +But the mapping of Hangul syllables may be overridden. +.Sp +This parameter works like \f(CW\*(C`overrideCJK\*(C'\fR, so see there for examples. +.Sp +If you want to override the mapping of Hangul syllables, +NFD and NFKD are not appropriate, since NFD and NFKD will decompose +Hangul syllables before overriding. 
FCD may decompose Hangul syllables +as the case may be. +.Sp +If a false value (but not \f(CW\*(C`undef\*(C'\fR) is passed, \f(CW\*(C`overrideHangul\*(C'\fR +has no effect. +\&\f(CW\*(C`$Collator\->change(overrideHangul => 0)\*(C'\fR resets the old one. +.Sp +If \f(CW\*(C`undef\*(C'\fR is passed explicitly as the value for this key, +weight for Hangul syllables is treated as undefined +without decomposition into Hangul Jamo. +But definition of weight for Hangul syllables +in \f(CW\*(C`table\*(C'\fR or \f(CW\*(C`entry\*(C'\fR is still valid. +.IP overrideOut 4 +.IX Item "overrideOut" +\&\-\- see 7.1.1 Handling Ill-Formed Code Unit Sequences, UTS #10. +.Sp +Perl seems to allow out-of-range values (greater than 0x10FFFF). +By default, out-of-range values are replaced with \f(CW\*(C`U+FFFD\*(C'\fR +(REPLACEMENT CHARACTER) when \f(CW\*(C`UCA_Version\*(C'\fR >= 22, +or ignored when \f(CW\*(C`UCA_Version\*(C'\fR <= 20. +.Sp +When \f(CW\*(C`UCA_Version\*(C'\fR >= 22, the weights of out-of-range values +can be overridden. Though \f(CW\*(C`table\*(C'\fR or \f(CW\*(C`entry\*(C'\fR are available for them, +out-of-range values are too many. +.Sp +\&\f(CW\*(C`overrideOut\*(C'\fR can perform it algorithmically. +This parameter works like \f(CW\*(C`overrideCJK\*(C'\fR, so see there for examples. +.Sp +ex. ignores all out-of-range values. +.Sp +.Vb 1 +\& overrideOut => sub {()}, # CODEREF returning empty list +.Ve +.Sp +If a false value (including \f(CW\*(C`undef\*(C'\fR) is passed, \f(CW\*(C`overrideOut\*(C'\fR +has no effect. +\&\f(CW\*(C`$Collator\->change(overrideOut => 0)\*(C'\fR resets the old one. +.Sp +\&\fBNOTE ABOUT U+FFFD:\fR +.Sp +UCA recommends that out-of-range values should not be ignored for security +reasons. Say, \f(CW"pe\ex{110000}rl"\fR should not be equal to \f(CW"perl"\fR. 
+However, \f(CW\*(C`U+FFFD\*(C'\fR is wrongly mapped to a variable collation element +in DUCET for Unicode 6.0.0 to 6.2.0, that means out-of-range values will be +ignored when \f(CW\*(C`variable\*(C'\fR isn't \f(CW\*(C`Non\-ignorable\*(C'\fR. +.Sp +The mapping of \f(CW\*(C`U+FFFD\*(C'\fR is corrected in Unicode 6.3.0. +see <http://www.unicode.org/reports/tr10/tr10\-28.html#Trailing_Weights> +(7.1.4 Trailing Weights). Such a correction is reproduced by this. +.Sp +.Vb 1 +\& overrideOut => sub { 0xFFFD }, # CODEREF returning a very large integer +.Ve +.Sp +This workaround is unnecessary since Unicode 6.3.0. +.IP preprocess 4 +.IX Item "preprocess" +\&\-\- see 5.4 Preprocessing, UTS #10. +.Sp +If specified, the coderef is used to preprocess each string +before the formation of sort keys. +.Sp +ex. dropping English articles, such as "a" or "the". +Then, "the pen" is before "a pencil". +.Sp +.Vb 5 +\& preprocess => sub { +\& my $str = shift; +\& $str =~ s/\eb(?:an?|the)\es+//gi; +\& return $str; +\& }, +.Ve +.Sp +\&\f(CW\*(C`preprocess\*(C'\fR is performed before \f(CW\*(C`normalization\*(C'\fR (if defined). +.Sp +ex. decoding strings in a legacy encoding such as shift-jis: +.Sp +.Vb 4 +\& $sjis_collator = Unicode::Collate\->new( +\& preprocess => \e&your_shiftjis_to_unicode_decoder, +\& ); +\& @result = $sjis_collator\->sort(@shiftjis_strings); +.Ve +.Sp +\&\fBNote:\fR Strings returned from the coderef will be interpreted +according to Perl's Unicode support. See perlunicode, +perluniintro, perlunitut, perlunifaq, utf8. +.IP rearrange 4 +.IX Item "rearrange" +\&\-\- see 3.5 Rearrangement, UTS #10. +.Sp +Characters that are not coded in logical order and to be rearranged. +If \f(CW\*(C`UCA_Version\*(C'\fR is equal to or less than 11, default is: +.Sp +.Vb 1 +\& rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ], +.Ve +.Sp +If you want to disallow any rearrangement, pass \f(CW\*(C`undef\*(C'\fR or \f(CW\*(C`[]\*(C'\fR +(a reference to empty list) as the value for this key. 
+.Sp +If \f(CW\*(C`UCA_Version\*(C'\fR is equal to or greater than 14, default is \f(CW\*(C`[]\*(C'\fR +(i.e. no rearrangement). +.Sp +\&\fBAccording to the version 9 of UCA, this parameter shall not be used; +but it is not warned at present.\fR +.IP rewrite 4 +.IX Item "rewrite" +If specified, the coderef is used to rewrite lines in \f(CW\*(C`table\*(C'\fR or \f(CW\*(C`entry\*(C'\fR. +The coderef will get each line, and then should return a rewritten line +according to the UCA file format. +If the coderef returns an empty line, the line will be skipped. +.Sp +e.g. any primary ignorable characters into tertiary ignorable: +.Sp +.Vb 5 +\& rewrite => sub { +\& my $line = shift; +\& $line =~ s/\e[\e.0000\e..{4}\e..{4}\e./[.0000.0000.0000./g; +\& return $line; +\& }, +.Ve +.Sp +This example shows rewriting weights. \f(CW\*(C`rewrite\*(C'\fR is allowed to +affect code points, weights, and the name. +.Sp +\&\fBNOTE\fR: \f(CW\*(C`table\*(C'\fR is available to use another table file; +preparing a modified table once would be more efficient than +rewriting lines on reading an unmodified table every time. +.IP suppress 4 +.IX Item "suppress" +\&\-\- see 3.12 Special-Purpose Commands, UTS #35 (LDML) Part 5: Collation. +.Sp +Contractions beginning with the specified characters are suppressed, +even if those contractions are defined in \f(CW\*(C`table\*(C'\fR. +.Sp +An example for Russian and some languages using the Cyrillic script: +.Sp +.Vb 1 +\& suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F], +.Ve +.Sp +where 0x0400 stands for \f(CW\*(C`U+0400\*(C'\fR, CYRILLIC CAPITAL LETTER IE WITH GRAVE. +.Sp +\&\fBNOTE\fR: Contractions via \f(CW\*(C`entry\*(C'\fR will not be suppressed. +.IP table 4 +.IX Item "table" +\&\-\- see 3.8 Default Unicode Collation Element Table, UTS #10. +.Sp +You can use another collation element table if desired. +.Sp +The table file should locate in the \fIUnicode/Collate\fR directory +on \f(CW@INC\fR. 
Say, if the filename is \fIFoo.txt\fR, +the table file is searched as \fIUnicode/Collate/Foo.txt\fR in \f(CW@INC\fR. +.Sp +By default, \fIallkeys.txt\fR (as the filename of DUCET) is used. +If you will prepare your own table file, any name other than \fIallkeys.txt\fR +may be better to avoid namespace conflict. +.Sp +\&\fBNOTE\fR: When XSUB is used, the DUCET is compiled on building this +module, and it may save time at the run time. +Explicit saying \f(CW\*(C`(table => \*(Aqallkeys.txt\*(Aq)\*(C'\fR, or using another table, +or using \f(CW\*(C`ignoreChar\*(C'\fR, \f(CW\*(C`ignoreName\*(C'\fR, \f(CW\*(C`undefChar\*(C'\fR, \f(CW\*(C`undefName\*(C'\fR or +\&\f(CW\*(C`rewrite\*(C'\fR will prevent this module from using the compiled DUCET. +.Sp +If \f(CW\*(C`undef\*(C'\fR is passed explicitly as the value for this key, +no file is read (but you can define collation elements via \f(CW\*(C`entry\*(C'\fR). +.Sp +A typical way to define a collation element table +without any file of table: +.Sp +.Vb 11 +\& $onlyABC = Unicode::Collate\->new( +\& table => undef, +\& entry => << \*(AqENTRIES\*(Aq, +\&0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A +\&0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A +\&0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B +\&0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B +\&0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C +\&0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C +\&ENTRIES +\& ); +.Ve +.Sp +If \f(CW\*(C`ignoreName\*(C'\fR or \f(CW\*(C`undefName\*(C'\fR is used, character names should be +specified as a comment (following \f(CW\*(C`#\*(C'\fR) on each line. +.IP undefChar 4 +.IX Item "undefChar" +.PD 0 +.IP undefName 4 +.IX Item "undefName" +.PD +\&\-\- see 6.3.3 Reducing the Repertoire, UTS #10. +.Sp +Undefines the collation element as if it were unassigned in the \f(CW\*(C`table\*(C'\fR. +This reduces the size of the table. 
+If an unassigned character appears in the string to be collated, +the sort key is made from its codepoint +as a single-character collation element, +as it is greater than any other assigned collation elements +(in the codepoint order among the unassigned characters). +But, it'd be better to ignore characters +unfamiliar to you and maybe never used. +.Sp +Through \f(CW\*(C`undefChar\*(C'\fR, any character matching \f(CW\*(C`qr/$undefChar/\*(C'\fR +will be undefined. Through \f(CW\*(C`undefName\*(C'\fR, any character whose name +(given in the \f(CW\*(C`table\*(C'\fR file as a comment) matches \f(CW\*(C`qr/$undefName/\*(C'\fR +will be undefined. +.Sp +ex. Collation weights for beyond-BMP characters are not stored in object: +.Sp +.Vb 1 +\& undefChar => qr/[^\e0\-\ex{fffd}]/, +.Ve +.IP upper_before_lower 4 +.IX Item "upper_before_lower" +\&\-\- see 6.6 Case Comparisons, UTS #10. +.Sp +By default, lowercase is before uppercase. +If the parameter is made true, this is reversed. +.Sp +\&\fBNOTE\fR: This parameter simplemindedly assumes that any lowercase/uppercase +distinctions must occur in level 3, and their weights at level 3 must be +same as those mentioned in 7.3.1, UTS #10. +If you define your collation elements which differs from this requirement, +this parameter doesn't work validly. +.IP variable 4 +.IX Item "variable" +\&\-\- see 3.6 Variable Weighting, UTS #10. +.Sp +This key allows for variable weighting of variable collation elements, +which are marked with an ASTERISK in the table +(NOTE: Many punctuation marks and symbols are variable in \fIallkeys.txt\fR). +.Sp +.Vb 1 +\& variable => \*(Aqblanked\*(Aq, \*(Aqnon\-ignorable\*(Aq, \*(Aqshifted\*(Aq, or \*(Aqshift\-trimmed\*(Aq. +.Ve +.Sp +These names are case-insensitive. +By default (if specification is omitted), 'shifted' is adopted. +.Sp +.Vb 2 +\& \*(AqBlanked\*(Aq Variable elements are made ignorable at levels 1 through 3; +\& considered at the 4th level. 
+\& +\& \*(AqNon\-Ignorable\*(Aq Variable elements are not reset to ignorable. +\& +\& \*(AqShifted\*(Aq Variable elements are made ignorable at levels 1 through 3; +\& their level 4 weight is replaced by the old level 1 weight. +\& Level 4 weight for Non\-Variable elements is 0xFFFF. +\& +\& \*(AqShift\-Trimmed\*(Aq Same as \*(Aqshifted\*(Aq, but all FFFF\*(Aqs at the 4th level +\& are trimmed. +.Ve +.SS "Methods for Collation" +.IX Subsection "Methods for Collation" +.ie n .IP """@sorted = $Collator\->sort(@not_sorted)""" 4 +.el .IP "\f(CW@sorted = $Collator\->sort(@not_sorted)\fR" 4 +.IX Item "@sorted = $Collator->sort(@not_sorted)" +Sorts a list of strings. +.ie n .IP """$result = $Collator\->cmp($a, $b)""" 4 +.el .IP "\f(CW$result = $Collator\->cmp($a, $b)\fR" 4 +.IX Item "$result = $Collator->cmp($a, $b)" +Returns 1 (when \f(CW$a\fR is greater than \f(CW$b\fR) +or 0 (when \f(CW$a\fR is equal to \f(CW$b\fR) +or \-1 (when \f(CW$a\fR is less than \f(CW$b\fR). +.ie n .IP """$result = $Collator\->eq($a, $b)""" 4 +.el .IP "\f(CW$result = $Collator\->eq($a, $b)\fR" 4 +.IX Item "$result = $Collator->eq($a, $b)" +.PD 0 +.ie n .IP """$result = $Collator\->ne($a, $b)""" 4 +.el .IP "\f(CW$result = $Collator\->ne($a, $b)\fR" 4 +.IX Item "$result = $Collator->ne($a, $b)" +.ie n .IP """$result = $Collator\->lt($a, $b)""" 4 +.el .IP "\f(CW$result = $Collator\->lt($a, $b)\fR" 4 +.IX Item "$result = $Collator->lt($a, $b)" +.ie n .IP """$result = $Collator\->le($a, $b)""" 4 +.el .IP "\f(CW$result = $Collator\->le($a, $b)\fR" 4 +.IX Item "$result = $Collator->le($a, $b)" +.ie n .IP """$result = $Collator\->gt($a, $b)""" 4 +.el .IP "\f(CW$result = $Collator\->gt($a, $b)\fR" 4 +.IX Item "$result = $Collator->gt($a, $b)" +.ie n .IP """$result = $Collator\->ge($a, $b)""" 4 +.el .IP "\f(CW$result = $Collator\->ge($a, $b)\fR" 4 +.IX Item "$result = $Collator->ge($a, $b)" +.PD +They work like the operators of the same names. +.Sp +.Vb 6 +\& eq : whether $a is equal to $b.
+\& ne : whether $a is not equal to $b. +\& lt : whether $a is less than $b. +\& le : whether $a is less than $b or equal to $b. +\& gt : whether $a is greater than $b. +\& ge : whether $a is greater than $b or equal to $b. +.Ve +.ie n .IP """$sortKey = $Collator\->getSortKey($string)""" 4 +.el .IP "\f(CW$sortKey = $Collator\->getSortKey($string)\fR" 4 +.IX Item "$sortKey = $Collator->getSortKey($string)" +\&\-\- see 4.3 Form Sort Key, UTS #10. +.Sp +Returns a sort key. +.Sp +You compare the sort keys using a binary comparison +and get the result of the comparison of the strings using UCA. +.Sp +.Vb 1 +\& $Collator\->getSortKey($a) cmp $Collator\->getSortKey($b) +\& +\& is equivalent to +\& +\& $Collator\->cmp($a, $b) +.Ve +.ie n .IP """$sortKeyForm = $Collator\->viewSortKey($string)""" 4 +.el .IP "\f(CW$sortKeyForm = $Collator\->viewSortKey($string)\fR" 4 +.IX Item "$sortKeyForm = $Collator->viewSortKey($string)" +Converts a sorting key into its representation form. +If \f(CW\*(C`UCA_Version\*(C'\fR is 8, the output is slightly different. +.Sp +.Vb 3 +\& use Unicode::Collate; +\& my $c = Unicode::Collate\->new(); +\& print $c\->viewSortKey("Perl"),"\en"; +\& +\& # output: +\& # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF] +\& # Level 1 Level 2 Level 3 Level 4 +.Ve +.SS "Methods for Searching" +.IX Subsection "Methods for Searching" +The \f(CW\*(C`match\*(C'\fR, \f(CW\*(C`gmatch\*(C'\fR, \f(CW\*(C`subst\*(C'\fR, \f(CW\*(C`gsubst\*(C'\fR methods work +like \f(CW\*(C`m//\*(C'\fR, \f(CW\*(C`m//g\*(C'\fR, \f(CW\*(C`s///\*(C'\fR, \f(CW\*(C`s///g\*(C'\fR, respectively, +but they are not aware of any pattern, but only a literal substring. 
+.PP +\&\fBDISCLAIMER:\fR If the \f(CW\*(C`preprocess\*(C'\fR or \f(CW\*(C`normalization\*(C'\fR parameter is true +for \f(CW$Collator\fR, calling these methods (\f(CW\*(C`index\*(C'\fR, \f(CW\*(C`match\*(C'\fR, \f(CW\*(C`gmatch\*(C'\fR, +\&\f(CW\*(C`subst\*(C'\fR, \f(CW\*(C`gsubst\*(C'\fR) croaks, as the position and the length might +differ from those in the specified string. +.PP +The \f(CW\*(C`rearrange\*(C'\fR and \f(CW\*(C`hangul_terminator\*(C'\fR parameters are ignored. +\&\f(CW\*(C`katakana_before_hiragana\*(C'\fR and \f(CW\*(C`upper_before_lower\*(C'\fR don't affect +matching and searching, since only equality matters, not which string is greater or less. +.ie n .IP """$position = $Collator\->index($string, $substring[, $position])""" 4 +.el .IP "\f(CW$position = $Collator\->index($string, $substring[, $position])\fR" 4 +.IX Item "$position = $Collator->index($string, $substring[, $position])" +.PD 0 +.ie n .IP """($position, $length) = $Collator\->index($string, $substring[, $position])""" 4 +.el .IP "\f(CW($position, $length) = $Collator\->index($string, $substring[, $position])\fR" 4 +.IX Item "($position, $length) = $Collator->index($string, $substring[, $position])" +.PD +If \f(CW$substring\fR matches a part of \f(CW$string\fR, returns +the position of the first occurrence of the matching part in scalar context; +in list context, returns a two-element list of +the position and the length of the matching part. +.Sp +If \f(CW$substring\fR does not match any part of \f(CW$string\fR, +returns \f(CW\-1\fR in scalar context and +an empty list in list context. +.Sp +e.g. when the content of \f(CW$str\fR is \f(CW\*(C`"Ich mu\*(C'\fRß\f(CW\*(C` studieren Perl."\*(C'\fR, +you say the following where \f(CW$sub\fR is \f(CW\*(C`"M\*(C'\fRü\f(CW\*(C`SS"\*(C'\fR, +.Sp +.Vb 6 +\& my $Collator = Unicode::Collate\->new( normalization => undef, level => 1 ); +\& # (normalization => undef) is REQUIRED.
+\& my $match; +\& if (my($pos,$len) = $Collator\->index($str, $sub)) { +\& $match = substr($str, $pos, $len); +\& } +.Ve +.Sp +and get \f(CW\*(C`"mu\*(C'\fRß\f(CW\*(C`"\*(C'\fR in \f(CW$match\fR, since \f(CW\*(C`"mu\*(C'\fRß\f(CW\*(C`"\*(C'\fR +is equal at the primary level to \f(CW\*(C`"M\*(C'\fRü\f(CW\*(C`SS"\*(C'\fR. +.ie n .IP """$match_ref = $Collator\->match($string, $substring)""" 4 +.el .IP "\f(CW$match_ref = $Collator\->match($string, $substring)\fR" 4 +.IX Item "$match_ref = $Collator->match($string, $substring)" +.PD 0 +.ie n .IP """($match) = $Collator\->match($string, $substring)""" 4 +.el .IP "\f(CW($match) = $Collator\->match($string, $substring)\fR" 4 +.IX Item "($match) = $Collator->match($string, $substring)" +.PD +If \f(CW$substring\fR matches a part of \f(CW$string\fR, in scalar context, returns +\&\fBa reference to\fR the first occurrence of the matching part +(\f(CW$match_ref\fR is always true if it matches, +since every reference is \fBtrue\fR); +in list context, returns the first occurrence of the matching part. +.Sp +If \f(CW$substring\fR does not match any part of \f(CW$string\fR, +returns \f(CW\*(C`undef\*(C'\fR in scalar context and +an empty list in list context. +.Sp +e.g. +.Sp +.Vb 5 +\& if ($match_ref = $Collator\->match($str, $sub)) { # scalar context +\& print "matches [$$match_ref].\en"; +\& } else { +\& print "doesn\*(Aqt match.\en"; +\& } +\& +\& or +\& +\& if (($match) = $Collator\->match($str, $sub)) { # list context +\& print "matches [$match].\en"; +\& } else { +\& print "doesn\*(Aqt match.\en"; +\& } +.Ve +.ie n .IP """@match = $Collator\->gmatch($string, $substring)""" 4 +.el .IP "\f(CW@match = $Collator\->gmatch($string, $substring)\fR" 4 +.IX Item "@match = $Collator->gmatch($string, $substring)" +If \f(CW$substring\fR matches a part of \f(CW$string\fR, returns +all the matching parts (or the number of matches in scalar context). +.Sp +If \f(CW$substring\fR does not match any part of \f(CW$string\fR, +returns an empty list.
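The two calling contexts of gmatch described above can be sketched as follows. This is only an illustration: the sample string and the variable names are made up for the example, and the collator settings (normalization => undef, level => 1) are simply borrowed from the index example, so that case differences are ignored.

```perl
use strict;
use warnings;
use Unicode::Collate;

# normalization => undef is required for the searching methods;
# level => 1 compares primary weights only, so case is ignored.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

# Hypothetical sample text, chosen only for illustration.
my $str = "Perl, PERL and perl";

# List context: every matching part, in order of appearance.
my @parts = $Collator->gmatch($str, "perl");

# Scalar context: the number of matching parts.
my $count = $Collator->gmatch($str, "perl");

print "found $count: @parts\n";
```

Each element of @parts is the text as it appears in $str, not the search string, so the original spelling of every hit is preserved.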
+.ie n .IP """$count = $Collator\->subst($string, $substring, $replacement)""" 4 +.el .IP "\f(CW$count = $Collator\->subst($string, $substring, $replacement)\fR" 4 +.IX Item "$count = $Collator->subst($string, $substring, $replacement)" +If \f(CW$substring\fR matches a part of \f(CW$string\fR, +the first occurrence of the matching part is replaced by \f(CW$replacement\fR +(\f(CW$string\fR is modified) and \f(CW$count\fR (always equal to \f(CW1\fR) is returned. +.Sp +\&\f(CW$replacement\fR can be a \f(CW\*(C`CODEREF\*(C'\fR, +taking the matching part as an argument, +and returning a string to replace the matching part +(a bit similar to \f(CW\*(C`s/(..)/$coderef\->($1)/e\*(C'\fR). +.ie n .IP """$count = $Collator\->gsubst($string, $substring, $replacement)""" 4 +.el .IP "\f(CW$count = $Collator\->gsubst($string, $substring, $replacement)\fR" 4 +.IX Item "$count = $Collator->gsubst($string, $substring, $replacement)" +If \f(CW$substring\fR matches a part of \f(CW$string\fR, +all the occurrences of the matching part are replaced by \f(CW$replacement\fR +(\f(CW$string\fR is modified) and \f(CW$count\fR is returned. +.Sp +\&\f(CW$replacement\fR can be a \f(CW\*(C`CODEREF\*(C'\fR, +taking the matching part as an argument, +and returning a string to replace the matching part +(a bit similar to \f(CW\*(C`s/(..)/$coderef\->($1)/eg\*(C'\fR). +.Sp +e.g. +.Sp +.Vb 4 +\& my $Collator = Unicode::Collate\->new( normalization => undef, level => 1 ); +\& # (normalization => undef) is REQUIRED. +\& my $str = "Camel donkey zebra came\ex{301}l CAMEL horse cam\e0e\e0l..."; +\& $Collator\->gsubst($str, "camel", sub { "<b>$_[0]</b>" }); +\& +\& # now $str is "<b>Camel</b> donkey zebra <b>came\ex{301}l</b> <b>CAMEL</b> horse <b>cam\e0e\e0l</b>..."; +\& # i.e., all the camels are made bold\-faced. +\& +\& Examples: levels and ignore_level2 \- what does camel match?
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& level ignore_level2 | camel Camel came\ex{301}l c\-a\-m\-e\-l cam\e0e\e0l +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& 1 false | yes yes yes yes yes +\& 2 false | yes yes no yes yes +\& 3 false | yes no no yes yes +\& 4 false | yes no no no yes +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& 1 true | yes yes yes yes yes +\& 2 true | yes yes yes yes yes +\& 3 true | yes no yes yes yes +\& 4 true | yes no yes no yes +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& note: if variable => non\-ignorable, camel doesn\*(Aqt match c\-a\-m\-e\-l +\& at any level. +.Ve +.SS "Other Methods" +.IX Subsection "Other Methods" +.ie n .IP """%old_tailoring = $Collator\->change(%new_tailoring)""" 4 +.el .IP "\f(CW%old_tailoring = $Collator\->change(%new_tailoring)\fR" 4 +.IX Item "%old_tailoring = $Collator->change(%new_tailoring)" +.PD 0 +.ie n .IP """$modified_collator = $Collator\->change(%new_tailoring)""" 4 +.el .IP "\f(CW$modified_collator = $Collator\->change(%new_tailoring)\fR" 4 +.IX Item "$modified_collator = $Collator->change(%new_tailoring)" +.PD +Changes the value of specified keys and returns the changed part. +.Sp +.Vb 1 +\& $Collator = Unicode::Collate\->new(level => 4); +\& +\& $Collator\->eq("perl", "PERL"); # false +\& +\& %old = $Collator\->change(level => 2); # returns (level => 4). +\& +\& $Collator\->eq("perl", "PERL"); # true +\& +\& $Collator\->change(%old); # returns (level => 2). 
+\& +\& $Collator\->eq("perl", "PERL"); # false +.Ve +.Sp +Not all \f(CW\*(C`(key,value)\*(C'\fRs are allowed to be changed. +See also \f(CW@Unicode::Collate::ChangeOK\fR and \f(CW@Unicode::Collate::ChangeNG\fR. +.Sp +In scalar context, returns the modified collator +(but it is \fBnot\fR a clone of the original). +.Sp +.Vb 1 +\& $Collator\->change(level => 2)\->eq("perl", "PERL"); # true +\& +\& $Collator\->eq("perl", "PERL"); # true; now max level is 2nd. +\& +\& $Collator\->change(level => 4)\->eq("perl", "PERL"); # false +.Ve +.ie n .IP """$version = $Collator\->version()""" 4 +.el .IP "\f(CW$version = $Collator\->version()\fR" 4 +.IX Item "$version = $Collator->version()" +Returns the version number (a string) of the Unicode Standard +which the \f(CW\*(C`table\*(C'\fR file used by the collator object is based on. +If the table does not include a version line (starting with \f(CW@version\fR), +returns \f(CW"unknown"\fR. +.ie n .IP UCA_Version() 4 +.el .IP \f(CWUCA_Version()\fR 4 +.IX Item "UCA_Version()" +Returns the revision number of UTS #10 this module consults, +which should correspond with the DUCET incorporated. +.ie n .IP Base_Unicode_Version() 4 +.el .IP \f(CWBase_Unicode_Version()\fR 4 +.IX Item "Base_Unicode_Version()" +Returns the version number of the Unicode Standard this module consults, +which should correspond with the DUCET incorporated. +.SH EXPORT +.IX Header "EXPORT" +No method will be exported. +.SH INSTALL +.IX Header "INSTALL" +Though this module can be used without any \f(CW\*(C`table\*(C'\fR file, +to use this module easily, it is recommended to install a table file +in the UCA format, by copying it under the directory +<a place in \f(CW@INC\fR>/Unicode/Collate.
+.PP +The preferred one is "The Default Unicode Collation Element Table" +(aka DUCET), available from the Unicode Consortium's website: +.PP +.Vb 1 +\& http://www.unicode.org/Public/UCA/ +\& +\& http://www.unicode.org/Public/UCA/latest/allkeys.txt +\& (latest version) +.Ve +.PP +If DUCET is not installed, it is recommended to copy the file +from http://www.unicode.org/Public/UCA/latest/allkeys.txt +to <a place in \f(CW@INC\fR>/Unicode/Collate/allkeys.txt +manually. +.SH CAVEATS +.IX Header "CAVEATS" +.IP Normalization 4 +.IX Item "Normalization" +Use of the \f(CW\*(C`normalization\*(C'\fR parameter requires the \fBUnicode::Normalize\fR +module (see Unicode::Normalize). +.Sp +If you do not need it (say, when you need not +handle any combining characters), +assign \f(CW\*(C`(normalization => undef)\*(C'\fR explicitly. +.Sp +\&\-\- see 6.5 Avoiding Normalization, UTS #10. +.IP "Conformance Test" 4 +.IX Item "Conformance Test" +The Conformance Test for the UCA is available +under <http://www.unicode.org/Public/UCA/>. +.Sp +For \fICollationTest_SHIFTED.txt\fR, +a collator via \f(CW\*(C`Unicode::Collate\->new( )\*(C'\fR should be used; +for \fICollationTest_NON_IGNORABLE.txt\fR, a collator via +\&\f(CW\*(C`Unicode::Collate\->new(variable => "non\-ignorable", level => 3)\*(C'\fR. +.Sp +If \f(CW\*(C`UCA_Version\*(C'\fR is 26 or later, the \f(CW\*(C`identical\*(C'\fR level is preferred; +\&\f(CW\*(C`Unicode::Collate\->new(identical => 1)\*(C'\fR and +\&\f(CW\*(C`Unicode::Collate\->new(identical => 1,\*(C'\fR +\&\f(CW\*(C`variable => "non\-ignorable", level => 3)\*(C'\fR should be used. +.Sp +\&\fBUnicode::Normalize is required to try The Conformance Test.\fR +.Sp +\&\fBEBCDIC-SUPPORT IS EXPERIMENTAL.\fR +.SH "AUTHOR, COPYRIGHT AND LICENSE" +.IX Header "AUTHOR, COPYRIGHT AND LICENSE" +The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki, +<SADAHIRO@cpan.org>. This module is Copyright(C) 2001\-2021, +SADAHIRO Tomoyuki. Japan.
All rights reserved. +.PP +This module is free software; you can redistribute it and/or +modify it under the same terms as Perl itself. +.PP +The file Unicode/Collate/allkeys.txt was copied verbatim +from <http://www.unicode.org/Public/UCA/13.0.0/allkeys.txt>. +For this file, Copyright (c) 2020 Unicode, Inc.; distributed +under the Terms of Use in <http://www.unicode.org/terms_of_use.html> +.SH "SEE ALSO" +.IX Header "SEE ALSO" +.IP "Unicode Collation Algorithm \- UTS #10" 4 +.IX Item "Unicode Collation Algorithm - UTS #10" +<http://www.unicode.org/reports/tr10/> +.IP "The Default Unicode Collation Element Table (DUCET)" 4 +.IX Item "The Default Unicode Collation Element Table (DUCET)" +<http://www.unicode.org/Public/UCA/latest/allkeys.txt> +.IP "The conformance test for the UCA" 4 +.IX Item "The conformance test for the UCA" +<http://www.unicode.org/Public/UCA/latest/CollationTest.html> +.Sp +<http://www.unicode.org/Public/UCA/latest/CollationTest.zip> +.IP "Hangul Syllable Type" 4 +.IX Item "Hangul Syllable Type" +<http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt> +.IP "Unicode Normalization Forms \- UAX #15" 4 +.IX Item "Unicode Normalization Forms - UAX #15" +<http://www.unicode.org/reports/tr15/> +.IP "Unicode Locale Data Markup Language (LDML) \- UTS #35" 4 +.IX Item "Unicode Locale Data Markup Language (LDML) - UTS #35" +<http://www.unicode.org/reports/tr35/> |