1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
|
# © 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html#License
#
# File: Hira_Kana.txt
# Generated from CLDR
#
# note: a global filter is more efficient, but MUST include all source chars
:: [[\u0000-\u007E 、。 \u3099-゜ ァ-ー 。-゚ー[:Hiragana:] [:Katakana:] [:nonspacing mark:]]-[\u309B \u309C]];
:: NFKC (NFC);
# Hiragana-Katakana
# This is largely a one-to-one mapping, but it has a
# few kinks:
# 1. The Katakana va/vi/ve/vo (30F7-30FA) have no
# Hiragana equivalents. We use Hiragana wa/wi/we/wo
# (308F-3092) with a voicing mark (3099), which is
# semantically equivalent. However, this is a non-
# roundtripping transformation.
# 2. The Katakana small ka/ke (30F5,30F6) have no
# Hiragana equiavlents. We convert them to normal
# Hiragana ka/ke (304B,3051). This is a one-way
# information-losing transformation and precludes
# round-tripping of 30F5 and 30F6.
# 3. The combining marks 3099-309C are in the Hiragana
# block, but they apply to Katakana as well, so we
# leave them untouched.
# 4. The Katakana prolonged sound mark 30FC doubles the
# preceding vowel. This is a one-way information-
# losing transformation from Katakana to Hiragana.
# 5. The Katakana middle dot separates words in foreign
# expressions; we leave this unmodified.
# The above points preclude successful round-trip
# transformations of arbitrary input text. However,
# they provide naturalistic results that should conform
# to user expectations.
# Combining equivalents va/vi/ve/vo
わ\u3099 ↔ ヷ;
ゐ\u3099 ↔ ヸ;
ゑ\u3099 ↔ ヹ;
を\u3099 ↔ ヺ;
# One-to-one mappings, main block
# 3041:3094 ↔ 30A1:30F4
# 309D,E ↔ 30FD,E
ぁ ↔ ァ;
あ ↔ ア;
ぃ ↔ ィ;
い ↔ イ;
ぅ ↔ ゥ;
う ↔ ウ;
ぇ ↔ ェ;
え ↔ エ;
ぉ ↔ ォ;
お ↔ オ;
か ↔ カ;
が ↔ ガ;
き ↔ キ;
ぎ ↔ ギ;
く ↔ ク;
ぐ ↔ グ;
け ↔ ケ;
げ ↔ ゲ;
こ ↔ コ;
ご ↔ ゴ;
さ ↔ サ;
ざ ↔ ザ;
し ↔ シ;
じ ↔ ジ;
す ↔ ス;
ず ↔ ズ;
せ ↔ セ;
ぜ ↔ ゼ;
そ ↔ ソ;
ぞ ↔ ゾ;
た ↔ タ;
だ ↔ ダ;
ち ↔ チ;
ぢ ↔ ヂ;
っ ↔ ッ;
つ ↔ ツ;
づ ↔ ヅ;
て ↔ テ;
で ↔ デ;
と ↔ ト;
ど ↔ ド;
な ↔ ナ;
に ↔ ニ;
ぬ ↔ ヌ;
ね ↔ ネ;
の ↔ ノ;
は ↔ ハ;
ば ↔ バ;
ぱ ↔ パ;
ひ ↔ ヒ;
び ↔ ビ;
ぴ ↔ ピ;
ふ ↔ フ;
ぶ ↔ ブ;
ぷ ↔ プ;
へ ↔ ヘ;
べ ↔ ベ;
ぺ ↔ ペ;
ほ ↔ ホ;
ぼ ↔ ボ;
ぽ ↔ ポ;
ま ↔ マ;
み ↔ ミ;
む ↔ ム;
め ↔ メ;
も ↔ モ;
ゃ ↔ ャ;
や ↔ ヤ;
ゅ ↔ ュ;
ゆ ↔ ユ;
ょ ↔ ョ;
よ ↔ ヨ;
ら ↔ ラ;
り ↔ リ;
る ↔ ル;
れ ↔ レ;
ろ ↔ ロ;
ゎ ↔ ヮ;
わ ↔ ワ;
ゐ ↔ ヰ;
ゑ ↔ ヱ;
を ↔ ヲ;
ん ↔ ン;
ゔ ↔ ヴ;
ゝ ↔ ヽ;
ゞ ↔ ヾ;
# One-way Katakana-Hiragana xform of small K ka/ke to
# normal H ka/ke.
か ← ヵ;
け ← ヶ;
# Katakana followed by a prolonged sound mark 30FC has
# its final vowel doubled. This is a Katakana-Hiragana
# one-way information-losing transformation. We
# include the small Katakana (e.g., small A 3041) and
# do not distinguish them from their large
# counterparts. It doesn't make sense to double a
# small counterpart vowel as a small Hiragana vowel, so
# we don't do so. In natural text this should never
# occur anyway. If a 30FC is seen without a preceding
# vowel sound (e.g., after n 30F3) we do not change it.
### $long = ー;
# The following categories are Hiragana, not Katakana
# as might be expected, since by the time we get to the
# 30FC, the preceding character will have already been
# transformed to Hiragana.
# {The following mechanically generated from the
# Unicode 3.0 data:}
$xa = [ \
ぁ あ か が さ ざ \
た だ な は ば ぱ \
ま ゃ や ら ゎ わ \
];
$xi = [ \
ぃ い き ぎ し じ \
ち ぢ に ひ び ぴ \
み り ゐ \
];
$xu = [ \
ぅ う く ぐ す ず \
っ つ づ ぬ ふ ぶ \
ぷ む ゅ ゆ る ゔ \
];
$xe = [ \
ぇ え け げ せ ぜ \
て で ね へ べ ぺ \
め れ ゑ \
];
$xo = [ \
ぉ お こ ご そ ぞ \
と ど の ほ ぼ ぽ \
も ょ よ ろ を \
];
あ ← $xa {ー};
い ← $xi {ー};
う ← $xu {ー};
え ← $xe {ー};
お ← $xo {ー};
:: NFC (NFKC) ;
# note: a global filter is more efficient, but MUST include all source chars!!
:: ([[\u0000-\u007E 、。 \u3099-゜ ァ-ー 。-゚ー[:Hiragana:] [:Katakana:] [:nonspacing mark:]]-[\u309B \u309C]]);
# eof
|