Diffstat (limited to 'intl/icu/source/test/testdata/break_rules/README.md')
-rw-r--r-- intl/icu/source/test/testdata/break_rules/README.md | 100
1 file changed, 100 insertions, 0 deletions
diff --git a/intl/icu/source/test/testdata/break_rules/README.md b/intl/icu/source/test/testdata/break_rules/README.md
new file mode 100644
index 0000000000..1deb4dfc32
--- /dev/null
+++ b/intl/icu/source/test/testdata/break_rules/README.md
@@ -0,0 +1,100 @@
+<!--
+Copyright (C) 2016 and later: Unicode, Inc. and others.
+License & terms of use: http://www.unicode.org/copyright.html
+
+Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved.
+-->
+
+This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey.
+===========================================
+
+The rules in this directory track the boundary rules from Unicode UAX 14 and 29. They are interpreted
+to provide an expected set of boundary positions to compare with the results from ICU break iteration.
+
+ICU4J also includes copies of the test reference rules, located in the directory
+main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/
+The copies should be kept synchronized; there should be no differences.
+
+Each set of reference break rules lives in a separate file.
+The list of rule files to run by default is hard-coded into the test code, in rbbimonkeytest.cpp.
+
+Each test file includes
+ - The type of ICU break iterator to create (word, line, sentence, etc.)
+ - The locale to use
+ - Character Class definitions
+ - Rule definitions
+
+To Do
+ - Extend the syntax to support rule tailoring.
+
+
+**character class definition**
+
+ name = set_regular_expression;
+
+*Caution:* when referenced, these definitions are textually substituted into the overall rule.
+To avoid unexpected behavior, include [brackets] around the full definition. For example,
+
+ letter_number = [:Letter:][:Number:];
+
+Will compile, but will produce unexpected results.
+
+ letter_number = [[:Letter:][:Number:]];
+
+is safe. The issue is similar to the problems that can occur with the C preprocessor
+when parentheses are omitted around macro parameters.
+
+**rule definition**
+
+ rule_regular_expression;
+
+**name**
+
+ [A-Za-z_][A-Za-z0-9_]*
+
+**set_regular_expression**
+
+A set expression written in the syntax common to ICU regular expression [set] expressions
+and UnicodeSet patterns (the two are mostly the same). May include previously defined set names,
+which are logically expanded in-place.
+
+**rule_regular_expression**
+
+ An ICU Regular Expression.
+ May include set names, which are logically expanded in-place.
+ May include a '÷', which defines a boundary position.
+
+Application of the rules:
+
+Matching begins at the start of text, or after a previously identified boundary.
+The pseudo-code below finds the next boundary.
+
+    while position < end of text
+        for each rule
+            if the text at position matches this rule
+                if the rule has a '÷'
+                    Boundary is found.
+                    return the position of the '÷' within the match.
+                else
+                    position = last character of the rule match.
+                    break from the inner rule loop, continue the outer loop.
+
+This differs from the Unicode UAX algorithm in that each position in the text is
+not tested separately. Instead, when a rule match is found, rule application restarts with the last
+character of the preceding rule match. ICU's break rules also operate this way.
+
+Expressing rules this way simplifies UAX rules that have leading or trailing context; it
+is no longer necessary to write expressions that match the context starting from
+any position within it.
+
+This rule form differs from ICU rules in that the rules are applied sequentially, as they
+are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel.
+
+**Word Dictionaries**
+
+
+The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special,
+as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are
+included in the randomly-generated test data.
+
+
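Reading aid for the "Application of the rules" section of the README added above: a minimal Python sketch of the boundary-finding loop, not the actual implementation in rbbimonkeytest.cpp. It uses Python's `re` module in place of ICU regular expressions, so the demo rules are plain Python patterns without set names. The helper names `compile_rules`, `next_boundary`, and `find_boundaries`, the `brk` group name, and the error raised when no rule matches or no progress is made are all assumptions of this sketch; the pseudo-code does not say what happens in those cases.

```python
import re

BREAK = "÷"  # marks the boundary position inside a rule


def compile_rules(rule_texts):
    """Turn rule strings into (compiled regex, has_break) pairs.

    The '÷' is replaced by an empty named group so that its position
    within a match can be read back with m.start("brk").
    """
    compiled = []
    for text in rule_texts:
        has_break = BREAK in text
        pattern = text.replace(BREAK, "(?P<brk>)", 1)
        compiled.append((re.compile(pattern, re.DOTALL), has_break))
    return compiled


def next_boundary(text, start, rules):
    """Return the next boundary at or after start, following the pseudo-code."""
    position = start
    while position < len(text):
        for regex, has_break in rules:  # rules are tried sequentially
            m = regex.match(text, position)
            if m is None:
                continue
            if has_break:
                # Boundary found: report where the '÷' sits in the match.
                return m.start("brk")
            # No '÷': restart rule application at the last matched character.
            new_position = m.end() - 1
            if new_position <= position:
                # Guard added by this sketch to avoid an endless loop.
                raise ValueError("rule matched without making progress")
            position = new_position
            break
        else:
            # Assumption of this sketch: every position must match some rule.
            raise ValueError(f"no rule matches at position {position}")
    return len(text)


def find_boundaries(text, rules):
    """Collect all boundaries, starting from the beginning of the text."""
    boundaries = []
    position = 0
    while position < len(text):
        boundary = next_boundary(text, position, rules)
        if boundary <= position:
            raise ValueError("boundary did not advance")
        boundaries.append(boundary)
        position = boundary
    return boundaries


# Hypothetical demo rules: keep letter pairs and digit pairs together,
# otherwise break after any single character.
demo_rules = compile_rules(["[A-Za-z][A-Za-z]", "[0-9][0-9]", ".÷"])
print(find_boundaries("ab 12", demo_rules))  # prints [2, 3, 5]
```

In the demo, the first rule matches "ab" at position 0 and hands position 1 (the last matched character) back to the rule loop, where only the catch-all `.÷` rule applies. This is the "restart with the last character of the preceding rule match" behavior described in the README, and it is why context-dependent rules need not match from every position within their context.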