Diffstat (limited to '')
-rw-r--r-- | intl/icu/source/data/unidata/changes.txt | 5547 |
1 files changed, 5547 insertions, 0 deletions
diff --git a/intl/icu/source/data/unidata/changes.txt b/intl/icu/source/data/unidata/changes.txt
new file mode 100644
index 0000000000..9345f26bf5
--- /dev/null
+++ b/intl/icu/source/data/unidata/changes.txt
@@ -0,0 +1,5547 @@
+* Copyright (C) 2016 and later: Unicode, Inc. and others.
+* License & terms of use: http://www.unicode.org/copyright.html
+* Copyright (C) 2004-2016, International Business Machines
+* Corporation and others. All Rights Reserved.
+*
+* file name: changes.txt
+* encoding: US-ASCII
+* tab size: 8 (not used)
+* indentation:4
+*
+* created on: 2004may06
+* created by: Markus W. Scherer
+
+* change log for Unicode updates
+
+For an overview, see https://unicode-org.github.io/icu/processes/unicode-update
+
+Notes:
+
+This log includes several command lines as used in the update process.
+Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
+Use a console window that is set to that directory, or cd to there,
+and then paste the command that follows the $ sign.
+
+Most command lines use environment variables to make them more portable across versions
+and machine configurations. When you set up a console window, copy & paste the `export` commands
+from near the top of the current section before pasting tool command lines.
+Adjust the environment variables to the current version and your machine setup.
+(The command lines are currently as used on Linux.)
+
+---------------------------------------------------------------------------- ***
+
+* New ISO 15924 script codes
+
+Normally, add new script codes as part of a Unicode update.
+See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
+and see the change logs below.
+
+---------------------------------------------------------------------------- ***
+
+CLDR 43 root collation update for ICU 73
+
+Partial update only for the root collation.
+See
+- https://unicode-org.atlassian.net/browse/CLDR-15946
+ Treat quote marks as equivalent when strength=UCOL_PRIMARY
+- https://github.com/unicode-org/cldr/pull/2691
+ CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
+- https://github.com/unicode-org/cldr/pull/2833
+ CLDR-15946 make fancy quotes secondary-different from each other
+
+The related changes to tailorings were already integrated in an earlier PR for
+https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.
+
+This update is for the root collation,
+which is handled by different tools than the locale data updates.
+
+* Command-line environment setup
+
+export UNICODE_DATA=~/unidata/uni15/20220830
+export CLDR_SRC=~/cldr/uni/src
+export ICU_ROOT=~/icu/uni
+export ICU_SRC=$ICU_ROOT/src
+export ICUDT=icudt73b
+export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Configure: Build Unicode data for ICU4J
+ cd $ICU_ROOT/dbg/icu4c
+ ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
+
+* Bazel build process
+
+See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
+for an overview and for setup instructions.
+
+Consider running `bazelisk --version` outside of the $ICU_SRC folder
+to find out the latest `bazel` version, and
+copying that version number into the $ICU_SRC/.bazeliskrc config file.
+(Revert if you find incompatibilities, or, better, update our build & config files.)
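Editor's sketch (not part of the patched file): the Bazel version step above could be scripted roughly as follows. The two function names are hypothetical. The sketch assumes that `bazelisk --version` prints a line of the form "bazel X.Y.Z" and that the .bazeliskrc file pins the version with a USE_BAZEL_VERSION= line; check both against your setup before relying on it.

```shell
# Hypothetical helper: extract the leading numeric version "X.Y.Z"
# from a line such as "bazel 6.0.0" (assumed output format).
bazel_version_from_line() {
    printf '%s\n' "$1" | sed -n 's/^bazel \([0-9][0-9.]*\).*$/\1/p'
}

# Hypothetical helper: set USE_BAZEL_VERSION in a .bazeliskrc-style file,
# dropping any existing entry and appending the new one.
set_bazelisk_version() {
    rc_file=$1
    version=$2
    tmp_file=$(mktemp) || return 1
    if [ -f "$rc_file" ]; then
        # grep -v exits non-zero when nothing is left; that is not an error here.
        grep -v '^USE_BAZEL_VERSION=' "$rc_file" > "$tmp_file" || true
    fi
    printf 'USE_BAZEL_VERSION=%s\n' "$version" >> "$tmp_file"
    mv "$tmp_file" "$rc_file"
}

# Possible usage, run from outside $ICU_SRC as the text suggests:
#   line=$(cd /tmp && bazelisk --version)
#   set_bazelisk_version "$ICU_SRC/.bazeliskrc" "$(bazel_version_from_line "$line")"
```

Keeping the parsing and the file update separate makes it easy to inspect the detected version before touching the config file, and to revert by calling the second helper with the previous version.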
+
+* generate data files
+
+- remember to define the environment variables
+ (see the start of the section for this Unicode version)
+- cd $ICU_SRC
+- optional but not necessary:
+ bazelisk clean
+ or even
+ bazelisk clean --expunge
+- build/bootstrap/generate new files:
+ icu4c/source/data/unidata/generate.sh
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools,
+ and a tool-tailored version goes into CLDR, see
+ https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- generate data files, as above (generate.sh), now to pick up new collation data
+- rebuild ICU4C (make clean, make check, as usual)
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
+ you need to reconfigure with unicore data; see the "configure" line above.
+ output:
+ ...
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+- new for ICU 73: also copy the binary data files directly into the ICU4J tree
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** merge the Unicode update branch back onto the main branch
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- if there is a merge conflict in icudata.jar, here is one way to deal with it:
+ + remove icudata.jar from the commit so that rebasing is trivial
+ + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
+ + ~/icu/uni/src$ git commit -a --amend
+ + switch to main, pull updates, switch back to the dev branch
+ + ~/icu/uni/src$ git rebase main
+ + rebuild icudata.jar
+ + ~/icu/uni/src$ git commit -a --amend
+ + ~/icu/uni/src$ git push -f
+- make sure that changes to Unicode tools are checked in:
+ https://github.com/unicode-org/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Unicode 15.0 update for ICU 72
+
+https://www.unicode.org/versions/Unicode15.0.0/
+https://www.unicode.org/versions/beta-15.0.0.html
+https://www.unicode.org/Public/15.0.0/ucd/
+https://www.unicode.org/reports/uax-proposed-updates.html
+https://www.unicode.org/reports/tr44/tr44-29.html
+
+https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
+https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
+https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)
+
+* Command-line environment setup
+
+export UNICODE_DATA=~/unidata/uni15/20220830
+export CLDR_SRC=~/cldr/uni/src
+export ICU_ROOT=~/icu/uni
+export ICU_SRC=$ICU_ROOT/src
+export ICUDT=icudt72b
+export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+ cd $ICU_ROOT/dbg/icu4c + ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh + +*** data files & enums & parser code + +* download files +- same as for the early Unicode Tools setup and data refresh: + https://github.com/unicode-org/unicodetools/blob/main/docs/index.md + https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md +- mkdir -p $UNICODE_DATA +- download Unicode files into $UNICODE_DATA + + subfolders: emoji, idna, security, ucd, uca + + old way of fetching files: from the "Public" area on unicode.org + ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip + ~ split Unihan into single-property files + ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan + + new way of fetching files, if available: + copy the files from a Unicode Tools workspace that is up to date with + https://github.com/unicode-org/unicodetools + and which might at this point be *ahead* of "Public" + ~ before the Unicode release copy files from "dev" subfolders, for example + https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev + + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt + or from the UCD/cldr/ output folder of the Unicode Tools: + Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules. 
+ cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata + or + cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt + +* for manual diffs and for Unicode Tools input data updates: + remove version suffixes from the file names + ~$ unidata/desuffixucd.py $UNICODE_DATA + (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md) + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA + +* new constants for new property values +- preparseucd.py error: + ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})] + = PropertyValueAliases.txt new property values (diff old & new .txt files) + ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]' + +age; 15.0 ; V15_0 + +blk; Arabic_Ext_C ; Arabic_Extended_C + +blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H + +blk; Cyrillic_Ext_D ; Cyrillic_Extended_D + +blk; Devanagari_Ext_A ; Devanagari_Extended_A + +blk; Kaktovik_Numerals ; Kaktovik_Numerals + +blk; Kawi ; Kawi + +blk; Nag_Mundari ; Nag_Mundari + +sc ; Kawi ; Kawi + +sc ; Nagm ; Nag_Mundari + -> add new blocks to uchar.h before UBLOCK_COUNT + use long property names for enum constants, + for the trailing comment get the block start code point: diff old & new Blocks.txt + ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep 
'^[-+][0-9A-Z]' + +10EC0..10EFF; Arabic Extended-C + +11B00..11B5F; Devanagari Extended-A + +11F00..11F5F; Kawi + -13430..1343F; Egyptian Hieroglyph Format Controls + +13430..1345F; Egyptian Hieroglyph Format Controls + +1D2C0..1D2DF; Kaktovik Numerals + +1E030..1E08F; Cyrillic Extended-D + +1E4D0..1E4FF; Nag Mundari + +31350..323AF; CJK Unified Ideographs Extension H + (ignore blocks whose end code point changed) + -> add new blocks to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add new blocks to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 + -> add new scripts to uscript.h & com.ibm.icu.lang.UScript + Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+) + replace public static final int \1 = \2; \3 + -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt + +* build ICU + to make sure that there are no syntax errors + + $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date + +* update spoof checker UnicodeSet initializers: + inclusionPat & recommendedPat in i18n/uspoof.cpp + INCLUSION & RECOMMENDED in SpoofChecker.java +- make sure that the Unicode Tools tree contains the latest security data files +- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator +- run the tool (no special environment variables needed) +- copy & paste from the Console output into the .cpp & .java files + +* Bazel build process + +See 
https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process +for an overview and for setup instructions. + +Consider running `bazelisk --version` outside of the $ICU_SRC folder +to find out the latest `bazel` version, and +copying that version number into the $ICU_SRC/.bazeliskrc config file. +(Revert if you find incompatibilities, or, better, update our build & config files.) + +* generate data files + +- remember to define the environment variables + (see the start of the section for this Unicode version) +- cd $ICU_SRC +- optional but not necessary: + bazelisk clean +- build/bootstrap/generate new files: + icu4c/source/data/unidata/generate.sh + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters + ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt +- Unicode 6.0..15.0: U+2260, U+226E, U+226F +- nothing new in this Unicode version, no test file to update + +* run & fix ICU4C tests +- Note: Some of the collation data and test data will be updated below, + so at this time we might get some collation test failures. + Ignore these for now. 
+- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files + (no rule changes in Unicode 15) +- update CLDR GraphemeBreakTest.txt + cd ~/unitools/mine/Generated + cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt + cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html + cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata +- Andy helps with RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, + and a tool-tailored version goes into CLDR, see + https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md + +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + 
and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt + +- generate data files, as above (generate.sh), now to pick up new collation data +- update CollationFCD.java: + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java +- rebuild ICU4C (make clean, make check, as usual) + +* Unihan collators + https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md +- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles, + check CLDR diffs, copy to CLDR, test CLDR, ... as documented there +- generate ICU zh collation data + instructions inspired by + https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and + https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt + + setup: + export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 + (didn't work without setting JAVA_HOME, + nor with the Google default of /usr/local/buildtools/java/jdk + [Google security limitations in the XML parser]) + export TOOLS_ROOT=~/icu/uni/src/tools + export CLDR_DIR=~/cldr/uni/src + export CLDR_DATA_DIR=~/cldr/uni/src + (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files) + cd "$TOOLS_ROOT/cldr/lib" + ./install-cldr-jars.sh "$CLDR_DIR" + + generate the files we need + cd "$TOOLS_ROOT/cldr/cldr-to-icu" + ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*' + + diff + cd $ICU_SRC + meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt + meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt + + copy into the source tree + cd $ICU_SRC + cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt + cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the 
collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'", + you need to reconfigure with unicore data; see the "configure" line above. + output: + ... + make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data 
files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data +or +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC/icu4c/source/data/unidata + cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) 
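Editor's sketch (not part of the patched file): the "refresh Java test .txt files" step earlier in this section copies a fixed list of files with `cp -v`. If one of them is missing, a plain `cp` still copies the others and leaves a partial refresh behind. The hypothetical helper below, `copy_all_or_none`, checks that every named file exists before copying any of them. It is an illustration only and not something the ICU process itself prescribes.

```shell
# Hypothetical helper: copy the named files from src_dir to dest_dir,
# but only after confirming that every one of them exists.
copy_all_or_none() {
    src_dir=$1
    dest_dir=$2
    shift 2
    # First pass: report the first missing file and stop before copying anything.
    for name in "$@"; do
        if [ ! -f "$src_dir/$name" ]; then
            echo "missing: $src_dir/$name" >&2
            return 1
        fi
    done
    # Second pass: all files are present, so copy them.
    for name in "$@"; do
        cp -v "$src_dir/$name" "$dest_dir/" || return 1
    done
}

# Possible usage with paths taken from this log:
#   copy_all_or_none "$ICU4C_UNIDATA" \
#       "$ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode" \
#       confusables.txt SpecialCasing.txt UnicodeData.txt
```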
+ +*** CLDR numbering systems +- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR + for example: + ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt + ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt + ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt + --> + +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS + +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS + or: + ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+' + --> + +11F50..11F59 ; Nd # [10] KAWI DIGIT ZERO..KAWI DIGIT NINE + +1E4F0..1E4F9 ; Nd # [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE + Unicode 15: + kawi 11F50..11F59 Kawi + nagm 1E4F0..1E4F9 Nag Mundari + https://github.com/unicode-org/cldr/pull/2041 + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- if there is a merge conflict in icudata.jar, here is one way to deal with it: + + remove icudata.jar from the commit so that rebasing is trivial + + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar + + ~/icu/uni/src$ git commit -a --amend + + switch to main, pull updates, switch back to the dev branch + + ~/icu/uni/src$ git rebase main + + rebuild icudata.jar + + ~/icu/uni/src$ git commit -a --amend + + ~/icu/uni/src$ git push -f +- make sure that changes to Unicode tools are checked in: + https://github.com/unicode-org/unicodetools + +---------------------------------------------------------------------------- *** + +Unicode 14.0 update for ICU 70 + +https://www.unicode.org/versions/Unicode14.0.0/ +https://www.unicode.org/versions/beta-14.0.0.html 
+https://www.unicode.org/Public/14.0.0/ucd/ +https://www.unicode.org/reports/uax-proposed-updates.html +https://www.unicode.org/reports/tr44/tr44-27.html + +https://unicode-org.atlassian.net/browse/CLDR-14801 +https://unicode-org.atlassian.net/browse/ICU-21635 + +* Command-line environment setup + +export UNICODE_DATA=~/unidata/uni14/20210903 +export CLDR_SRC=~/cldr/uni/src +export ICU_ROOT=~/icu/uni +export ICU_SRC=$ICU_ROOT/src +export ICUDT=icudt70b +export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in +export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + cd $ICU_ROOT/dbg/icu4c + ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh + +*** data files & enums & parser code + +* download files +- same as for the early Unicode Tools setup and data refresh: + https://github.com/unicode-org/unicodetools/blob/main/docs/index.md + https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md +- mkdir -p $UNICODE_DATA +- download Unicode files into $UNICODE_DATA + + subfolders: emoji, idna, security, ucd, uca + + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip + + split Unihan into single-property files + ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan + + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt + or from the UCD/cldr/ output folder of the Unicode Tools: + Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules. 
+ cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata + or + cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt + +* for manual diffs and for Unicode Tools input data updates: + remove version suffixes from the file names + ~$ unidata/desuffixucd.py $UNICODE_DATA + (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md) + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA + +* new constants for new property values +- preparseucd.py error: + ValueError: missing uchar.h enum constants for some property values: + [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])), + (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])), + (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))] + = PropertyValueAliases.txt new property values (diff old & new .txt files) + ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]' + +age; 14.0 ; V14_0 + +blk; Arabic_Ext_B ; Arabic_Extended_B + +blk; Cypro_Minoan ; Cypro_Minoan + +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B + +blk; Kana_Ext_B ; Kana_Extended_B + +blk; Latin_Ext_F ; Latin_Extended_F + +blk; Latin_Ext_G ; Latin_Extended_G + +blk; Old_Uyghur ; Old_Uyghur + +blk; Tangsa ; Tangsa + +blk; Toto ; Toto + +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A + +blk; Vithkuqi ; Vithkuqi + +blk; Znamenny_Music 
; Znamenny_Musical_Notation + +jg ; Thin_Yeh ; Thin_Yeh + +jg ; Vertical_Tail ; Vertical_Tail + +sc ; Cpmn ; Cypro_Minoan + +sc ; Ougr ; Old_Uyghur + +sc ; Tnsa ; Tangsa + +sc ; Toto ; Toto + +sc ; Vith ; Vithkuqi + -> add new blocks to uchar.h before UBLOCK_COUNT + use long property names for enum constants, + for the trailing comment get the block start code point: diff old & new Blocks.txt + ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]' + +0870..089F; Arabic Extended-B + +10570..105BF; Vithkuqi + +10780..107BF; Latin Extended-F + +10F70..10FAF; Old Uyghur + -11700..1173F; Ahom + +11700..1174F; Ahom + +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A + +12F90..12FFF; Cypro-Minoan + +16A70..16ACF; Tangsa + -18D00..18D8F; Tangut Supplement + +18D00..18D7F; Tangut Supplement + +1AFF0..1AFFF; Kana Extended-B + +1CF00..1CFCF; Znamenny Musical Notation + +1DF00..1DFFF; Latin Extended-G + +1E290..1E2BF; Toto + +1E7E0..1E7FF; Ethiopic Extended-B + (ignore blocks whose end code point changed) + -> add new blocks to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add new blocks to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 + -> add new scripts to uscript.h & com.ibm.icu.lang.UScript + Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+) + replace public static final int \1 = \2; \3 + -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + -> add new joining groups to uchar.h & UCharacter.JoiningGroup + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py 
$ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt + +* build ICU + to make sure that there are no syntax errors + + $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date + +* update spoof checker UnicodeSet initializers: + inclusionPat & recommendedPat in i18n/uspoof.cpp + INCLUSION & RECOMMENDED in SpoofChecker.java +- make sure that the Unicode Tools tree contains the latest security data files +- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator +- run the tool (no special environment variables needed) +- copy & paste from the Console output into the .cpp & .java files + +* Bazel build process + +See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process +for an overview and for setup instructions. + +Consider running `bazelisk --version` outside of the $ICU_SRC folder +to find out the latest `bazel` version, and +copying that version number into the $ICU_SRC/.bazeliskrc config file. +(Revert if you find incompatibilities, or, better, update our build & config files.) 
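A minimal sketch of the version check described above. It assumes that `bazelisk --version` prints a line of the form `bazel X.Y.Z` and that the .bazeliskrc file pins the version with a `USE_BAZEL_VERSION=` line. Both helper names are hypothetical and not part of the ICU tools; check your actual output and config file before relying on this.

```shell
# Hypothetical helpers; not part of the ICU tools.

# Extract "X.Y.Z" from a line such as "bazel 6.0.0".
# Prints nothing if the line does not have the assumed form.
bazel_version_from_output() {
    printf '%s\n' "$1" | sed -n 's/^bazel \([0-9][0-9.]*\).*$/\1/p'
}

# Print the config line that would pin the given version (assumed .bazeliskrc syntax).
bazeliskrc_line() {
    printf 'USE_BAZEL_VERSION=%s\n' "$1"
}
```

Possible usage, run from outside the $ICU_SRC folder, then compare the printed line with $ICU_SRC/.bazeliskrc by hand:
  version=$(bazel_version_from_output "$(bazelisk --version)")
  bazeliskrc_line "$version"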
+ +* generate data files + +- remember to define the environment variables + (see the start of the section for this Unicode version) +- cd $ICU_SRC +- optional but not necessary: + bazelisk clean +- build/bootstrap/generate new files: + icu4c/source/data/unidata/generate.sh + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..14.0: U+2260, U+226E, U+226F +- nothing new in this Unicode version, no test file to update + +* run & fix ICU4C tests +- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files +- update CLDR GraphemeBreakTest.txt + cd ~/unitools/mine/Generated + cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt + cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html + cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata +- Andy helps with RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, + and a tool-tailored version goes into CLDR, see + https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md + +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and 
(ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt + +- generate data files, as above (generate.sh), now to pick up new collation data +- update CollationFCD.java: + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java +- rebuild ICU4C (make clean, make check, as usual) + +* Unihan collators + https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md +- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles, + check CLDR diffs, copy to CLDR, test CLDR, ... 
as documented there +- generate ICU zh collation data + instructions inspired by + https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and + https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt + + setup: + export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 + (didn't work without setting JAVA_HOME, + nor with the Google default of /usr/local/buildtools/java/jdk + [Google security limitations in the XML parser]) + export TOOLS_ROOT=~/icu/uni/src/tools + export CLDR_DIR=~/cldr/uni/src + export CLDR_DATA_DIR=~/cldr/uni/src + (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files) + cd "$TOOLS_ROOT/cldr/lib" + ./install-cldr-jars.sh "$CLDR_DIR" + + generate the files we need + cd "$TOOLS_ROOT/cldr/cldr-to-icu" + ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*' + + diff + cd $ICU_SRC + meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt + meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt + + copy into the source tree + cd $ICU_SRC + cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt + cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + NOTE: If you get the error "No rule to make target 
'out/build/icudt70l/uprops.icu'", + you need to reconfigure with unicore data; see the "configure" line above. + output: + ... + make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp -v 
com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data +or +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC/icu4c/source/data/unidata + cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) 
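For the "refresh Java test .txt files" step above, a quick way to confirm that the ICU4J copies are byte-identical to the ICU4C originals could look like the following sketch. The function name `check_copies` is hypothetical and not part of the ICU tools; it only wraps the standard `cmp` utility.

```shell
# Hypothetical helper; not part of the ICU tools.
# Usage: check_copies SRC_DIR DEST_DIR FILE...
# Prints each file that is missing from DEST_DIR or differs from SRC_DIR,
# and returns non-zero if there was at least one such file.
check_copies() {
    src=$1
    dest=$2
    shift 2
    status=0
    for name in "$@"; do
        if ! cmp -s "$src/$name" "$dest/$name" 2>/dev/null; then
            echo "stale or missing: $name"
            status=1
        fi
    done
    return $status
}
```

Possible usage after the cp commands:
  check_copies $ICU_SRC/icu4c/source/test/testdata $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt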
+ +*** CLDR numbering systems +- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR + for example: + ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt + ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt + ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt + --> + +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS + Unicode 14: + tnsa 16AC0..16AC9 Tangsa + https://github.com/unicode-org/cldr/pull/1326 + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools are checked in: + https://github.com/unicode-org/unicodetools + +---------------------------------------------------------------------------- *** + +Unicode 13.0 update for ICU 66 + +https://www.unicode.org/versions/Unicode13.0.0/ +https://www.unicode.org/versions/beta-13.0.0.html +https://www.unicode.org/Public/13.0.0/ucd/ +https://www.unicode.org/reports/uax-proposed-updates.html +https://www.unicode.org/reports/tr44/tr44-25.html + +https://unicode-org.atlassian.net/browse/CLDR-13387 +https://unicode-org.atlassian.net/browse/ICU-20893 + +* Command-line environment setup + +UNICODE_DATA=~/unidata/uni13/20200212 +CLDR_SRC=~/cldr/uni/src +ICU_ROOT=~/icu/uni +ICU_SRC=$ICU_ROOT/src +ICUDT=icudt66b +ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in +ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number.
+ cd $ICU_ROOT/dbg/icu4c + ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh + +*** data files & enums & parser code + +* download files +- mkdir -p $UNICODE_DATA +- download Unicode files into $UNICODE_DATA + + subfolders: emoji, idna, security, ucd, uca + + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip + + split Unihan into single-property files + ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan + + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt + or from the ucd/cldr/ output folder of the Unicode Tools: + Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules. + cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata + +* for manual diffs and for Unicode Tools input data updates: + remove version suffixes from the file names + ~$ unidata/desuffixucd.py $UNICODE_DATA + (see https://sites.google.com/site/unicodetools/inputdata) + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 
+ + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA + +* new constants for new property values +- preparseucd.py error: + ValueError: missing uchar.h enum constants for some property values: + [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi', + u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])), + (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])), + (u'InPC', set([u'Top_And_Bottom_And_Left']))] + = PropertyValueAliases.txt new property values (diff old & new .txt files) + blk; Chorasmian ; Chorasmian + blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G + blk; Dives_Akuru ; Dives_Akuru + blk; Khitan_Small_Script ; Khitan_Small_Script + blk; Lisu_Sup ; Lisu_Supplement + blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing + blk; Tangut_Sup ; Tangut_Supplement + blk; Yezidi ; Yezidi + -> add to uchar.h before UBLOCK_COUNT + use long property names for enum constants, + for the trailing comment get the block start code point: diff old & new Blocks.txt + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 + + sc ; Chrs ; Chorasmian + sc ; Diak ; Dives_Akuru + sc ; Kits ; Khitan_Small_Script + sc ; Yezi ; Yezidi + -> uscript.h & com.ibm.icu.lang.UScript + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + + InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left + -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & 
UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt + +* build ICU (make install) + to make sure that there are no syntax errors, and + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date + +* update spoof checker UnicodeSet initializers: + inclusionPat & recommendedPat in i18n/uspoof.cpp + INCLUSION & RECOMMENDED in SpoofChecker.java +- make sure that the Unicode Tools tree contains the latest security data files +- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator +- update the hardcoded version number there in the DIRECTORY path +- run the tool (no special environment variables needed) +- copy & paste from the Console output into the .cpp & .java files + +* generate normalization data files + cd $ICU_ROOT/dbg/icu4c + bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date + +* build Unicode tools using CMake+make + +$ICU_SRC/tools/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) +# Location of the ICU4C source tree. 
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) + + $ICU_ROOT/dbg$ + mkdir -p tools/unicode/c + cd tools/unicode/c + + $ICU_ROOT/dbg/tools/unicode/c$ + cmake ../../../../src/tools/unicode/c + make + +* generate core properties data files + $ICU_ROOT/dbg/tools/unicode/c$ + genprops/genprops $ICU_SRC/icu4c +- tool failure: + genprops: Script_Extensions indexes overflow bit field + genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR + -> uprops.icu data file format : + add two more bits to store a script code or Script_Extensions index + -> generator code, C++ & Java runtime, uprops.icu format version 7.7 +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..13.0: U+2260, U+226E, U+226F +- nothing new in this Unicode version, no test file to update + +* run & fix ICU4C tests +- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files +- Andy helps with RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA + diff the main mapping file, look for bad changes + (for example, more bytes per weight for common characters) + ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt + ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt + +- CLDR root data files are checked into $CLDR_SRC/common/uca/ + cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ + +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp -v 
$CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt + +- run genuca + $ICU_ROOT/dbg/tools/unicode/c$ + genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ + genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c +- rebuild ICU4C + +* Unihan collators + https://sites.google.com/site/unicodetools/unihan +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollators + with VM arguments + -ea + -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk + -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools + -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data + -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src + -DUVERSION=13.0.0 +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollatorFiles + with the same arguments +- check CLDR diffs + cd $CLDR_SRC + meld 
common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml + meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml +- copy to CLDR + cd $CLDR_SRC + cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml + cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml +- run CLDR unit tests, commit to CLDR +- generate ICU zh collation data: run CLDR + org.unicode.cldr.icu.NewLdml2IcuConverter + with program arguments + -t collation + -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation + -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental + -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll + -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation + zh + and VM arguments + -ea + -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar 
-C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data +or +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC/icu4c/source/data/unidata + cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) 
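The $ICUDT value used in the Java data steps above combines the ICU major version with an endianness letter: the build output shown there reads icudt66l (little-endian) and writes icudt66b (big-endian) for ICU4J. A small sketch of that naming pattern follows; the function name `icudt_name` is hypothetical and not part of the ICU tools.

```shell
# Hypothetical helper; not part of the ICU tools.
# Usage: icudt_name MAJOR_VERSION [l|b]
# Prints the data package name, for example "icudt66b".
# The suffix defaults to "b" (big-endian), which is what the ICU4J steps use.
icudt_name() {
    suffix=${2:-b}
    case $suffix in
        l|b) printf 'icudt%s%s\n' "$1" "$suffix" ;;
        *)   echo "unsupported suffix: $suffix" >&2; return 1 ;;
    esac
}
```

Possible usage when setting up the environment for a given version:
  ICUDT=$(icudt_name 66)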
+ +*** CLDR numbering systems +- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR + for example, look for + ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt + in new blocks (Blocks.txt) + Unicode 13: + diak 11950..11959 Dives_Akuru + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools are checked in: + http://www.unicode.org/utility/trac/log/trunk/unicodetools + +---------------------------------------------------------------------------- *** + +Unicode 12.1 update for ICU 64.2 + +** This is an abbreviated update with one new character for the new +** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA +https://en.wikipedia.org/wiki/Reiwa_period + +http://www.unicode.org/versions/Unicode12.1.0/ + +ICU-20497 Unicode 12.1 + +cldrbug 11978: Unicode 12.1 + +* Command-line environment setup + +UNICODE_DATA=~/unidata/uni121/20190403 +CLDR_SRC=~/svn.cldr/uni +ICU_ROOT=~/icu/uni +ICU_SRC=$ICU_ROOT/src +ICUDT=icudt64b +ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in +ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number.
+ cd $ICU_ROOT/dbg/icu4c + ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh + +*** data files & enums & parser code + +* download files +- mkdir -p $UNICODE_DATA +- download Unicode files into $UNICODE_DATA + + subfolders: emoji, idna, security, ucd, uca + + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip + +* for manual diffs and for Unicode Tools input data updates: + remove version suffixes from the file names + ~$ unidata/desuffixucd.py $UNICODE_DATA + (see https://sites.google.com/site/unicodetools/inputdata) + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. 
+ + $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date + +* update spoof checker UnicodeSet initializers: + inclusionPat & recommendedPat in uspoof.cpp + INCLUSION & RECOMMENDED in SpoofChecker.java +- make sure that the Unicode Tools tree contains the latest security data files +- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator +- update the hardcoded version number there in the DIRECTORY path +- run the tool (no special environment variables needed) +- copy & paste from the Console output into the .cpp & .java files + +* generate normalization data files + cd $ICU_ROOT/dbg/icu4c + bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date + +* build Unicode tools using CMake+make + +$ICU_SRC/tools/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) +# Location of the ICU4C source tree. 
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) + + $ICU_ROOT/dbg$ + mkdir -p tools/unicode/c + cd tools/unicode/c + + $ICU_ROOT/dbg/tools/unicode/c$ + cmake ../../../../src/tools/unicode/c + make + +* generate core properties data files + $ICU_ROOT/dbg/tools/unicode/c$ + genprops/genprops $ICU_SRC/icu4c + genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ + genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..12.1: U+2260, U+226E, U+226F +- nothing new in this Unicode version, no test file to update + +* run & fix ICU4C tests +- Andy handles RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA + diff the main mapping file, look for bad changes + (for example, more bytes per weight for common characters) + ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt + ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt + +- CLDR root data files are checked into $CLDR_SRC/common/uca/ + cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ + +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt +- 
restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt + +- run genuca, see command line above +- rebuild ICU4C + +* Unihan collators + https://sites.google.com/site/unicodetools/unihan +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollators + with VM arguments + -ea + -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk + -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools + -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni + -DUVERSION=12.1.0 +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollatorFiles + with the same arguments +- check CLDR diffs + cd $CLDR_SRC + meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml + meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml +- copy to CLDR + cd $CLDR_SRC + cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml + cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml +- run CLDR unit tests, commit to CLDR +- generate ICU zh collation data: run CLDR + org.unicode.cldr.icu.NewLdml2IcuConverter + with program 
arguments + -t collation + -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation + -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental + -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll + -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation + zh + and VM arguments + -ea + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar 
-C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data +or +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC/icu4c/source/data/unidata + cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) + +*** CLDR numbering systems +- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR + for example, look for + ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt + in new blocks (Blocks.txt) + Unicode 12: using Unicode 12 CLDR ticket #11478 + hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong + wcho 1E2F0..1E2F9 Wancho + Unicode 11: using Unicode 11 CLDR ticket #10978 + rohg 10D30..10D39 Hanifi_Rohingya + gong 11DA0..11DA9 Gunjala_Gondi + Earlier: CLDR tickets specific to adding new numbering systems. 
+ Unicode 10: http://unicode.org/cldr/trac/ticket/10219 + Unicode 9: http://unicode.org/cldr/trac/ticket/9692 + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools are checked in: + http://www.unicode.org/utility/trac/log/trunk/unicodetools + +---------------------------------------------------------------------------- *** + +Unicode 12.0 update for ICU 64 + +http://www.unicode.org/versions/Unicode12.0.0/ +http://unicode.org/versions/beta-12.0.0.html +https://www.unicode.org/review/pri389/ +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/reports/tr44/tr44-23.html + +ICU-20203 Unicode 12 + +ICU-20111 move text layout properties data into a data file + +cldrbug 11478: Unicode 12 +Accidentally used ^/trunk instead of ^/branches/markus/uni12 + +* Command-line environment setup + +UNICODE_DATA=~/unidata/uni12/20190309 +CLDR_SRC=~/svn.cldr/uni +ICU_ROOT=~/icu/uni +ICU_SRC=$ICU_ROOT/src +ICUDT=icudt63b +ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in +ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. 
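Before running "configure", it can help to confirm that the header really carries the new version number. The following is a minimal sketch, assuming that uchar.h defines U_UNICODE_VERSION as a quoted string; the function name and the sample header text are hypothetical and only illustrate the check.

```python
import re

# Assumption: uchar.h contains a line like  #define U_UNICODE_VERSION "12.0"
VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)

def unicode_version(header_text):
    """Return the version string from the header text, or None if not found."""
    match = VERSION_RE.search(header_text)
    return match.group(1) if match else None

# Illustrative excerpt, not copied from the real header.
sample_header = '''
/* hypothetical excerpt */
#define U_UNICODE_VERSION "12.0"
'''

print(unicode_version(sample_header))
```

In practice the text would be read from $ICU_SRC/icu4c/source/common/unicode/uchar.h and compared with the version being integrated.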
+ +*** data files & enums & parser code + +* download files +- mkdir -p $UNICODE_DATA +- download Unicode files into $UNICODE_DATA + + subfolders: emoji, idna, security, ucd, uca + + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip + +* for manual diffs and for Unicode Tools input data updates: + remove version suffixes from the file names + ~$ unidata/desuffixucd.py $UNICODE_DATA + (see https://sites.google.com/site/unicodetools/inputdata) + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. 
+ + $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date + +* new constants for new property values +- preparseucd.py error: + ValueError: missing uchar.h enum constants for some property values: + [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic', + u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong', + u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])), + (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))] + = PropertyValueAliases.txt new property values (diff old & new .txt files) + blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls + blk; Elymaic ; Elymaic + blk; Nandinagari ; Nandinagari + blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong + blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers + blk; Small_Kana_Ext ; Small_Kana_Extension + blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A + blk; Tamil_Sup ; Tamil_Supplement + blk; Wancho ; Wancho + -> add to uchar.h + use long property names for enum constants, + for the trailing comment get the block start code point: diff old & new Blocks.txt + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 + + sc ; Elym ; Elymaic + sc ; Hmnp ; Nyiakeng_Puachue_Hmong + sc ; Nand ; Nandinagari + sc ; Wcho ; Wancho + -> uscript.h & com.ibm.icu.lang.UScript + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h 
$CLDR_SRC/common/properties/scriptMetadata.txt + +* update spoof checker UnicodeSet initializers: + inclusionPat & recommendedPat in uspoof.cpp + INCLUSION & RECOMMENDED in SpoofChecker.java +- make sure that the Unicode Tools tree contains the latest security data files +- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator +- update the hardcoded version number there in the DIRECTORY path +- run the tool (no special environment variables needed) +- copy & paste from the Console output into the .cpp & .java files + +* generate normalization data files + cd $ICU_ROOT/dbg/icu4c + bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date + +* build Unicode tools using CMake+make + +$ICU_SRC/tools/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) +# Location of the ICU4C source tree. 
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) + + $ICU_ROOT/dbg$ + mkdir -p tools/unicode/c + cd tools/unicode/c + + $ICU_ROOT/dbg/tools/unicode/c$ + cmake ../../../../src/tools/unicode/c + make + +* generate core properties data files + $ICU_ROOT/dbg/tools/unicode/c$ + genprops/genprops $ICU_SRC/icu4c + genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ + genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..12.0: U+2260, U+226E, U+226F +- nothing new in this Unicode version, no test file to update + +* run & fix ICU4C tests +- update test of default bidi classes: + Bidi range \U0001ED00-\U0001ED4F changes default from R to AL, + see diffs in DerivedBidiClass.txt + + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[] + + UCharacterTest.java TestIteration() defaultBidi[] +- Andy handles RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA + diff the main mapping file, look for bad changes + (for example, more bytes per weight for common characters) + ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt + ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt + +- CLDR root data files are checked into $CLDR_SRC/common/uca/ + cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ + +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt + +- run genuca, see command line above; + deal with + Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: + FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible) + (add the character to genuca.cpp sampleCharsToScripts[]) + + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script) + and cache its values. + Works as long as the script metadata is updated before the collation data. 
+- rebuild ICU4C + +* Unihan collators + https://sites.google.com/site/unicodetools/unihan +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollators + with VM arguments + -ea + -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk + -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools + -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni + -DUVERSION=12.0.0 +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollatorFiles + with the same arguments +- check CLDR diffs + cd $CLDR_SRC + meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml + meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml +- copy to CLDR + cd $CLDR_SRC + cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml + cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml +- run CLDR unit tests, commit to CLDR +- generate ICU zh collation data: run CLDR + org.unicode.cldr.icu.NewLdml2IcuConverter + with program arguments + -t collation + -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation + -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental + -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll + -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation + zh + and VM arguments + -ea + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see 
(ICU4C)/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt63l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b + echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data +or +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC/icu4c/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) 
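For the new enum constants, the two Eclipse find/replace pairs noted earlier for UCharacter.UnicodeBlock can also be tried outside the IDE. This is a rough sketch using Python's re module, not part of the ICU tooling; the block name, number, and start code point in the sample line are hypothetical. The second pattern has only two capture groups, so the trailing comment is group 2 there.

```python
import re

# First pair: uchar.h enum line -> Java int ID constant (three groups).
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"

# Second pair: uchar.h enum line -> Java UnicodeBlock object (two groups).
OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = (r'public static final UnicodeBlock \1 = '
               r'new UnicodeBlock("\1", \1_ID); \2')

def to_java(enum_line):
    """Return (ID constant line, UnicodeBlock object line) for one enum line."""
    return (ID_FIND.sub(ID_REPLACE, enum_line),
            OBJ_FIND.sub(OBJ_REPLACE, enum_line))

# Hypothetical input line in the style of the uchar.h block enum.
sample = "UBLOCK_EXAMPLE_BLOCK = 999, /*[12340]*/"
id_line, obj_line = to_java(sample)
print(id_line)
print(obj_line)
```

The generated lines still need the usual Javadoc and @stable tags added by hand.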
+ +*** CLDR numbering systems +- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR + for example, look for + ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt + in new blocks (Blocks.txt) + Unicode 12: using Unicode 12 CLDR ticket #11478 + hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong + wcho 1E2F0..1E2F9 Wancho + Unicode 11: using Unicode 11 CLDR ticket #10978 + rohg 10D30..10D39 Hanifi_Rohingya + gong 11DA0..11DA9 Gunjala_Gondi + Earlier: CLDR tickets specific to adding new numbering systems. + Unicode 10: http://unicode.org/cldr/trac/ticket/10219 + Unicode 9: http://unicode.org/cldr/trac/ticket/9692 + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools are checked in: + http://www.unicode.org/utility/trac/log/trunk/unicodetools + +---------------------------------------------------------------------------- *** + +ICU 63 addition of ICU support of text layout properties InPC, InSC, vo + +* Command-line environment setup + +UNICODE_DATA=~/unidata/uni11/20180609 +CLDR_SRC=~/svn.cldr/uni +ICU_ROOT=~/icu/mine +ICU_SRC=$ICU_ROOT/src +ICUDT=icudt62b +ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in +ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib + +*** Links + +https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC +https://unicode-org.atlassian.net/browse/ICU-12850 vo + +*** data files & enums & parser code + +* API additions +- for each of the three new enumerated properties + + uchar.h: add the enum UProperty constant UCHAR_<long prop name> + + uchar.h: update UCHAR_INT_LIMIT + + uchar.h: add the enum U<long prop name> + with constants U_<short prop name>_<long value name> + + UProperty.java: add the constant <long prop name> + + UProperty.java: update INT_LIMIT + + UCharacter.java: add the interface <long prop name> + with constants 
<long value name> + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + + It also writes tools/unicode/c/genprops/pnames_data.h with property and value + names and aliases. + + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +* preparseucd.py changes +- add new property short names (uppercase) to _prop_and_value_re + so that ParseUCharHeader() parses the new enum constants + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date + +* build Unicode tools using CMake+make + +$ICU_SRC/tools/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) +# Location of the ICU4C source tree. +set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c) + + $ICU_ROOT/dbg$ + mkdir -p tools/unicode/c + cd tools/unicode/c + + $ICU_ROOT/dbg/tools/unicode/c$ + cmake ../../../../../src/tools/unicode/c + make + +* generate core properties data files + $ICU_ROOT/dbg/tools/unicode/c$ + genprops/genprops $ICU_SRC/icu4c +- rebuild ICU (make install) & tools + +* write data for runtime, hardcoded for now +- add genprops/layoutpropsbuilder.cpp with pieces from sibling files +- generate new icu4c/source/common/ulayout_props_data.h +- for each of the three new enumerated properties + + int property max value + + small, 8-bit UCPTrie + (A small 16-bit trie with bit fields for these three properties + is very nearly the same size as the sum of the three.) + +* wire into C++ +- uprops.cpp: #include ulayout_props_data.h +- uprops.cpp: add getInPC() etc. 
functions +- uprops.cpp: add lines to intProps[], include max values +- uprops.h: add UPropertySource constants +- uprops.cpp: add uprops_addPropertyStarts(src) +- uniset_props.cpp: add to UnicodeSet_initInclusion() +- intltest/ucdtest.cpp: write unit tests + +* update Java data files +- refresh just the pnames.icu file with the new property [value] names, just to be safe +- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* wire into Java +- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++ +- UCharacterProperty.java: for each new property + + create a nested class to hold its CodePointTrie + + initialize it from a string literal + + paste in the initializer printed by genprops + + add a new IntProperty object to the intProps[] array + + use the correct max int value for each property, also printed by genprops +- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set) +- UnicodeSet.java: add to getInclusions() +- UCharacterTest.java: write unit tests + +---------------------------------------------------------------------------- *** + +Unicode 11.0 update for ICU 62 + +http://www.unicode.org/versions/Unicode11.0.0/ +http://unicode.org/versions/beta-11.0.0.html +https://www.unicode.org/review/pri372/ +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/reports/tr44/tr44-21.html + +* Command-line environment setup + +UNICODE_DATA=~/unidata/uni11/20180521 +CLDR_SRC=~/svn.cldr/uni +ICU_ROOT=~/svn.icu/uni +ICU_SRC=$ICU_ROOT/src +ICUDT=icudt61b 
+ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in +ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib + +*** ICU Trac + +- ticket:13630: Unicode 11 +- ^/branches/markus/uni11 + +*** CLDR Trac + +- cldrbug 10978: Unicode 11 +- ^/branches/markus/uni11 + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* download files +- mkdir -p $UNICODE_DATA +- download Unicode files into $UNICODE_DATA + + subfolders: emoji, idna, security, ucd, uca + + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip + +* for manual diffs and for Unicode Tools input data updates: + remove version suffixes from the file names + ~$ unidata/desuffixucd.py $UNICODE_DATA + (see https://sites.google.com/site/unicodetools/inputdata) + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. 
+ + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date + +* preparseucd.py changes +- fix other errors + NameError: unknown property Extended_Pictographic + -> add Extended_Pictographic binary property + -> add new short names for all Emoji properties + +* new constants for new property values +- preparseucd.py error: + ValueError: missing uchar.h enum constants for some property values: + [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar', + u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals', + u'Indic_Siyaq_Numbers'])), + (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])), + (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])), + (u'GCB', set([u'LinkC', u'Virama'])), + (u'WB', set([u'WSegSpace']))] + = PropertyValueAliases.txt new property values (diff old & new .txt files) + blk; Chess_Symbols ; Chess_Symbols + blk; Dogra ; Dogra + blk; Georgian_Ext ; Georgian_Extended + blk; Gunjala_Gondi ; Gunjala_Gondi + blk; Hanifi_Rohingya ; Hanifi_Rohingya + blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers + blk; Makasar ; Makasar + blk; Mayan_Numerals ; Mayan_Numerals + blk; Medefaidrin ; Medefaidrin + blk; Old_Sogdian ; Old_Sogdian + blk; Sogdian ; Sogdian + -> add to uchar.h + use long property names for enum constants, + for the trailing comment get the block start code point: diff old & new Blocks.txt + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 + + GCB; LinkC ; LinkingConsonant + GCB; Virama ; Virama + -> uchar.h & UCharacter.GraphemeClusterBreak + -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76 + + InSC; 
Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed + -> ignore: ICU does not yet support this property + + jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya + jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa + -> uchar.h & UCharacter.JoiningGroup + + sc ; Dogr ; Dogra + sc ; Gong ; Gunjala_Gondi + sc ; Maka ; Makasar + sc ; Medf ; Medefaidrin + sc ; Rohg ; Hanifi_Rohingya + sc ; Sogd ; Sogdian + sc ; Sogo ; Old_Sogdian + -> uscript.h & com.ibm.icu.lang.UScript + -> Nushu had been added already + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + + WB ; WSegSpace ; WSegSpace + -> uchar.h & UCharacter.WordBreak + +* New short names for emoji properties +- see UTS #51 +- short names set in preparseucd.py + +* New properties +- boolean emoji property Extended_Pictographic + -> added in preparseucd.py + -> uchar.h & UProperty.java +- misc. property Equivalent_Unified_Ideograph (EqUIdeo) + as shown in PropertyValueAliases.txt + -> ignore for now + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt + +* update spoof checker UnicodeSet initializers: + inclusionPat & recommendedPat in uspoof.cpp + INCLUSION & RECOMMENDED in SpoofChecker.java +- make sure that the Unicode Tools tree contains the latest security data files +- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator +- update the hardcoded version number there in the DIRECTORY path +- run the tool (no special environment variables needed) +- copy & paste from the Console output into the .cpp & .java files + +* generate normalization data files + cd $ICU_ROOT/dbg/icu4c + bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o 
$ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date + +* build Unicode tools using CMake+make + +$ICU_SRC/tools/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) +# Location of the ICU4C source tree. +set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c) + + $ICU_ROOT/dbg$ + mkdir -p tools/unicode/c + cd tools/unicode/c + + $ICU_ROOT/dbg/tools/unicode/c$ + cmake ../../../../src/tools/unicode/c + make + +* generate core properties data files + $ICU_ROOT/dbg/tools/unicode/c$ + genprops/genprops $ICU_SRC/icu4c + genuca/genuca --hanOrder implicit $ICU_SRC/icu4c + genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c +- rebuild ICU (make install) & tools + +* Fix case props + genprops error: casepropsbuilder: too many exceptions words + genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR +- With the addition of Georgian Mtavruli capital letters, + there are now too many simple case mappings with big mapping deltas + that yield uncompressible exceptions. +- Changing the data structure (now formatVersion 4), + adding one bit for no-simple-case-folding (for Cherokee), and + one optional slot for a big delta (for most faraway mappings), + together with another bit for whether that is negative. + This makes most Cherokee & Georgian etc. case mappings compressible, + reducing the number of exceptions words. 
+- Further changes to gain one more bit for the exceptions index, + for future growth. For details, see casepropsbuilder.cpp. + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..11.0: U+2260, U+226E, U+226F +- nothing new in this Unicode version, no test file to update + +* run & fix ICU4C tests +- Andy handles RBBI & spoof check test failures + +- Errors in char.txt, word.txt, word_POSIX.txt like + createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16 + because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty. + -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them + not empty, just to get ICU building. + -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables + and properties together with the rules that used them (GB 10, WB 14). + -> Andy adjusts the rule sets further to sync with + Unicode 11 grapheme, word, and line break spec changes. 
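
The U_BRK_RULE_EMPTY_SET errors above come from break rules that still reference property values with no code points. As a rough sketch only (not part of the ICU tools), the following Python lists such empty values before the rules are compiled. It assumes the usual UCD data-file layout of "code point or range ; value # comment". The function names and the sample excerpt are hypothetical and for illustration only.

```python
def parse_property_file(text):
    """Map each property value name to the set of code points assigned to it.

    Assumes UCD-style lines: "XXXX[..YYYY] ; Value # comment".
    """
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        values.setdefault(fields[1], set()).update(range(first, last + 1))
    return values


def empty_values(referenced, values):
    """Return the referenced value names that have no code points at all."""
    return sorted(name for name in referenced if not values.get(name))


# Hypothetical excerpt for illustration only; not copied from a real data file.
SAMPLE = """
# made-up sample data
1F3FB..1F3FF  ; Extend
200D          ; ZWJ
"""

# Value names that a rule file is assumed to reference.
REFERENCED = ["Extend", "ZWJ", "E_Base_GAZ"]

print(empty_values(REFERENCED, parse_property_file(SAMPLE)))
```

Any name reported here would need one of the workarounds described above, or removal of the rules that use it.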
+ +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA + diff the main mapping file, look for bad changes + (for example, more bytes per weight for common characters) + ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt + ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt + +- CLDR root data files are checked into $CLDR_SRC/common/uca/ + cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ + +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt + +- run genuca, see command line above; + deal with + Error: Unknown 
script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: + FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible) + (add the character to genuca.cpp sampleCharsToScripts[]) + + look up the USCRIPT_ code for the new sample characters + (should be obvious from the comment in the error output) + + *add* mappings to sampleCharsToScripts[], do not replace them + (in case the script sample characters flip-flop) + + insert new scripts in DUCET script order, see the top_byte table + at the beginning of FractionalUCA.txt +- rebuild ICU4C + +* Unihan collators + https://sites.google.com/site/unicodetools/unihan +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollators + with VM arguments + -ea + -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk + -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools + -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni + -DUVERSION=11.0.0 +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollatorFiles + with the same arguments +- check CLDR diffs + cd $CLDR_SRC + meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml + meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml +- copy to CLDR + cd $CLDR_SRC + cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml + cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml +- run CLDR unit tests, commit to CLDR +- generate ICU zh collation data: run CLDR + org.unicode.cldr.icu.NewLdml2IcuConverter + with program arguments + -t collation + -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation + -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental + -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll + -p 
/usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation + zh + and VM arguments + -ea + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt61l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b + echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar 
/tmp/icu4j/main/shared/data + make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data +or +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. 
from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC/icu4c/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) + +*** CLDR numbering systems +- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR + Unicode 11: using Unicode 11 CLDR ticket #10978 + rohg 10D30..10D39 Hanifi_Rohingya + gong 11DA0..11DA9 Gunjala_Gondi + Earlier: CLDR tickets specific to adding new numbering systems. 
+ Unicode 10: http://unicode.org/cldr/trac/ticket/10219 + Unicode 9: http://unicode.org/cldr/trac/ticket/9692 + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools are checked in: + http://www.unicode.org/utility/trac/log/trunk/unicodetools + +---------------------------------------------------------------------------- *** + +Unicode 10.0 update for ICU 60 + +http://www.unicode.org/versions/Unicode10.0.0/ +http://www.unicode.org/versions/beta-10.0.0.html +http://blog.unicode.org/2017/03/unicode-100-beta-review.html +http://www.unicode.org/review/pri350/ +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/reports/tr44/tr44-19.html + +* Command-line environment setup + +UNICODE_DATA=~/unidata/uni10/20170605 +CLDR_SRC=~/svn.cldr/uni10 +ICU_ROOT=~/svn.icu/uni10 +ICU_SRC=$ICU_ROOT/src +ICUDT=icudt60b +ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in +ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib + +*** ICU Trac + +- ticket:12985: Unicode 10 +- ticket:13061: undo hacks from emoji 5.0 update +- ticket:13062: add Emoji_Component property +- ^/branches/markus/uni10 + +*** CLDR Trac + +- cldrbug 10055: Unicode 10 +- cldrbug 9882: Unicode 10 script metadata +- cldrbug 10219: numbering systems for Unicode 10 + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. 
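
Because "configure" must run only after the version number in uchar.h has been updated, a quick check of the header can catch a forgotten edit. This is a minimal sketch, assuming the version is defined in uchar.h by a line of the form #define U_UNICODE_VERSION "10.0". The function names are hypothetical, and the sample text stands in for the real header file.

```python
import re

# Matches a line such as: #define U_UNICODE_VERSION "10.0"
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def read_unicode_version(header_text):
    """Return the Unicode version string found in the header text, or None."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


def version_is_updated(header_text, expected):
    """True if the header already carries the expected Unicode version."""
    return read_unicode_version(header_text) == expected


# Stand-in for the contents of uchar.h (illustrative excerpt only).
SAMPLE_HEADER = '''
/* ... */
#define U_UNICODE_VERSION "10.0"
/* ... */
'''

print(read_unicode_version(SAMPLE_HEADER))
print(version_is_updated(SAMPLE_HEADER, "10.0"))
```

In practice the header text would be read from $ICU_SRC/icu4c/source/common/unicode/uscript.h's sibling uchar.h, and the same check could be extended to the other files listed under "Unicode version numbers".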
+ +*** data files & enums & parser code + +* download files +- mkdir -p $UNICODE_DATA +- download Unicode 10.0 files into $UNICODE_DATA + + subfolders: ucd, uca, idna, security + + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip +- download emoji 5.0 files into $UNICODE_DATA/emoji + +* for manual diffs: remove version suffixes from the file names + ~$ unidata/desuffixucd.py $UNICODE_DATA + (see https://sites.google.com/site/unicodetools/inputdata) + +* process and/or copy files +- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC + + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + + For debugging, and tweaking how ppucd.txt is written, + the tool has an --only_ppucd option: + py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile + +- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. 
+ + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date + +* preparseucd.py changes +- remove or add new Unicode scripts from/to the + only-in-ISO-15924 list according to the error messages: + ValueError: remove ['Nshu'] from _scripts_only_in_iso15924 + -> adjust _scripts_only_in_iso15924 as indicated +- fix other errors + Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo'] + -> add vo=Vertical_Orientation to _ignored_properties + -> later removed again, parsing the file, even though we do not yet store data for runtime use + +* new constants for new property values +- preparseucd.py error: + ValueError: missing uchar.h enum constants for some property values: + [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F', + u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])), + (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla', + u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra', + u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])), + (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))] + = PropertyValueAliases.txt new property values (diff old & new .txt files) + blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F + blk; Kana_Ext_A ; Kana_Extended_A + blk; Masaram_Gondi ; Masaram_Gondi + blk; Nushu ; Nushu + blk; Soyombo ; Soyombo + blk; Syriac_Sup ; Syriac_Supplement + blk; Zanabazar_Square ; Zanabazar_Square + -> add to uchar.h + use long property names for enum constants, + for the trailing comment get the block start code point: diff old & new Blocks.txt + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 + + jg ; Malayalam_Bha ; Malayalam_Bha + jg ; Malayalam_Ja ; 
Malayalam_Ja + jg ; Malayalam_Lla ; Malayalam_Lla + jg ; Malayalam_Llla ; Malayalam_Llla + jg ; Malayalam_Nga ; Malayalam_Nga + jg ; Malayalam_Nna ; Malayalam_Nna + jg ; Malayalam_Nnna ; Malayalam_Nnna + jg ; Malayalam_Nya ; Malayalam_Nya + jg ; Malayalam_Ra ; Malayalam_Ra + jg ; Malayalam_Ssa ; Malayalam_Ssa + jg ; Malayalam_Tta ; Malayalam_Tta + -> uchar.h & UCharacter.JoiningGroup + + sc ; Gonm ; Masaram_Gondi + sc ; Nshu ; Nushu + sc ; Soyo ; Soyombo + sc ; Zanb ; Zanabazar_Square + -> uscript.h & com.ibm.icu.lang.UScript + -> Nushu had been added already + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* New properties as shown in PropertyValueAliases.txt changes +- boolean Emoji_Component from emoji 5 + -> uchar.h & UProperty.java +- boolean + # Regional_Indicator (RI) + + RI ; N ; No ; F ; False + RI ; Y ; Yes ; T ; True + -> uchar.h & UProperty.java + -> single immutable range, to be hardcoded +- boolean + # Prepended_Concatenation_Mark (PCM) + + PCM; N ; No ; F ; False + PCM; Y ; Yes ; T ; True + -> was new in Unicode 9 + -> uchar.h & UProperty.java +- enumerated + # Vertical_Orientation (vo) + + vo ; R ; Rotated + vo ; Tr ; Transformed_Rotated + vo ; Tu ; Transformed_Upright + vo ; U ; Upright + -> only pre-parsed for now, but not yet stored for runtime use + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt + +* generate normalization data files + cd $ICU_ROOT/dbg/icu4c + bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o 
$ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date + +* build Unicode tools using CMake+make + +$ICU_SRC/tools/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) +# Location of the ICU4C source tree. +set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c) + + $ICU_ROOT/dbg/tools/unicode/c$ + cmake ../../../../src/tools/unicode/c + make + +* generate core properties data files + $ICU_ROOT/dbg/tools/unicode/c$ + genprops/genprops $ICU_SRC/icu4c + genuca/genuca --hanOrder implicit $ICU_SRC/icu4c + genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..10.0: U+2260, U+226E, U+226F +- nothing new in this Unicode version, no test file to update + +* run & fix ICU4C tests +- Andy handles RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA +- CLDR root data files are checked into $CLDR_SRC/common/uca/ + cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ + +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt +- update 
source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt + +- run genuca, see command line above; + deal with + Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt: + FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible) + (add the character to genuca.cpp sampleCharsToScripts[]) + + look up the USCRIPT_ code for the new sample characters + (should be obvious from the comment in the error output) + + *add* mappings to sampleCharsToScripts[], do not replace them + (in case the script sample characters flip-flop) + + insert new scripts in DUCET script order, see the top_byte table + at the beginning of FractionalUCA.txt +- rebuild ICU4C + +* Unihan collators + https://sites.google.com/site/unicodetools/unihan +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollators + with 
VM arguments + -ea + -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk + -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools + -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 + -DUVERSION=10.0.0 +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollatorFiles + with the same arguments +- check CLDR diffs + cd $CLDR_SRC + meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml + meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml +- copy to CLDR + cd $CLDR_SRC + cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml + cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml +- run CLDR unit tests, commit to CLDR +- generate ICU zh collation data: run CLDR + org.unicode.cldr.icu.NewLdml2IcuConverter + with program arguments + -t collation + -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation + -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental + -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll + -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation + zh + and VM arguments + -ea + -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + 
... + Unicode .icu files built to ./out/build/icudt60l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b + echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd $ICU_ROOT/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data +or +- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC/icu4c/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) 
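
The new enum constants announced here include the block constants that were produced with the Eclipse find/replace patterns in the "new constants for new property values" step. As a hedged sketch, the same two substitutions can be expressed with Python's re module. The patterns are taken from this log. The function names and the sample line (including its numeric value and code point comment) are placeholders, not real uchar.h content.

```python
import re

# Same patterns as the Eclipse "find" expressions in this log.
ID_PATTERN = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJECT_PATTERN = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_id_constant(line):
    """Turn a uchar.h block enum line into a UnicodeBlock *_ID constant."""
    return ID_PATTERN.sub(
        r"public static final int \g<1>_ID = \g<2>; \g<3>", line)


def to_block_object(line):
    """Turn a uchar.h block enum line into a UnicodeBlock object declaration."""
    return OBJECT_PATTERN.sub(
        r'public static final UnicodeBlock \g<1> = '
        r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>', line)


# Placeholder input: name, value, and comment are invented for illustration.
SAMPLE = "UBLOCK_EXAMPLE_BLOCK = 999, /*[F0000]*/"

print(to_id_constant(SAMPLE))
print(to_block_object(SAMPLE))
```

The \g<1> form is used instead of \1 only to keep the group reference unambiguous next to the following "_ID" text; the result is the same as the Eclipse replacement strings.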
+ +*** CLDR numbering systems +- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket + Unicode 10: http://unicode.org/cldr/trac/ticket/10219 + Unicode 9: http://unicode.org/cldr/trac/ticket/9692 + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools are checked in: + http://www.unicode.org/utility/trac/log/trunk/unicodetools + +---------------------------------------------------------------------------- *** + +Emoji 5.0 update for ICU 59 +- ICU 59 mostly remains on Unicode 9.0 +- except updates bidi and segmentation data to Unicode 10 beta + +First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg. + +* Command-line environment setup + +ICU_ROOT=~/svn.icu/trunk +ICU_SRC_DIR=$ICU_ROOT/src +ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c +ICUDT=icudt59b +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in +UNIDATA=$ICU4C_SRC_DIR/source/data/unidata + +*** ICU Trac + +- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released +- changes directly on trunk + +*** data files & enums & parser code + +* download files + +- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca) +- download emoji 5.0 beta files into the same uni90e50 folder +- download Unicode 10.0 beta files: ucd + + copy Unicode 10 bidi files to the uni90e50/ucd folder: + BidiBrackets.txt + BidiCharacterTest.txt + BidiMirroring.txt + BidiTest.txt + extracted/DerivedBidiClass.txt + + copy Unicode 10 segmentation files to the uni90e50/ucd folder: + LineBreak.txt + auxiliary/* + +* preparseucd.py changes +- adjust for combined trunks +- write new copyright lines +- ignore new Emoji_Component property for now + +* process and/or copy files +- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR + + This writes 
files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + +- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date + +* build Unicode tools using CMake+make + +~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) +# Location of the ICU4C source tree. +set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c) + + ~/svn.icu/trunk/dbg/tools/unicode/c$ + cmake ../../../../src/tools/unicode/c + make + +* generate core properties data files + ~/svn.icu/trunk/dbg/tools/unicode/c$ + genprops/genprops $ICU4C_SRC_DIR +- rebuild ICU (make install) & tools + +* run & fix ICU4C tests +- Andy handles RBBI & spoof check test failures + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ Unicode .icu files built to ./out/build/icudt59l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b + echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- 
~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data +or +- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU4C_SRC_DIR/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +---------------------------------------------------------------------------- *** + +Unicode 9.0 update for ICU 58 + +* Command-line environment setup + +ICU_ROOT=~/svn.icu/trunk +ICU_SRC_DIR=$ICU_ROOT/src +ICUDT=icudt58b +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +SRC_DATA_IN=$ICU_SRC_DIR/source/data/in +UNIDATA=$ICU_SRC_DIR/source/data/unidata + +http://www.unicode.org/review/pri323/ -- beta review +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/versions/beta-9.0.0.html +http://www.unicode.org/versions/Unicode9.0.0/ +http://www.unicode.org/reports/tr44/tr44-17.html + +*** ICU Trac + +- ticket:12526: integrate Unicode 9 +- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b +- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b + +*** CLDR Trac + +- cldrbug 9414: UCA 9 +- ^/branches/markus/uni90 at r11518 from trunk at r11517 + +- cldrbug 8745: Unicode 9.0 script metadata + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- 
com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* file preparation + +- download UCD & IDNA files +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- only for manual diffs: remove version suffixes from the file names + ~/unidata/uni70/20140403$ ../../desuffixucd.py . + (see https://sites.google.com/site/unicodetools/inputdata) +- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip +- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + +- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt + and copy to $UNIDATA + cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA + +* preparseucd.py changes +- remove or add new Unicode scripts from/to the + only-in-ISO-15924 list according to the error messages: + ValueError: remove ['Tang'] from _scripts_only_in_iso15924 + ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD + ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD + ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- DerivedNumericValues.txt new numeric values + 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH + 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH + 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS + 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH + 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS + -> change 
uprops.h, corepropsbuilder.cpp/encodeNumericValue(), + uchar.c, UCharacterProperty.java + to support a new series of values +- adjust preparseucd.py for Tangut algorithmic names + in ppucd.txt: + algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH- + -> + algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH- +- avoid block-compressing most String/Miscellaneous property values, + triggered by genprops not coping with a multi-code point Case_Folding on + block;1C80..1C8F;...;Cased;cf=0442;CWCF;... + keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors + +* PropertyAliases.txt changes +- 1 new property PCM=Prepended_Concatenation_Mark + Ignore: Only useful for layout engines. + Ok to list in ppucd.txt. + +* PropertyValueAliases.txt new property values + blk; Adlam ; Adlam + blk; Bhaiksuki ; Bhaiksuki + blk; Cyrillic_Ext_C ; Cyrillic_Extended_C + blk; Glagolitic_Sup ; Glagolitic_Supplement + blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation + blk; Marchen ; Marchen + blk; Mongolian_Sup ; Mongolian_Supplement + blk; Newa ; Newa + blk; Osage ; Osage + blk; Tangut ; Tangut + blk; Tangut_Components ; Tangut_Components + -> add to uchar.h + use long property names for enum constants + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 + + GCB; EB ; E_Base + GCB; EBG ; E_Base_GAZ + GCB; EM ; E_Modifier + GCB; GAZ ; Glue_After_Zwj + GCB; ZWJ ; ZWJ + -> uchar.h & UCharacter.GraphemeClusterBreak + + jg ; African_Feh ; African_Feh + jg ; African_Noon ; African_Noon + jg ; African_Qaf ; African_Qaf + -> uchar.h & UCharacter.JoiningGroup + + lb ; EB ; E_Base + lb ; EM ; E_Modifier + lb ; ZWJ ; ZWJ + -> uchar.h & UCharacter.LineBreak + + sc ; Adlm ; Adlam + sc ; Bhks ; Bhaiksuki + sc ; 
Marc ; Marchen + sc ; Newa ; Newa + sc ; Osge ; Osage + sc ; Tang ; Tangut + -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript + + WB ; EB ; E_Base + WB ; EBG ; E_Base_GAZ + WB ; EM ; E_Modifier + WB ; GAZ ; Glue_After_Zwj + WB ; ZWJ ; ZWJ + -> uchar.h & UCharacter.WordBreak + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt + +* generate normalization data files + cd $ICU_ROOT/dbg + bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt + +* build Unicode tools using CMake+make + +~/svn.icutools/trunk/src/unicode/c/icudefs.txt: + + # Location (--prefix) of where ICU was installed. + set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) + # Location of the ICU source tree. 
+ set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) + + ~/svn.icutools/trunk/dbg/unicode/c$ + cmake ../../../src/unicode/c + make + +* generate core properties data files + ~/svn.icutools/trunk/dbg/unicode/c$ + genprops/genprops $ICU_SRC_DIR + genuca/genuca --hanOrder implicit $ICU_SRC_DIR + genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..9.0: U+2260, U+226E, U+226F +- nothing new in 9.0, no test file to update + +* run & fix ICU4C tests +- Andy handles RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA +- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ + cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/ + +- cd (CLDR UCA branch)/common/uca/ +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt 
$ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt + +- run genuca, see command line above; + deal with + Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt: + FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible) + (add the character to genuca.cpp sampleCharsToScripts[]) + + look up the USCRIPT_ code for the new sample characters + (should be obvious from the comment in the error output) + + *add* mappings to sampleCharsToScripts[], do not replace them + (in case the script sample characters flip-flop) + + insert new scripts in DUCET script order, see the top_byte table + at the beginning of FractionalUCA.txt +- rebuild ICU4C + +* Unihan collators +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollators + with VM arguments + -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk + -DOTHER_WORKSPACE=/home/mscherer/svn.unitools + -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data + -DCLDR_DIR=/home/mscherer/svn.cldr/trunk + -DUVERSION=9.0.0 + -ea +- run Unicode Tools + org.unicode.draft.GenerateUnihanCollatorFiles + with the same arguments +- check CLDR diffs + cd ~/svn.cldr/trunk + meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml + meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml +- copy to CLDR + cd ~/svn.cldr/trunk + cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml + cp ../Generated/cldr/han/replace/Han-Latin.xml 
common/transforms/Han-Latin.xml +- commit to CLDR +- generate ICU zh collation data: run CLDR + org.unicode.cldr.icu.NewLdml2IcuConverter + with program arguments + -t collation + -s /home/mscherer/svn.cldr/trunk/common/collation + -m /home/mscherer/svn.cldr/trunk/common/supplemental + -d /home/mscherer/svn.icu/trunk/src/source/data/coll + -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation + zh + and VM arguments + -DCLDR_DIR=/home/mscherer/svn.cldr/trunk +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ Unicode .icu files built to ./out/build/icudt58l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b + echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd ~/svn.icu/trunk/dbg/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC_DIR/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** LayoutEngine script information + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + + (It also generates ScriptRunData.cpp, which is no longer needed.) + + It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages + (a plain text file) + which maps ICU versions to the numbers of script/language constants + that were added then. + (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) + + The generated files have a current copyright date and "@deprecated" statement. 
+ +* Review changes, fix Java tool if necessary, and copy to ICU4C + cd ~/svn.icu4j/trunk/src + meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools & ICU tools are checked in + http://www.unicode.org/utility/trac/log/trunk/unicodetools + http://bugs.icu-project.org/trac/log/tools/trunk + +---------------------------------------------------------------------------- *** + +New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764 + +Adding +- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge +- new combination/alias codes: Hanb, Jamo + - used in CLDR 29 and in spoof checker +- new Z* code: Zsye + +Add new codes to uscript.h & UScript.java, see Unicode update logs. + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2; \3 + +Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h, +add new script codes. +"Long" script names only where established in Unicode 9 PropertyValueAliases.txt. + +Note: If we have to run preparseucd.py again before the Unicode 9 update, +then we need to manually keep/restore the new script codes. 
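The Eclipse find/replace step above, which turns uscript.h enum lines into UScript.java constants, can also be sketched as a small script. This is only an illustration of the same regular expression, not a tool from the ICU tree. The function name to_java_constant and the sample line (USCRIPT_EXAMPLE with value 999) are hypothetical.

```python
import re

# Same pattern as the "find" expression in the log:
#   USCRIPT_([^ ]+) *= ([0-9]+),(.+)
ENUM_LINE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")


def to_java_constant(line: str) -> str:
    """Rewrite one uscript.h enum line as a UScript.java int constant.

    Lines that do not match the pattern are returned unchanged.
    (Hypothetical helper, for illustration only.)
    """
    return ENUM_LINE.sub(r"public static final int \1 = \2; \3", line)


if __name__ == "__main__":
    # Made-up sample line; the name, value and comment are not real ICU data.
    sample = "USCRIPT_EXAMPLE = 999,/* Exmp */"
    print(to_java_constant(sample))
```

As with the Eclipse replacement, the trailing comment is carried over as-is, and any leading indentation outside the match is left untouched.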
+ +ICU_ROOT=~/svn.icu/trunk +ICU_SRC_DIR=$ICU_ROOT/src +ICUDT=icudt57b +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +SRC_DATA_IN=$ICU_SRC_DIR/source/data/in +UNIDATA=$ICU_SRC_DIR/source/data/unidata + +Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files, +see https://unicode-org.atlassian.net/browse/ICU-12141 + +make install, then icutools cmake & make, then +~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR + +Generate Java data as usual, only update pnames.icu & uprops.icu. + +*** LayoutEngine script information + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + + (It also generates ScriptRunData.cpp, which is no longer needed.) + + It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages + (a plain text file) + which maps ICU versions to the numbers of script/language constants + that were added then. + (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) + + The generated files have a current copyright date and "@deprecated" statement. + +* Review changes, fix Java tool if necessary, and copy to ICU4C + cd ~/svn.icu4j/trunk/src + meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout + +---------------------------------------------------------------------------- *** + +Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802 + +Edit preparseucd.py to add & parse new properties. +They share the UCD property namespace but are not listed in PropertyAliases.txt. 
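A minimal sketch of reading such property lines, assuming the data file uses the usual UCD layout of "code point or range ; property name # comment". That layout is an assumption here, and this is not the actual preparseucd.py code. The function name parse_property_line and the sample lines are hypothetical.

```python
def parse_property_line(line):
    """Parse 'start[..end] ; Property # comment' into (start, end, property).

    Returns None for blank lines and comment-only lines.
    Assumes the semicolon-separated UCD-style layout described above.
    """
    # Drop the trailing comment, if any, then surrounding whitespace.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    code_range, prop = fields[0], fields[1]
    if ".." in code_range:
        start, end = code_range.split("..")
    else:
        # A single code point is treated as a one-element range.
        start = end = code_range
    return int(start, 16), int(end, 16), prop


if __name__ == "__main__":
    # Illustrative input lines, not copied from a real data file.
    for text in ["# comment only", "00A9 ; Emoji", "1F600..1F64F ; Emoji # sample"]:
        print(parse_property_line(text))
```

Because these properties are not listed in PropertyAliases.txt, a real parser would also need its own list of accepted property names; that part is left out of the sketch.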
+ +Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/ +Initial data from emoji/2.0/ + +ICU_ROOT=~/svn.icu/trunk +ICU_SRC_DIR=$ICU_ROOT/src +ICUDT=icudt56b +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +SRC_DATA_IN=$ICU_SRC_DIR/source/data/in +UNIDATA=$ICU_SRC_DIR/source/data/unidata + +Add binary-property constants to uchar.h enum UProperty & UProperty.java. + +~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src +(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.) + +Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java + +make install, then icutools cmake & make, then +~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR + +Generate Java data as usual, only update pnames.icu & uprops.icu. + +---------------------------------------------------------------------------- *** + +Unicode 8.0 update for ICU 56 + +* Command-line environment setup + +ICU_ROOT=~/svn.icu/trunk +ICU_SRC_DIR=$ICU_ROOT/src +ICUDT=icudt56b +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +SRC_DATA_IN=$ICU_SRC_DIR/source/data/in +UNIDATA=$ICU_SRC_DIR/source/data/unidata + +http://www.unicode.org/review/pri297/ -- beta review +http://www.unicode.org/reports/uax-proposed-updates.html +http://unicode.org/versions/beta-8.0.0.html +http://www.unicode.org/versions/Unicode8.0.0/ +http://www.unicode.org/reports/tr44/tr44-15.html + +*** ICU Trac + +- ticket:11574: Unicode 8 +- C++ branches/markus/uni80 at r37351 from trunk at r37343 +- Java branches/markus/uni80 at r37352 from trunk at r37338 + +*** CLDR Trac + +- cldrbug 8311: UCA 8 +- branches/markus/uni80 at r11518 from trunk at r11517 + +- cldrbug 8109: Unicode 8.0 script metadata +- cldrbug 8418: Updated segmentation for Unicode 8.0 + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- 
Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* file preparation + +- download UCD & IDNA files +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- only for manual diffs: remove version suffixes from the file names + ~/unidata/uni70/20140403$ ../../desuffixucd.py . + (see https://sites.google.com/site/unicodetools/inputdata) +- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip +- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. + +- also: from http://unicode.org/Public/security/8.0.0/ download new + confusables.txt & confusablesWholeScript.txt + and copy to $UNIDATA + ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA + ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA + +* initial preparseucd.py changes +- remove new Unicode scripts from the + only-in-ISO-15924 list according to the error message: + ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw'] + from _scripts_only_in_iso15924 + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- property and file name change: + IndicMatraCategory -> IndicPositionalCategory +- UnicodeData.txt unusual numeric values (improper fractions) + 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;; + 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;; + 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;; + 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;; + 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;; + 109FB;MEROITIC 
CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; + 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;; + 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;; + 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;; + 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;; + -> change preparseucd.py to map them to proper fractions (e.g., 1/6) + which are listed in DerivedNumericValues.txt; + keeps storage in data file simple + +* PropertyValueAliases.txt changes +- 10 new Block (blk) values: + blk; Ahom ; Ahom + blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs + blk; Cherokee_Sup ; Cherokee_Supplement + blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E + blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform + blk; Hatran ; Hatran + blk; Multani ; Multani + blk; Old_Hungarian ; Old_Hungarian + blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs + blk; Sutton_SignWriting ; Sutton_SignWriting + -> add to uchar.h + use long property names for enum constants + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- 6 new Script (sc) values: + sc ; Ahom ; Ahom + sc ; Hatr ; Hatran + sc ; Hluw ; Anatolian_Hieroglyphs + sc ; Hung ; Old_Hungarian + sc ; Mult ; Multani + sc ; Sgnw ; SignWriting + -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt + +* generate normalization data files + cd $ICU_ROOT/dbg + bin/gennorm2 
-o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt + +* build Unicode tools using CMake+make + +~/svn.icutools/trunk/src/unicode/c/icudefs.txt: + + # Location (--prefix) of where ICU was installed. + set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) + # Location of the ICU source tree. + set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) + + ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c + ~/svn.icutools/trunk/dbg/unicode/c$ make + +* generate core properties data files +- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR +- rebuild ICU (make install) & tools +- run genuca again (see step above) so that it picks up the new nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..8.0: U+2260, U+226E, U+226F +- nothing new in 8.0, no test file to update + +* run & fix ICU4C tests +- bad Cherokee case folding due to difference in fallbacks: + UCD case folding falls back to no mapping, + ICU runtime case folding falls back to lowercasing; + fixed 
casepropsbuilder.cpp to generate scf mappings to self + when there is an slc mapping but no scf +- Andy handles RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA +- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ +- cd (CLDR UCA branch)/common/uca/ +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt +- restore TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml <collation type="private-unihan"> + and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt +- run genuca, see command line above; + deal with + Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt + (add the character to genuca.cpp sampleCharsToScripts[]) + + look up the script for the new sample 
characters + (e.g., in FractionalUCA.txt) + + *add* mappings to sampleCharsToScripts[], do not replace them + (in case the script sample characters flip-flop) + + insert new scripts in DUCET script order, see the top_byte table + at the beginning of FractionalUCA.txt +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test +- fixed bug in CollationWeights::getWeightRanges() + exposed by new data and CollationTest::TestRootElements + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ Unicode .icu files built to ./out/build/icudt56l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b + echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd ~/svn.icu/trunk/dbg/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC_DIR/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** LayoutEngine script information + +* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more, + because the layout engine was deprecated in ICU 54. + Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java + to write lines that we used to add manually. + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + + (It also generates ScriptRunData.cpp, which is no longer needed.) 
+ + It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages + (a plain text file) + which maps ICU versions to the numbers of script/language constants + that were added then. + (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) + + The generated files have a current copyright date and "@deprecated" statement. + +* Review changes, fix Java tool if necessary, and copy to ICU4C + cd ~/svn.icu4j/trunk/src + meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools & ICU tools are checked in + http://www.unicode.org/utility/trac/log/trunk/unicodetools + http://bugs.icu-project.org/trac/log/tools/trunk + +---------------------------------------------------------------------------- *** + +Unicode 7.0 update for ICU 54 + +http://www.unicode.org/review/pri271/ -- beta review +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/versions/beta-7.0.0.html#notable_issues +http://www.unicode.org/reports/tr44/tr44-13.html + +*** ICU Trac + +- ticket 10821: Unicode 7.0, UCA 7.0 +- C++ branches/markus/uni70 at r35584 from trunk at r35580 +- Java branches/markus/uni70 at r35587 from trunk at r35545 + +*** CLDR Trac + +- ticket 7195: UCA 7.0 CLDR root collation +- branches/markus/uni70 at r10062 from trunk at r10061 + +- ticket 6762: script metadata for Unicode 7.0 new scripts + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- 
com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* file preparation + +- download UCD & IDNA files +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- only for manual diffs: remove version suffixes from the file names + ~/unidata/uni70/20140403$ ../../desuffixucd.py . + (see https://sites.google.com/site/unicodetools/inputdata) +- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip +- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Restore TODO diffs in source/data/unidata/UCARules.txt + cd $ICU_SRC_DIR + meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt +- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt + +- also: from http://unicode.org/Public/security/7.0.0/ download new + confusables.txt & confusablesWholeScript.txt + and copy to $ICU_ROOT/src/source/data/unidata/ + +* initial preparseucd.py changes +- remove new Unicode scripts from the + only-in-ISO-15924 list according to the error message: + ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass', + 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm', + 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj'] + from _scripts_only_in_iso15924 + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- NamesList.txt now has a heading with a non-ASCII character + + keep ppucd.txt in platform charset, rather than changing tool/test parsers + + escape non-ASCII characters in heading comments +- gets 
Unicode copyright line from PropertyAliases.txt which is currently still at 2013 + + get the copyright from the first file whose copyright line contains the current year + +* PropertyValueAliases.txt changes +- 32 new Block (blk) values: + blk; Bassa_Vah ; Bassa_Vah + blk; Caucasian_Albanian ; Caucasian_Albanian + blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers + blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended + blk; Duployan ; Duployan + blk; Elbasan ; Elbasan + blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended + blk; Grantha ; Grantha + blk; Khojki ; Khojki + blk; Khudawadi ; Khudawadi + blk; Latin_Ext_E ; Latin_Extended_E + blk; Linear_A ; Linear_A + blk; Mahajani ; Mahajani + blk; Manichaean ; Manichaean + blk; Mende_Kikakui ; Mende_Kikakui + blk; Modi ; Modi + blk; Mro ; Mro + blk; Myanmar_Ext_B ; Myanmar_Extended_B + blk; Nabataean ; Nabataean + blk; Old_North_Arabian ; Old_North_Arabian + blk; Old_Permic ; Old_Permic + blk; Ornamental_Dingbats ; Ornamental_Dingbats + blk; Pahawh_Hmong ; Pahawh_Hmong + blk; Palmyrene ; Palmyrene + blk; Pau_Cin_Hau ; Pau_Cin_Hau + blk; Psalter_Pahlavi ; Psalter_Pahlavi + blk; Shorthand_Format_Controls ; Shorthand_Format_Controls + blk; Siddham ; Siddham + blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers + blk; Sup_Arrows_C ; Supplemental_Arrows_C + blk; Tirhuta ; Tirhuta + blk; Warang_Citi ; Warang_Citi + -> add to uchar.h + use long property names for enum constants + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- 28 new Joining_Group (jg) values: + jg ; Manichaean_Aleph ; Manichaean_Aleph + jg ; Manichaean_Ayin ; Manichaean_Ayin + jg ; Manichaean_Beth ; Manichaean_Beth + jg ; Manichaean_Daleth ; Manichaean_Daleth + jg ; Manichaean_Dhamedh ; 
Manichaean_Dhamedh + jg ; Manichaean_Five ; Manichaean_Five + jg ; Manichaean_Gimel ; Manichaean_Gimel + jg ; Manichaean_Heth ; Manichaean_Heth + jg ; Manichaean_Hundred ; Manichaean_Hundred + jg ; Manichaean_Kaph ; Manichaean_Kaph + jg ; Manichaean_Lamedh ; Manichaean_Lamedh + jg ; Manichaean_Mem ; Manichaean_Mem + jg ; Manichaean_Nun ; Manichaean_Nun + jg ; Manichaean_One ; Manichaean_One + jg ; Manichaean_Pe ; Manichaean_Pe + jg ; Manichaean_Qoph ; Manichaean_Qoph + jg ; Manichaean_Resh ; Manichaean_Resh + jg ; Manichaean_Sadhe ; Manichaean_Sadhe + jg ; Manichaean_Samekh ; Manichaean_Samekh + jg ; Manichaean_Taw ; Manichaean_Taw + jg ; Manichaean_Ten ; Manichaean_Ten + jg ; Manichaean_Teth ; Manichaean_Teth + jg ; Manichaean_Thamedh ; Manichaean_Thamedh + jg ; Manichaean_Twenty ; Manichaean_Twenty + jg ; Manichaean_Waw ; Manichaean_Waw + jg ; Manichaean_Yodh ; Manichaean_Yodh + jg ; Manichaean_Zayin ; Manichaean_Zayin + jg ; Straight_Waw ; Straight_Waw + -> uchar.h & UCharacter.JoiningGroup +- 23 new Script (sc) values: + sc ; Aghb ; Caucasian_Albanian + sc ; Bass ; Bassa_Vah + sc ; Dupl ; Duployan + sc ; Elba ; Elbasan + sc ; Gran ; Grantha + sc ; Hmng ; Pahawh_Hmong + sc ; Khoj ; Khojki + sc ; Lina ; Linear_A + sc ; Mahj ; Mahajani + sc ; Mani ; Manichaean + sc ; Mend ; Mende_Kikakui + sc ; Modi ; Modi + sc ; Mroo ; Mro + sc ; Narb ; Old_North_Arabian + sc ; Nbat ; Nabataean + sc ; Palm ; Palmyrene + sc ; Pauc ; Pau_Cin_Hau + sc ; Perm ; Old_Permic + sc ; Phlp ; Psalter_Pahlavi + sc ; Sidd ; Siddham + sc ; Sind ; Khudawadi + sc ; Tirh ; Tirhuta + sc ; Wara ; Warang_Citi + -> uscript.h (many were added before) + comment "Mende Kikakui" for USCRIPT_MENDE + add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2; \3 +- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2012-11-01) + Ahom 338 Ahom + Hatr 127 Hatran + 
Mult 323 Multani + (added 2013-10-12) + Modi 324 Modi + Pauc 263 Pau Cin Hau + Sidd 302 Siddham + -> uscript.h (some overlap with additions from Unicode) + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2; \3 + -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924 + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt + +* generate normalization data files +- cd $ICU_ROOT/dbg +- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in +- UNIDATA=$ICU_SRC_DIR/source/data/unidata +- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource +- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + +~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt + +* build Unicode tools using CMake+make + +~/svn.icutools/trunk/src/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst) +# Location of the ICU source tree. 
+set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src) + +~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c +~/svn.icutools/trunk/dbg/unicode/c$ make + +* genprops work +- new code point range for Joining_Group values: 10AC0..10AFF Manichaean + + add second array of Joining_Group values for at most 10800..10FFF + icutools: unicode/c/genprops/bidipropsbuilder.cpp + icu: source/common/ubidi_props.h/.c/_data.h + icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java + +* generate core properties data files +- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR +- rebuild ICU (make install) & tools +- run genuca again (see step above) so that it picks up the new nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..7.0: U+2260, U+226E, U+226F +- nothing new in 7.0, no test file to update + +* run & fix ICU4C tests + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ Unicode .icu files built to ./out/build/icudt53l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + ICUDT=icudt54b + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cd ~/svn.icu/uni70/dbg/data/out/icu4j + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 
+- refresh ICU4J + ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC_DIR/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* UCA + +- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/ +- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata) +- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/ +- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA +- output files are in ~/svn.unitools/Generated/uca/7.0.0/ +- review data; compare files, use blankweights.sed or similar + ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt +- cd ~/svn.unitools/Generated/uca/7.0.0/ +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") + cp CollationAuxiliary/UCA_Rules_SHORT.txt 
$ICU_SRC_DIR/source/data/unidata/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) + cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data +- run genuca, see command line above +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ICUDT=icudt54b + ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test +- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors +- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch + ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/ + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar 
~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +* run & fix ICU4J tests + +*** LayoutEngine script information + +(For details see the Unicode 5.2 change log below.) + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + (It also generates ScriptRunData.cpp, which is no longer needed.) + + The generated files have a current copyright date and "@stable" statement. + ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java + for "born stable" Unicode API constants, and to stop parsing ICU version numbers + which may not contain dots any more. + +- diff current <icu>/source/layout files vs. generated ones + ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + review and manually merge desired changes; + fix gratuitous changes, incorrect @draft/@stable and missing aliases; + Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. +- if you just copy the above files, then + fix mixed line endings, review the diffs as above and restore changes to API tags etc.; + manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) 
+ +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +Unicode 6.3 update + +http://www.unicode.org/review/pri249/ -- beta review +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/versions/beta-6.3.0.html#notable_issues +http://www.unicode.org/reports/tr44/tr44-11.html + +*** ICU Trac + +- ticket 10128: update ICU to Unicode 6.3 beta +- ticket 10168: update ICU to Unicode 6.3 final +- C++ branches/markus/uni63 at r33552 from trunk at r33551 +- Java branches/markus/uni63 at r33550 from trunk at r33553 + +- ticket 10142: implement Unicode 6.3 bidi algorithm additions + +*** Unicode version numbers +- makedata.mak +- uchar.h + (configure.in & configure: have been modified to extract the version from uchar.h) +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* file preparation + +- download UCD, UCA & IDNA files +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- modify preparseucd.py: + parse new file BidiBrackets.txt + with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type +- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. 
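The BidiBrackets.txt parsing step from the file preparation list above can be sketched as follows. This is a minimal sketch, assuming the usual UCD layout of three semicolon-separated fields per data line (code point; Bidi_Paired_Bracket; Bidi_Paired_Bracket_Type) with trailing # comments. parse_bidi_brackets() and SAMPLE are hypothetical names used for illustration only; this is not the actual preparseucd.py code.

```python
# Sketch only: parse BidiBrackets.txt-style lines into per-code-point values.
# parse_bidi_brackets() and SAMPLE are hypothetical illustrations,
# not code taken from preparseucd.py.

def parse_bidi_brackets(lines):
    """Return {code point: {"bpb": paired code point, "bpt": "o" or "c"}}."""
    props = {}
    for line in lines:
        # Drop the trailing comment and surrounding white space.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 3:
            raise ValueError("expected 3 fields in line: %r" % line)
        code_point = int(fields[0], 16)
        paired = int(fields[1], 16)
        bracket_type = fields[2]
        # Assumption: data lines only carry o (Open) or c (Close).
        if bracket_type not in ("o", "c"):
            raise ValueError("unexpected bpt value in line: %r" % line)
        props[code_point] = {"bpb": paired, "bpt": bracket_type}
    return props


# Small hand-written sample in the same layout (two bracket pairs).
SAMPLE = """\
# sample data in BidiBrackets.txt layout
0028; 0029; o # LEFT PARENTHESIS
0029; 0028; c # RIGHT PARENTHESIS
005B; 005D; o # LEFT SQUARE BRACKET
005D; 005B; c # RIGHT SQUARE BRACKET
"""

if __name__ == "__main__":
    parsed = parse_bidi_brackets(SAMPLE.splitlines())
    for code_point, values in sorted(parsed.items()):
        print("%04X bpb=%04X bpt=%s" % (code_point, values["bpb"], values["bpt"]))
```

Code points that do not appear in the file keep the default bpt=n (None), so a caller only needs to record the listed o and c entries.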
+ +* PropertyAliases.txt changes +- 1 new Enumerated Property + bpt ; Bidi_Paired_Bracket_Type + -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType + -> ubidi_props.h & .c & UBiDiProps.java + -> remember to write the max value at UBIDI_MAX_VALUES_INDEX + -> uprops.cpp + -> change ubidi.icu format version from 2.0 to 2.1 +- 1 new Miscellaneous Property + bpb ; Bidi_Paired_Bracket + -> uchar.h & UProperty.java + -> ppucd.h & .cpp + +* PropertyValueAliases.txt changes +- 3 Bidi_Paired_Bracket_Type (bpt) values: + bpt; c ; Close + bpt; n ; None + bpt; o ; Open + -> uchar.h & UCharacter.BidiPairedBracketType + -> ubidi_props.h & .c & UBiDiProps.java + -> change ubidi.icu format version from 2.0 to 2.1 +- 4 new Bidi_Class (bc) values: + bc ; FSI ; First_Strong_Isolate + bc ; LRI ; Left_To_Right_Isolate + bc ; RLI ; Right_To_Left_Isolate + bc ; PDI ; Pop_Directional_Isolate + -> uchar.h & UCharacterEnums.ECharacterDirection + -> until the bidi code gets updated, + Roozbeh suggests mapping the new bc values to ON (Other_Neutral) +- 3 new Word_Break (WB) values: + WB ; HL ; Hebrew_Letter + WB ; SQ ; Single_Quote + WB ; DQ ; Double_Quote + -> uchar.h & UCharacter.WordBreak + -> first time Word_Break numeric constants exceed 4 bits (now 17 values) +- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2012-10-16) + Aghb 239 Caucasian Albanian + Mahj 314 Mahajani + -> uscript.h + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2;\3 + -> preparseucd.py _scripts_only_in_iso15924 + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + +* generate normalization data files +- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib +- 
~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in +- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + +~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt + +* build Unicode tools using CMake+make + +~/svn.icutools/trunk/src/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst) +# Location of the ICU source tree. +set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src) + +~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c +~/svn.icutools/trunk/dbg/unicode/c$ make + +* generate core properties data files +- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src +- rebuild ICU (make install) & tools +- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..6.3: U+2260, U+226E, U+226F +- nothing new in 6.3, no test file to update + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see 
(ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt52l + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b + ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu 
/tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr +- refresh ICU4J + ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + +* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files + +- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ +- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) +- check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out +- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani +- run genuca, see command line above +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- 
note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* test ICU, fix test code where necessary + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +*** LayoutEngine script information +- skipped for Unicode 6.3: no new scripts + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +Unicode 6.2 update + +http://www.unicode.org/review/pri230/ +http://www.unicode.org/versions/beta-6.2.0.html +http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0 +http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values +http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol +http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols +http://www.unicode.org/reports/tr46/tr46-8.html IDNA +http://unicode.org/Public/idna/6.2.0/ + +*** ICU Trac + +- ticket 9515: Unicode 6.2: final ICU update + +- ticket 9514: UCA 6.2: fix UCARules.txt + +- ticket 9437: update ICU to Unicode 6.2 +- C++ branches/markus/uni62 at r32050 from trunk at r32041 +- Java branches/markus/uni62 at r32068 from trunk at r32066 + +*** Unicode version numbers +- makedata.mak +- uchar.h + (configure.in & configure: have been modified to extract the version from uchar.h) +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +*** data files & enums & parser code + +* file preparation + +- download UCD, UCA & IDNA files +- make sure 
that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- modify preparseucd.py: NamesList.txt is now in UTF-8 +- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. + +* PropertyValueAliases.txt changes +- 1 new Line_Break (lb) value: + lb ; RI ; Regional_Indicator + -> uchar.h & UCharacter.LineBreak +- 1 new Word_Break (WB) value: + WB ; RI ; Regional_Indicator + -> uchar.h & UCharacter.WordBreak +- 1 new Grapheme_Cluster_Break (GCB) value: + GCB; RI ; Regional_Indicator + -> uchar.h & UCharacter.GraphemeClusterBreak + +* 3 new numeric values + The new value -1, which was really supposed to be NaN but that would have required + new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1, + but encodeNumericValue() in corepropsbuilder.cpp had to be fixed. + cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1 + cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1 + The two new values 216000 and 432000 require an addition to the encoding of numeric values. 
+ cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000 + cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000 + -> uprops.h, uchar.c & UCharacterProperty.java + -> cucdtst.c & UCharacterTest.java + +* generate normalization data files +- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib +- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in +- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. +* build Unicode tools using CMake+make + +* generate core properties data files +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src +- in initial bootstrapping, change the UCA version + in source/data/unidata/FractionalUCA.txt to match the new Unicode version +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src +- rebuild ICU (make install) & tools + + if genrb fails to build coll/root.res with an U_INVALID_FORMAT_ERROR, + check if the UCA version in FractionalUCA.txt matches the new Unicode version + (see step above) +- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" 
on non-ASCII characters +- Unicode 6.0..6.2: U+2260, U+226E, U+226F +- nothing new in 6.2, no test file to update + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt50l + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr + ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b + ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu + 
~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b + ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr +- refresh ICU4J + ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + +* UCA + +- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ +- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) +- check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out +- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani +- run genuca, see command line above +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j 
com/ibm/icu/impl/data/icudt50b +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* test ICU, fix test code where necessary + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +*** LayoutEngine script information +- skipped for Unicode 6.2: no new scripts + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +Future Unicode update + +Tools simplified since the Unicode 6.1 update. See +- https://icu.unicode.org/design/props/ppucd +- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972 + +* Unicode version numbers +- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates + +* file preparation +- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py: +- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. + +* PropertyValueAliases.txt changes +- Script codes that are in ISO 15924 but not in Unicode are now listed in + preparseucd.py, in the _scripts_only_in_iso15924 variable. + If there are new ISO codes, then add them. 
+ If Unicode adds some of them, then remove them from the .py variable. + +* UnicodeData.txt changes +- No more manual changes for CJK ranges for algorithmic names; + those are now written to ppucd.txt and genprops reads them from there. + +* generate core properties data files (makeprops.sh was deleted) +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src + +* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt +- it is now generated by preparseucd.py + +* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt +- it is now generated by preparseucd.py +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt + (can be in some subfolder) + +* generate normalization data files +- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib +- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in +- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) +* build Unicode tools using CMake+make + +* new way to call genuca (makeuca.sh was deleted) +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src + +---------------------------------------------------------------------------- *** + +Unicode 6.1 update + +*** ICU Trac + +- ticket 8995 final update to Unicode 6.1 +- ticket 8994 regenerate source/layout/CanonData.cpp + +- ticket 8961 support Unicode "Age" value *names* +- ticket 
8963 support multiple character name aliases & types + +- ticket 8827 "update ICU to Unicode 6.1" +- C++ branches/markus/uni61 at r30864 from trunk at r30843 +- Java branches/markus/uni61 at r30865 from trunk at r30863 + +*** Unicode version numbers +- makedata.mak +- uchar.h + (configure.in & configure: have been modified to extract the version from uchar.h) +- com.ibm.icu.util.VersionInfo +- icutools/unicode/makedefs.sh + + also review & update other definitions in that file, + e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l + +*** data files & enums & parser code + +* file preparation + +~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed +- This prepares both unidata and testdata files in respective output subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. + +* PropertyValueAliases.txt changes +- 11 new block names: + Arabic_Extended_A + Arabic_Mathematical_Alphabetic_Symbols + Chakma + Meetei_Mayek_Extensions + Meroitic_Cursive + Meroitic_Hieroglyphs + Miao + Sharada + Sora_Sompeng + Sundanese_Supplement + Takri + -> add to uchar.h + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- 1 new Joining_Group (jg) value: + Rohingya_Yeh + -> uchar.h & UCharacter.JoiningGroup +- 2 new Line_Break (lb) values: + CJ=Conditional_Japanese_Starter + HL=Hebrew_Letter + -> uchar.h & UCharacter.LineBreak +- 7 new scripts: + sc ; Cakm ; Chakma + sc ; Merc ; Meroitic_Cursive + sc ; Mero ; Meroitic_Hieroglyphs + sc ; Plrd ; Miao + sc ; Shrd ; Sharada + sc ; Sora ; Sora_Sompeng + sc ; Takr ; Takri + -> remove these from SyntheticPropertyValueAliases.txt + 
-> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2011-06-21) + Khoj 322 Khojki + Tirh 326 Tirhuta + and another one added 2011-12-09 + Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs) + -> uscript.h + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2;\3 + -> SyntheticPropertyValueAliases.txt + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* UnicodeData.txt changes +- the last Unihan code point changes from U+9FCB to U+9FCC + search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive) + + do change gennames.c + + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java + +* DerivedBidiClass.txt changes +- 2 new default-AL blocks: +# Arabic Extended-A: U+08A0 - U+08FF (was default-R) +# Arabic Mathematical Alphabetic Symbols: +# U+1EE00 - U+1EEFF (was default-R) +- 2 new default-R blocks: +# Meroitic Hieroglyphs: +# U+10980 - U+1099F +# Meroitic Cursive: U+109A0 - U+109FF + -> should be picked up by the explicit data in the file + +* NameAliases.txt changes +- from + # Each line has two fields + # First field: Code point + # Second field: Alias +- to + # Each line has three fields, as described here: + # + # First field: Code point + # Second field: Alias + # Third field: Type +- Also, the file previously allowed multiple aliases but only now does it + actually provide multiple, even multiple of the same type. For example, + FEFF;BYTE ORDER MARK;alternate + FEFF;BOM;abbreviation + FEFF;ZWNBSP;abbreviation +- This breaks our gennames parser, unames.icu data structure, and API. + Fix gennames to only pick up "correction" aliases. + New ticket #8963 for further changes. 
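The three-field NameAliases.txt format described above, and the interim fix of picking up only "correction" aliases, can be sketched as follows. This is a minimal illustration in Python, not the gennames parser itself: the function names and the sample list are hypothetical, and the U+01A2 line is included only as an example of a "correction" entry.

```python
# Hypothetical sketch: parse NameAliases.txt-style lines (code point; alias; type)
# and keep only the "correction" aliases, as the interim gennames fix describes.

def parse_name_aliases(lines):
    """Yield (code_point, alias, alias_type) for each data line."""
    for raw in lines:
        # Drop comments and surrounding whitespace; skip empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError("expected 3 fields, got %d: %r" % (len(fields), raw))
        code_point, alias, alias_type = fields
        yield int(code_point, 16), alias, alias_type


def correction_aliases(lines):
    """Map code point -> alias, using only aliases of type "correction".

    If a code point had several correction aliases, the last one would win.
    """
    return {
        code_point: alias
        for code_point, alias, alias_type in parse_name_aliases(lines)
        if alias_type == "correction"
    }


# Sample input: the FEFF lines are the ones quoted in the log above;
# the 01A2 line is an illustrative "correction" entry.
SAMPLE = [
    "# First field: Code point",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
```

With this sample, the three U+FEFF aliases are parsed but dropped by the filter, which mirrors why the old one-alias-per-code-point data structure could be kept until ticket #8963.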
+ +* run genpname/preparse.pl (on Linux) + + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname + + make sure that data.h is writable + + perl preparse.pl ~/svn.icu/trunk/src > out.txt + + preparse.pl shows no errors, out.txt Info and Warning lines look ok + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. +* build Unicode tools (at least genpname) using CMake+make + +* run genpname + (builds both pnames.icu and propname_data.h) +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource + +* build ICU (make install) +* build Unicode tools using CMake+make + +* update source/data/unidata/norm2/nfkc_cf.txt +- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt + +* update source/data/unidata/norm2/uts46.txt +- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt + to ~/svn.icu/tools/trunk/src/unicode/py +- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008". 
+- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py +- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..6.1: U+2260, U+226E, U+226F +- nothing new in 6.1, no test file to update + +* generate core properties data files +- in initial bootstrapping, change the UCA version + in source/data/unidata/FractionalUCA.txt to match the new Unicode version +- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU & tools + + if genrb fails to build coll/root.res with an U_INVALID_FORMAT_ERROR, + check if the UCA version in FractionalUCA.txt matches the new Unicode version + (see step above) +- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm: + ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU & tools + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ Unicode .icu files built to ./out/build/icudt49l + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b + ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* 
/tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr +- refresh ICU4J + ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + +* test ICU so far, fix test code where necessary +- temporarily ignore collation issues that look like UCA/UCD mismatches, + until UCA data is updated + +* UCA + +- get output from Mark's tools; look in + http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") +- update (ICU)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) +- check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out +- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani +- run makeuca.sh: + ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- note on 
intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +*** LayoutEngine script information + +(For details see the Unicode 5.2 change log below.) + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + (It also generates ScriptRunData.cpp, which is no longer needed.) + + The generated files have a current copyright date and "@draft" statement. + +- diff current <icu>/source/layout files vs. generated ones + ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + review and manually merge desired changes; + fix gratuitous changes, incorrect @draft and missing aliases; + Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. 
+- if you just copy the above files, then + fix mixed line endings, review the diffs as above and restore changes to API tags etc.; + manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +ICU 4.8 (no Unicode update, just new script codes) + +* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2010-12-21) + Afak 439 Afaka + Jurc 510 Jurchen + Mroo 199 Mro, Mru + Nshu 499 Nüshu + Shrd 319 Sharada, Śāradā + Sora 398 Sora Sompeng + Takr 321 Takri, Ṭākrī, Ṭāṅkrī + Tang 520 Tangut + Wole 480 Woleai + -> uscript.h + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2;\3 + -> genpname/SyntheticPropertyValueAliases.txt + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* run genpname/preparse.pl (on Linux) + + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname + + make sure that data.h is writable + + perl preparse.pl ~/svn.icu/trunk/src > out.txt + + preparse.pl shows no errors, out.txt Info and Warning lines look ok + +* rebuild Unicode tools (at least genpname) using make +- You might first need to "make install" ICU so that the tools build can pick + up the new definitions from the installed header files. 
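The find/replace pair listed above for turning uscript.h enum constants into com.ibm.icu.lang.UScript constants can also be tried outside of an IDE. The following is a minimal sketch using Python's re module with the same pattern and replacement; the function name and the input line are made up for illustration and are not copied from uscript.h.

```python
import re

# Same pattern and replacement as the find/replace pair in the log above.
FIND = r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)"
REPLACE = r"public static final int \1 = \2;\3"


def c_enum_to_java_constant(line):
    """Rewrite one C enum line into a Java constant declaration.

    Lines that do not match the pattern are returned unchanged.
    """
    return re.sub(FIND, REPLACE, line)


# Hypothetical input line in the style of a uscript.h enum entry.
example = "USCRIPT_EXAMPLE = 999,/* Exmp */"
converted = c_enum_to_java_constant(example)
print(converted)
```

Note that the pattern requires something after the comma (the trailing comment), so an enum entry without one is left as it is and needs to be converted by hand.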
+ +* run genpname + (builds both pnames.icu and propname_data.h) +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource +- rebuild ICU & tools + +* run genprops +- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 +- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 +- rebuild ICU & tools + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b + ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b + ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b +- refresh ICU4J + ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b + +* should have updated the layout engine script codes but forgot + +---------------------------------------------------------------------------- *** + +Unicode 6.0 update + +*** related ICU Trac tickets + +7264 Unicode 6.0 Update + +*** Unicode version numbers +- makedata.mak +- uchar.h + (configure.in & configure: have been modified to extract the version from uchar.h) +- com.ibm.icu.util.VersionInfo + +*** data files & enums & parser code + +* file preparation + 
~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed +- This now prepares both unidata and testdata files in respective output subfolders. + +* PropertyAliases.txt changes +- new Script_Extensions property defined in the new ScriptExtensions.txt file + but not listed in PropertyAliases.txt; reported to unicode.org; + -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt + scx; Script_Extensions + -> uchar.h with new UProperty section + -> com.ibm.icu.lang.UProperty, parallel with uchar.h + +* PropertyValueAliases.txt changes +- 12 new block names: + Alchemical_Symbols + Bamum_Supplement + Batak + Brahmi + CJK_Unified_Ideographs_Extension_D + Emoticons + Ethiopic_Extended_A + Kana_Supplement + Mandaic + Miscellaneous_Symbols_And_Pictographs + Playing_Cards + Transport_And_Map_Symbols + -> add to uchar.h + -> add to UCharacter.UnicodeBlock + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- Joining_Group (jg) values: + Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias + -> uchar.h & UCharacter.JoiningGroup +- 3 new scripts: + sc ; Batk ; Batak + sc ; Brah ; Brahmi + sc ; Mand ; Mandaic + -> remove these from SyntheticPropertyValueAliases.txt + -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2009-11-11..2010-07-18) + Bass 259 Bassa Vah + Dupl 755 Duployan shorthand + Elba 226 Elbasan + Gran 343 Grantha + Kpel 436 Kpelle + Loma 437 Loma + Mend 438 Mende + Merc 101 Meroitic Cursive + Narb 106 Old North Arabian + Nbat 159 Nabataean + Palm 126 Palmyrene + Sind 318 Sindhi + Wara 262 Warang Citi + -> uscript.h + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= 
([0-9]+),(.+) + replace public static final int \1 = \2;\3 + -> SyntheticPropertyValueAliases.txt + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- ISO 15924 name change + Mero 100 Meroitic Hieroglyphs (was Meroitic) + -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC +- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt + +* UnicodeData.txt changes +- new CJK block: + 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;; + 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;; + -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion + +* build Unicode tools using CMake+make + +* run genpname/preparse.pl (on Linux) + + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname + + make sure that data.h is writable + + perl preparse.pl ~/svn.icu/trunk/src > out.txt + + preparse.pl shows no errors, out.txt Info and Warning lines look ok + +* rebuild Unicode tools (at least genpname) using make +- You might first need to "make install" ICU so that the tools build can pick + up the new definitions from the installed header files. 
+ +* run genpname +- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in +- rebuild ICU & tools + +* update source/data/unidata/norm2/nfkc_cf.txt +- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt + +* update source/data/unidata/norm2/uts46.txt +- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt + to ~/svn.icu/tools/trunk/src/unicode/py +- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values +- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py +- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0: U+2260, U+226E, U+226F + +* generate core properties data files +- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU & tools +- run makeuca.sh so that genuca picks up the new nfc.nrm: + ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU & tools + +* implement new Script_Extensions property (provisional) +- parser & generator: genprops & uprops.icu +- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp +- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java + +* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2 +- (one-time change) +- genbidi/gencase/genprops tools changes +- re-run makeprops.sh (see above) +- UCharacterProperty.java, UCharacterTypeIterator.java, + UBiDiProps.java, UCaseProps.java, and several others with minor changes; + UCharacterPropertyReader.java deleted and its 
code folded into UCharacterProperty.java + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt45l + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b + echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b + ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr +- refresh ICU4J + ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + +* un-hardcode normalization skippable (NF*_Inert) test 
data +- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools + +* copy updated break iterator test files +- now handled by early ucdcopy.py and + copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata + (old instructions: + copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt + to ~/svn.icu/trunk/src/source/test/testdata) +- they are not used in ICU4J + +* UCA + +- get output from Mark's tools; look in + http://www.unicode.org/~book/incoming/mark/uca6.0.0/ + http://www.macchiato.com/unicode/utc/additional-uca-files + http://www.unicode.org/Public/UCA/6.0.0/ + http://www.unicode.org/~mdavis/uca/ +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt +- update Han-implicit ranges for new CJK extensions: + swapCJK() in ucol.cpp & ImplicitCEGenerator.java +- genuca: allow bytes 02 for U+FFFE, new merge-sort character; + do not add it into invuca so that tailoring primary-after an ignorable works +- genuca: permit space between [variable top] bytes +- ucol.cpp: treat noncharacters like unassigned rather than ignorable +- run makeuca.sh: + ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b +- update (ICU)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's 
Unicode tools +- run all tests with the *_SHORT.txt or the full files (the full ones have comments) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +*** LayoutEngine script information + +(For details see the Unicode 5.2 change log below.) + +* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, +ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates +ScriptRunData.cpp, which is no longer needed.) + +The generated files have a current copyright date and "@draft" statement. + +* copy the above files into <icu>/source/layout, replacing the old files. +* fix mixed line endings +* review the diffs and fix incorrect @draft and missing aliases; + Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. 
+* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h + +---------------------------------------------------------------------------- *** + +Unicode 5.2 update + +*** related ICU Trac tickets + +7084 Unicode 5.2 + +7167 verify collation bytes +7235 Java test NAME_ALIAS +7236 Java DerivedCoreProperties.txt test +7237 Java BidiTest.txt +7238 UTrie2 in core unidata +7239 test for tailoring gaps +7240 Java fix CollationMiscTest +7243 update layout engine for Unicode 5.2 + +*** Unicode version numbers +- makedata.mak +- uchar.h +- configure.in & configure +- update ucdVersion in gennames.c if an algorithmic range changes + +*** data files & enums & parser code + +* file preparation + +python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata +- includes finding files regardless of version numbers, + copying them, and performing the equivalent processing of the + ucdstrip and ucdmerge tools on the desired set of files + +* notes on changes +- PropertyAliases.txt + moved from numeric to enumerated: + ccc ; Canonical_Combining_Class + new string properties: + NFKC_CF ; NFKC_Casefold + Name_Alias; Name_Alias + new binary properties: + Cased ; Cased + CI ; Case_Ignorable + CWCF ; Changes_When_Casefolded + CWCM ; Changes_When_Casemapped + CWKCF ; Changes_When_NFKC_Casefolded + CWL ; Changes_When_Lowercased + CWT ; Changes_When_Titlecased + CWU ; Changes_When_Uppercased + new CJK Unihan properties (not supported by ICU) +- PropertyValueAliases.txt + new block names + new scripts + one script code change: + sc ; Qaai ; Inherited + -> + sc ; Zinh ; Inherited ; Qaai + new Line_Break (lb) value: + lb ; CP ; Close_Parenthesis + new Joining_Group (jg) values: Farsi_Yeh, Nya + other new values: + ccc; 214; ATA ; Attached_Above +- DerivedBidiClass.txt + new default-R range: U+1E800 - U+1EFFF +- UnicodeData.txt + all of the ISO comments are gone + new CJK block 
end: + 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last> + new CJK block: + 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;; + 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;; + +* genpname +- run preparse.pl + + cd \svn\icuproj\icu\trunk\source\tools\genpname + + make sure that data.h is writable + + perl preparse.pl \svn\icuproj\icu\trunk > out.txt + + preparse.pl complains with errors like the following: + Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34. + This is because ICU 4.0 had scripts from ISO 15924 which are now + added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt + and PropertyValueAliases.txt. + -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt: + Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt + + preparse.pl complains with errors about block names missing from uchar.h; add them + +* uchar.h & uscript.h & uprops.h & uprops.c & genprops +- new block & script values + + 26 new blocks + copy new blocks from Blocks.txt + MS VC++ 2008 regular expression: + find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$" + replace with " UBLOCK_\3 = 172, /*[\1]*/" + + several new script values already added in ICU 4.0 for ISO 15924 coverage + (removed from SyntheticPropertyValueAliases.txt, see genpname notes above) + + 3 new script values added for ISO 15924 and Unicode 5.2 coverage + + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2) + (added to SyntheticPropertyValueAliases.txt) +- new Joining Group (JG) values: Farsi_Yeh, Nya +- new Line_Break (lb) value: + lb ; CP ; Close_Parenthesis + +* hardcoded Unihan range end/limit +- Unihan range end moves from 9FC3 to 9FCB + search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive) + + do change gennames.c + +* Compare definitions of new binary properties with what we used to use + in algorithms, to see 
if the definitions changed. +- Verified that definitions for Cased and Case_Ignorable are unchanged. + The gencase tool now parses the newly public Case_Ignorable values + in case the definition changes in the future. + +* uchar.c & uprops.h & uprops.c & genprops +- new numeric values that didn't exist in Unicode data before: + 1/7, 1/9, 1/10, 3/10, 1/16, 3/16 + the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5, + therefore redesign the encoding of numeric types and values for formatVersion 6; + design for simple numbers up to at least 144 ("one gross"), + large values up to at least 10^20, + and fractions with numerators -1..17 and denominators 1..16 + to cover current and expected future values + (e.g., more Han numeric values, Meroitic twelfths) + +* reimplement Hangul_Syllable_Type for new Jamo characters +- the old code assumed that all Jamo characters are in the 11xx block +- Unicode 5.2 fills holes there and adds new Jamo characters in + A960..A97F; Hangul Jamo Extended-A + and in + D7B0..D7FF; Hangul Jamo Extended-B +- Hangul_Syllable_Type can be trivially derived from a subset of + Grapheme_Cluster_Break values + +* build Unicode data source code for hardcoding core data +C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data + +ICU data make path is \svn\icuproj\icu\trunk\source\data\ +ICU root path is \svn\icuproj\icu\trunk +Information: cannot find "ucmlocal.mk". Not building user-additional converter files. +Information: cannot find "brklocal.mk". Not building user-additional break iterator files. +Information: cannot find "reslocal.mk". Not building user-additional resource bundle files. +Information: cannot find "collocal.mk". Not building user-additional resource bundle files. +Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files. +Information: cannot find "trnslocal.mk". 
Not building user-additional transliterator files. +Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files. +Information: cannot find "spreplocal.mk". Not building user-additional stringprep files. +Creating data file for Unicode Property Names +Creating data file for Unicode Character Properties +Creating data file for Unicode Case Mapping Properties +Creating data file for Unicode BiDi/Shaping Properties +Creating data file for Unicode Normalization +Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l" +Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp" + +- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common + and rebuild the common library + +*** UCA + +- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools) +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools +- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools +[ Begin obsolete instructions: + Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files. 
+ - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py + on Windows: + python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt + python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt + End obsolete instructions] +- run all tests with the *_SHORT.txt or the full files (the full ones have comments) + not just the *_STUB.txt files +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +*** Implement Cased & Case_Ignorable properties +- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable() +- Problem: These properties should be disjoint, but aren't +- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not +- change ucase.icu to be able to store any combination of Cased and Case_Ignorable + +*** Implement Changes_When_Xyz properties +- without stored data + +*** Implement Name_Alias property +- add it as another name field in unames.icu +- make it available via u_charName() and UCharNameChoice and +- consider it in u_charFromName() + +*** Break iterators + +* Update break iterator rules to new UAX versions and new property values +* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary + +*** new BidiTest file +- review format and data +- copy BidiTest.txt to source/test/testdata +- write test code using this data +- fix ICU code where it fails the conformance test + +*** Java +- generally, find and update code corresponding to C/C++ +- UCharacter.UnicodeBlock constants: + a) add an _ID integer per new block, update COUNT + b) add a class instance per new block + Visual Studio regex: + find UBLOCK_{[^ ]+} = [0-9]+, {/.+} + 
replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias() + +- port test changes to Java + +*** LayoutEngine script information + +(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833) + +* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, +ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates +ScriptRunData.cpp, which is no longer needed.) + +The generated files have a current copyright date and "@draft" statement. + +-> Eric Mader wrote in email on 20090930: + "I think the tool has been modified to update @draft to @stable for + older scripts and to add @draft for new scripts. + (I worked with an intern on this last year.) + You should check the output after you run it." + +* copy the above files into <icu>/source/layout, replacing the old files. +* fix mixed line endings +* review the diffs and fix incorrect @draft and missing aliases +* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h + +Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp +and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) + +-> Eric Mader wrote in email on 20090930: + "This is just a matter of making sure that all the per-script tables have + entries for any new scripts that were added. + If any new Indic characters were added, then the class tables in + IndicClassTables.cpp should be updated to reflect this. + John Emmons should know how to do this if it's required." + +* rebuild the layout and layoutex libraries. 
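The "hardcoded Unihan range end/limit" step earlier in this Unicode 5.2 log searches the sources for the old range end 9FC3 and limit 9FC4 with the case-insensitive regex 9FC[34]. A minimal sketch of that search in Python follows. The function name and the sample input lines are hypothetical and only illustrate the matching; this is not a tool from the ICU tree.

```python
import re

# 9FC3 is the old Unihan range end and 9FC4 the old limit (Unicode 5.2 notes).
# The pattern mirrors the log's "regex 9FC[34], case-insensitive".
OLD_UNIHAN_END_OR_LIMIT = re.compile(r"9FC[34]", re.IGNORECASE)


def find_hardcoded_unihan(lines):
    """Return (line_number, text) pairs for lines that mention the old end/limit."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OLD_UNIHAN_END_OR_LIMIT.search(line)
    ]


# Hypothetical sample lines, standing in for the contents of a source file.
sample = ["0x4e00, 0x9fc3,", "no match here", "limit = 0x9FC4;"]
for number, text in find_hardcoded_unihan(sample):
    print(number, text)
```

To scan a real file, one could pass an open file object instead of the sample list. Each hit still needs manual review, since the 4.1 notes list several places (BOCU/BOCSU, GB 18030 data) that must not be changed.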
+ +*** Documentation +- Update User Guide + + Jamo_Short_Name, sfc->scf, binary property value aliases + +---------------------------------------------------------------------------- *** + +Unicode 5.1 update + +*** related ICU Trac tickets + +5696 Update to Unicode 5.1 + +*** Unicode version numbers +- makedata.mak +- uchar.h +- configure.in & configure +- update ucdVersion in gennames.c if an algorithmic range changes + +*** data files & enums & parser code + +* file preparation +- ucdstrip: + DerivedCoreProperties.txt + DerivedNormalizationProps.txt + NormalizationTest.txt + PropList.txt + Scripts.txt + GraphemeBreakProperty.txt + SentenceBreakProperty.txt + WordBreakProperty.txt +- ucdstrip and ucdmerge: + EastAsianWidth.txt + LineBreak.txt + +* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers) +copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\ +copy 5.1.0\ucd\Blocks.txt ..\unidata\ +copy 5.1.0\ucd\CaseFolding.txt ..\unidata\ +copy 5.1.0\ucd\DerivedAge.txt ..\unidata\ +copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\ +copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\ +copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\ +copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\ +copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\ +copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\ +copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\ +copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\ +copy 5.1.0\ucd\UnicodeData.txt ..\unidata\ + +ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt +ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt +ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt +ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt +ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt +ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt +ucdstrip 
< 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt +ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt +ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt +ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt + +* genpname +- run preparse.pl + + cd \svn\icuproj\icu\uni51\source\tools\genpname + + make sure that data.h is writable + + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt + + preparse.pl complains with errors like the following: + Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30. + This is because ICU 3.8 had scripts from ISO 15924 which are now + added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt + and PropertyValueAliases.txt. + -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt: + Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii + + PropertyValueAliases.txt now explicitly contains values for boolean properties: + N/Y, No/Yes, F/T, False/True + -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases. + It will use further values from the file if present. + +* uchar.h & uscript.h & uprops.h & uprops.c & genprops +- new block & script values + + 17 new blocks + + 11 new script values already added in ICU 3.8 for ISO 15924 coverage + (removed from SyntheticPropertyValueAliases.txt) + + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1) + (added to SyntheticPropertyValueAliases.txt) +- uprops.icu (uprops.h) only provides 7 bits for script codes. + In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now. + There is none above 127 yet which is the script code for an + assigned Unicode character, so ICU 4.0 uprops.icu does not store any + script code values greater than 127. 
+ However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129 + in a parallel bit field, and that overflows now. + Also, future values >=128 would be incompatible anyway. + uprops.h is modified to move around several of the bit fields + in the properties vector words, and now uses 8 bits for the script code. + Two other bit fields also grow to accommodate future growth: + Block (current count: 172) grows from 8 to 9 bits, + and Word_Break grows from 4 to 5 bits. +- renamed property Simple_Case_Folding (sfc->scf) + + nothing to be done: handled as normal alias +- new property JSN Jamo_Short_Name + + no new API: only contributes to the Name property +- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark +- new Joining Group (JG) value: Burushashki_Yeh_Barree +- new Sentence_Break (SB) values: + SB ; CR ; CR + SB ; EX ; Extend + SB ; LF ; LF + SB ; SC ; SContinue +- new Word_Break (WB) values: + WB ; CR ; CR + WB ; Extend ; Extend + WB ; LF ; LF + WB ; MB ; MidNumLet + +* Further changes in the 2008-02-29 update: +- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP + because they should not normally be invisible. +- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed) +- new Grapheme_Cluster_Break (GCB) value: PP=Prepend +- new Word_Break (WB) value: NL=Newline + +* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison) +- Unihan range end moves from 9FBB to 9FC3 + search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive) + + do change gennames.c + +* build Unicode data source code for hardcoding core data +C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data + +ICU data make path is \svn\icuproj\icu\uni51\source\data\ +ICU root path is \svn\icuproj\icu\uni51 +Information: cannot find "ucmlocal.mk". Not building user-additional converter files. 
+Information: cannot find "brklocal.mk". Not building user-additional break iterator files. +Information: cannot find "reslocal.mk". Not building user-additional resource bundle files. +Information: cannot find "collocal.mk". Not building user-additional resource bundle files. +Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files. +Information: cannot find "trnslocal.mk". Not building user-additional transliterator files. +Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files. +Creating data file for Unicode Character Properties +Creating data file for Unicode Case Mapping Properties +Creating data file for Unicode BiDi/Shaping Properties +Creating data file for Unicode Normalization +Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l" +Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp" + +- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common + and rebuild the common library + +*** Break iterators + +* Update break iterator rules to new UAX versions and new property values + +*** UCA + +* update FractionalUCA.txt and UCARules.txt with new canonical closure + +*** Test suites +- Test that APIs using Unicode property value aliases (like UnicodeSet) + support all of the boolean values N/Y, No/Yes, F/T, False/True + -> TestBinaryValues() tests in both cintltst and intltest + +*** LayoutEngine script information +* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, +ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates +ScriptRunData.cpp, which is no longer needed.) + +The generated files have a current copyright date and "@draft" statement. + +* copy the above files into <icu>/source/layout, replacing the old files.
+ +Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp +and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) + +* rebuild the layout and layoutex libraries. + +*** Documentation +- Update User Guide + + Jamo_Short_Name, sfc->scf, binary property value aliases + +---------------------------------------------------------------------------- *** + +Unicode 5.0 update + +*** related Jitterbugs + +5084 RFE: Update to Unicode 5.0 + +*** data files & enums & parser code + +* file preparation +- ucdstrip: + DerivedCoreProperties.txt + DerivedNormalizationProps.txt + NormalizationTest.txt + PropList.txt + Scripts.txt + GraphemeBreakProperty.txt + SentenceBreakProperty.txt + WordBreakProperty.txt +- ucdstrip and ucdmerge: + EastAsianWidth.txt + LineBreak.txt + +* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers) +copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\ +copy 5.0.0\ucd\Blocks.txt ..\unidata\ +copy 5.0.0\ucd\CaseFolding.txt ..\unidata\ +copy 5.0.0\ucd\DerivedAge.txt ..\unidata\ +copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\ +copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\ +copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\ +copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\ +copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\ +copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\ +copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\ +copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\ +copy 5.0.0\ucd\UnicodeData.txt ..\unidata\ + +ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt +ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt +ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt +ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt +ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt 
+ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt +ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt +ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt +ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt +ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt + +* update FractionalUCA.txt and UCARules.txt with new canonical closure + +* genpname +- run preparse.pl + + make sure that data.h is writable + + perl preparse.pl \cvs\oss\icu > out.txt + +* uchar.h & uscript.h & uprops.h & uprops.c & genprops +- new block & script values + + script values already added in ICU 3.6 because all of ISO 15924 is now covered + +* build Unicode data source code for hardcoding core data +C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data + +ICU data make path is \cvs\oss\icu\source\data\ +ICU root path is \cvs\oss\icu +Information: cannot find "ucmlocal.mk". Not building user-additional converter files. +[etc.] +Creating data file for Unicode Character Properties +Creating data file for Unicode Case Mapping Properties +Creating data file for Unicode BiDi/Shaping Properties +Creating data file for Unicode Normalization +Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l" +Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp" + +- copy the .c source files to C:\cvs\oss\icu\source\common + and rebuild the common library + +*** Unicode version numbers +- makedata.mak +- uchar.h +- configure.in + +*** LayoutEngine script information +* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, +ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates +ScriptRunData.cpp, which is no longer needed.)
+ +The generated files have a current copyright date and "@draft" statement. + +* copy the above files into <icu>/source/layout, replacing the old files. + +Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp +and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) + +* rebuild the layout and layoutex libraries. + +---------------------------------------------------------------------------- *** + +Unicode 4.1 update + +*** related Jitterbugs + +4332 RFE: Update to Unicode 4.1 +4157 RBBI, TR29 4.1 updates + +*** data files & enums & parser code + +* file preparation +- ucdstrip: + DerivedCoreProperties.txt + DerivedNormalizationProps.txt + NormalizationTest.txt + GraphemeBreakProperty.txt + SentenceBreakProperty.txt + WordBreakProperty.txt +- ucdstrip and ucdmerge: + EastAsianWidth.txt + LineBreak.txt + +* add new files to the repository + GraphemeBreakProperty.txt + SentenceBreakProperty.txt + WordBreakProperty.txt + +* update FractionalUCA.txt and UCARules.txt with new canonical closure + +* genpname +- handle new enumerated properties in sub read_uchar +- run preparse.pl + +* uchar.h & uscript.h & uprops.h & uprops.c & genprops +- new binary properties + + Pattern_Syntax + + Pattern_White_Space +- new enumerated properties + + Grapheme_Cluster_Break + + Sentence_Break + + Word_Break +- new block & script & line break values + +* gencase +- case-ignorable changes + see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods + now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk + +*** Unicode version numbers +- makedata.mak +- uchar.h +- configure.in + +*** tests +- verify that u_charMirror() round-trips +- test all new properties and some new values of old properties + +*** other code + +* hardcoded Unihan range end/limit +- Unihan range end moves from 9FA5 to 9FBB + search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive) + + do not modify 
BOCU/BOCSU code because that would change the encoding + and break binary compatibility! + + similarly, do not change the GB 18030 range data (ucnvmbcs.c), + NamePrepProfile.txt + + ignore trietest.c: test data is arbitrary + + ignore tstnorm.cpp: test optimization, not important + + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF + + do change line_th.txt and word_th.txt + by replacing hardcoded ranges with the new property values + + do change gennames.c + +source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 +source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 +source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5, + +* case mappings +- compare new special casing context conditions with previous ones + see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods + +* genpname +- consider storing only the short name if it is the same as the long name + +*** other reviews +- UAX #29 changes (grapheme/word/sentence breaks) +- UAX #14 changes (line breaks) +- Pattern_Syntax & Pattern_White_Space + +---------------------------------------------------------------------------- *** + +Unicode 4.0.1 update + +*** related Jitterbugs + +3170 RFE: Update to Unicode 4.0.1 +3171 Add new Unicode 4.0.1 properties +3520 use Unicode 4.0.1 updates for break iteration + +*** data files & enums & parser code + +* file preparation +- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt +- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt + +* file fixes +- fix UnicodeData.txt general categories of Ethiopic digits Nd->No + according to PRI #26 + http://www.unicode.org/review/resolved-pri.html#pri26 +- undone again because no corrigendum in sight; + instead modified tests to not check consistency on this for Unicode 4.0.1 + +* ucdterms.txt +- update from http://www.unicode.org/copyright.html + 
formatted for plain text + +* uchar.h & uprops.h & uprops.c & genprops +- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed +- add U_LB_INSEPARABLE due to a spelling fix + + put short name comment only on line with new constant + for genpname perl script parser +- new binary properties + + STerm + + Variation_Selector + +* genpname +- fix genpname perl script so that it doesn't choke on more than 2 names per property value +- perl script: correctly calculate the maximum number of fields per row + +* uscript.h +- new script code Hrkt=Katakana_Or_Hiragana + +* gennorm.c track changes in DerivedNormalizationProps.txt +- "FNC" -> "FC_NFKC" +- single field "NFD_NO" -> two fields "NFD_QC; N" etc. + +* genprops/props2.c track changes in DerivedNumericValues.txt +- changed from 3 columns to 2, dropping the numeric type + + assume that the type is always numeric for Han characters, + and that only those are added in addition to what UnicodeData.txt lists + +*** Unicode version numbers +- makedata.mak +- uchar.h +- configure.in + +*** tests +- update test of default bidi classes according to PRI #28 + /tsutil/cucdtst/TestUnicodeData + http://www.unicode.org/review/resolved-pri.html#pri28 +- bidi tests: change exemplar character for ES depending on Unicode version +- change hardcoded expected property values where they change + +*** other code + +* name matching +- read UCD.html + +* scripts +- use new Hrkt=Katakana_Or_Hiragana + +* ZWJ & ZWNJ +- are now part of combining character sequences +- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ |
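The gennorm.c note earlier in this Unicode 4.0.1 log records that DerivedNormalizationProps.txt changed from a single field such as "NFD_NO" to two fields such as "NFD_QC; N". A minimal Python sketch of that mapping follows. The function name is hypothetical, and the handling of "MAYBE" as "M" is an assumption drawn from the log's "etc."; this is not the gennorm.c parser itself.

```python
def split_quick_check_field(old_field):
    """Map an old single field like 'NFD_NO' to the two-field form ('NFD_QC', 'N')."""
    form, separator, value = old_field.partition("_")
    # Assumption: the old values are NO and MAYBE, shortened to N and M.
    if not separator or value not in ("NO", "MAYBE"):
        raise ValueError("unexpected quick-check field: %r" % (old_field,))
    return form + "_QC", value[0]


# The example pair quoted in the log: "NFD_NO" -> "NFD_QC; N".
print("; ".join(split_quick_check_field("NFD_NO")))
```

A parser that must read both old and new files could apply this only when a line carries the single-field form, and otherwise read the property and value fields directly.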