authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-19 01:47:29 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-19 01:47:29 +0000
commit0ebf5bdf043a27fd3dfb7f92e0cb63d88954c44d (patch)
treea31f07c9bcca9d56ce61e9a1ffd30ef350d513aa /intl/icu/source/data/unidata/changes.txt
parentInitial commit. (diff)
Adding upstream version 115.8.0esr (upstream/115.8.0esr)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'intl/icu/source/data/unidata/changes.txt')
-rw-r--r-- intl/icu/source/data/unidata/changes.txt | 5547
1 file changed, 5547 insertions(+), 0 deletions(-)
diff --git a/intl/icu/source/data/unidata/changes.txt b/intl/icu/source/data/unidata/changes.txt
new file mode 100644
index 0000000000..9345f26bf5
--- /dev/null
+++ b/intl/icu/source/data/unidata/changes.txt
@@ -0,0 +1,5547 @@
+* Copyright (C) 2016 and later: Unicode, Inc. and others.
+* License & terms of use: http://www.unicode.org/copyright.html
+* Copyright (C) 2004-2016, International Business Machines
+* Corporation and others. All Rights Reserved.
+*
+* file name: changes.txt
+* encoding: US-ASCII
+* tab size: 8 (not used)
+* indentation:4
+*
+* created on: 2004may06
+* created by: Markus W. Scherer
+
+* change log for Unicode updates
+
+For an overview, see https://unicode-org.github.io/icu/processes/unicode-update
+
+Notes:
+
+This log includes several command lines as used in the update process.
+Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
+Use a console window that is set to that directory, or cd to there,
+and then paste the command that follows the $ sign.
+
+Most command lines use environment variables to make them more portable across versions
+and machine configurations. When you set up a console window, copy & paste the `export` commands
+from near the top of the current section before pasting tool command lines.
+Adjust the environment variables to the current version and your machine setup.
+(The command lines are currently as used on Linux.)
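+As a minimal sketch of that convention (all paths and versions here are
+hypothetical; adjust them to your machine and to the current release):

```shell
# Hypothetical values; adjust to the current version and your machine.
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt73b
# A step written as "~/icu/uni/src$ git status" then means:
#   cd "$ICU_SRC" && git status
echo "$ICU_SRC"
```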
+
+---------------------------------------------------------------------------- ***
+
+* New ISO 15924 script codes
+
+Normally, add new script codes as part of a Unicode update.
+See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
+and see the change logs below.
+
+---------------------------------------------------------------------------- ***
+
+CLDR 43 root collation update for ICU 73
+
+Partial update only for the root collation.
+See
+- https://unicode-org.atlassian.net/browse/CLDR-15946
+ Treat quote marks as equivalent when strength=UCOL_PRIMARY
+- https://github.com/unicode-org/cldr/pull/2691
+ CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
+- https://github.com/unicode-org/cldr/pull/2833
+ CLDR-15946 make fancy quotes secondary-different from each other
+
+The related changes to tailorings were already integrated in an earlier PR for
+https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.
+
+This update is for the root collation,
+which is handled by different tools than the locale data updates.
+
+* Command-line environment setup
+
+export UNICODE_DATA=~/unidata/uni15/20220830
+export CLDR_SRC=~/cldr/uni/src
+export ICU_ROOT=~/icu/uni
+export ICU_SRC=$ICU_ROOT/src
+export ICUDT=icudt73b
+export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Configure: Build Unicode data for ICU4J
+ cd $ICU_ROOT/dbg/icu4c
+ ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
+
+* Bazel build process
+
+See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
+for an overview and for setup instructions.
+
+Consider running `bazelisk --version` outside of the $ICU_SRC folder
+to find out the latest `bazel` version, and
+copying that version number into the $ICU_SRC/.bazeliskrc config file.
+(Revert if you find incompatibilities, or, better, update our build & config files.)
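+A hedged sketch of that version pin, assuming `bazelisk` is on the PATH and
+prints a line like "bazel 6.1.1" (the output format and the
+USE_BAZEL_VERSION config key are assumptions about bazelisk's behavior):

```shell
# Sketch: pin the bazel version that bazelisk currently reports.
# Run bazelisk outside of $ICU_SRC so the existing .bazeliskrc is ignored.
ver=$(cd /tmp && bazelisk --version | sed 's/^bazel //')
printf 'USE_BAZEL_VERSION=%s\n' "$ver" > "$ICU_SRC/.bazeliskrc"
```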
+
+* generate data files
+
+- remember to define the environment variables
+ (see the start of the section for this Unicode version)
+- cd $ICU_SRC
+- optional, usually not needed:

+ bazelisk clean
+ or even
+ bazelisk clean --expunge
+- build/bootstrap/generate new files:
+ icu4c/source/data/unidata/generate.sh
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools,
+ and a tool-tailored version goes into CLDR, see
+ https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- generate data files, as above (generate.sh), now to pick up new collation data
+- rebuild ICU4C (make clean, make check, as usual)
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
+ you need to reconfigure with unicore data; see the "configure" line above.
+ output:
+ ...
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+- new for ICU 73: also copy the binary data files directly into the ICU4J tree
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** merge the Unicode update branch back onto the main branch
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- if there is a merge conflict in icudata.jar, here is one way to deal with it:
+ + remove icudata.jar from the commit so that rebasing is trivial
+ + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
+ + ~/icu/uni/src$ git commit -a --amend
+ + switch to main, pull updates, switch back to the dev branch
+ + ~/icu/uni/src$ git rebase main
+ + rebuild icudata.jar
+ + ~/icu/uni/src$ git commit -a --amend
+ + ~/icu/uni/src$ git push -f
+- make sure that changes to Unicode tools are checked in:
+ https://github.com/unicode-org/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Unicode 15.0 update for ICU 72
+
+https://www.unicode.org/versions/Unicode15.0.0/
+https://www.unicode.org/versions/beta-15.0.0.html
+https://www.unicode.org/Public/15.0.0/ucd/
+https://www.unicode.org/reports/uax-proposed-updates.html
+https://www.unicode.org/reports/tr44/tr44-29.html
+
+https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
+https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
+https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)
+
+* Command-line environment setup
+
+export UNICODE_DATA=~/unidata/uni15/20220830
+export CLDR_SRC=~/cldr/uni/src
+export ICU_ROOT=~/icu/uni
+export ICU_SRC=$ICU_ROOT/src
+export ICUDT=icudt72b
+export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+ cd $ICU_ROOT/dbg/icu4c
+ ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
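+As a hedged helper for the list above, the version constants can be located
+with grep (U_UNICODE_VERSION is the uchar.h macro; the ICU4J file path is an
+assumption about the tree layout):

```shell
# Sketch: find the Unicode version constants that need updating.
grep -n 'U_UNICODE_VERSION' "$ICU_SRC/icu4c/source/common/unicode/uchar.h"
grep -n 'UNICODE_VERSION' \
  "$ICU_SRC/icu4j/main/classes/core/src/com/ibm/icu/util/VersionInfo.java"
```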
+
+*** data files & enums & parser code
+
+* download files
+- same as for the early Unicode Tools setup and data refresh:
+ https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
+ https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
+- mkdir -p $UNICODE_DATA
+- download Unicode files into $UNICODE_DATA
+ + subfolders: emoji, idna, security, ucd, uca
+ + old way of fetching files: from the "Public" area on unicode.org
+ ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+ ~ split Unihan into single-property files
+ ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
+ + new way of fetching files, if available:
+ copy the files from a Unicode Tools workspace that is up to date with
+ https://github.com/unicode-org/unicodetools
+ and which might at this point be *ahead* of "Public"
+ ~ before the Unicode release copy files from "dev" subfolders, for example
+ https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
+ + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
+ or from the UCD/cldr/ output folder of the Unicode Tools:
+      Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
+ cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
+ or
+ cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
+
+* for manual diffs and for Unicode Tools input data updates:
+ remove version suffixes from the file names
+ ~$ unidata/desuffixucd.py $UNICODE_DATA
+ (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)
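+Illustrative sketch of what the de-suffixing amounts to (hypothetical file
+names and a simplified pattern; the real tool may handle more cases):

```shell
# Strip version suffixes like "-15.0.0" or "-15.0.0d3" from UCD file names.
strip_suffix() {
  echo "$1" | sed -E 's/-[0-9]+\.[0-9]+\.[0-9]+(d[0-9]+)?//'
}
strip_suffix 'Blocks-15.0.0.txt'          # -> Blocks.txt
strip_suffix 'DerivedAge-15.0.0d3.txt'    # -> DerivedAge.txt
```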
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+
+* new constants for new property values
+- preparseucd.py error:
+ ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
+ = PropertyValueAliases.txt new property values (diff old & new .txt files)
+ ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
+ +age; 15.0 ; V15_0
+ +blk; Arabic_Ext_C ; Arabic_Extended_C
+ +blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H
+ +blk; Cyrillic_Ext_D ; Cyrillic_Extended_D
+ +blk; Devanagari_Ext_A ; Devanagari_Extended_A
+ +blk; Kaktovik_Numerals ; Kaktovik_Numerals
+ +blk; Kawi ; Kawi
+ +blk; Nag_Mundari ; Nag_Mundari
+ +sc ; Kawi ; Kawi
+ +sc ; Nagm ; Nag_Mundari
+ -> add new blocks to uchar.h before UBLOCK_COUNT
+ use long property names for enum constants,
+ for the trailing comment get the block start code point: diff old & new Blocks.txt
+ ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
+ +10EC0..10EFF; Arabic Extended-C
+ +11B00..11B5F; Devanagari Extended-A
+ +11F00..11F5F; Kawi
+ -13430..1343F; Egyptian Hieroglyph Format Controls
+ +13430..1345F; Egyptian Hieroglyph Format Controls
+ +1D2C0..1D2DF; Kaktovik Numerals
+ +1E030..1E08F; Cyrillic Extended-D
+ +1E4D0..1E4FF; Nag Mundari
+ +31350..323AF; CJK Unified Ideographs Extension H
+ (ignore blocks whose end code point changed)
+ -> add new blocks to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add new blocks to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+ -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
+ Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
+ replace public static final int \1 = \2; \3
+ -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
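+The same rewrite can be scripted instead of done in Eclipse; a hedged sed
+sketch for the block-ID case (the input line shape is an assumption about
+uchar.h formatting, and the sample line is made up):

```shell
# Turn a uchar.h block enum line into the corresponding Java constant.
echo 'UBLOCK_KAWI = 321, /*[11F00]*/' |
  sed -E 's/UBLOCK_([A-Z0-9_]+) *= *([0-9]+), *(.*)/public static final int \1_ID = \2; \3/'
# -> public static final int KAWI_ID = 321; /*[11F00]*/
```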
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
+
+* build ICU
+ to make sure that there are no syntax errors
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
+
+* update spoof checker UnicodeSet initializers:
+ inclusionPat & recommendedPat in i18n/uspoof.cpp
+ INCLUSION & RECOMMENDED in SpoofChecker.java
+- make sure that the Unicode Tools tree contains the latest security data files
+- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
+- run the tool (no special environment variables needed)
+- copy & paste from the Console output into the .cpp & .java files
+
+* Bazel build process
+
+See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
+for an overview and for setup instructions.
+
+Consider running `bazelisk --version` outside of the $ICU_SRC folder
+to find out the latest `bazel` version, and
+copying that version number into the $ICU_SRC/.bazeliskrc config file.
+(Revert if you find incompatibilities, or, better, update our build & config files.)
+
+* generate data files
+
+- remember to define the environment variables
+ (see the start of the section for this Unicode version)
+- cd $ICU_SRC
+- optional, usually not needed:
+ bazelisk clean
+- build/bootstrap/generate new files:
+ icu4c/source/data/unidata/generate.sh
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+ ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
+- Unicode 6.0..15.0: U+2260, U+226E, U+226F
+- nothing new in this Unicode version, no test file to update
+
+* run & fix ICU4C tests
+- Note: Some of the collation data and test data will be updated below,
+ so at this time we might get some collation test failures.
+ Ignore these for now.
+- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
+ (no rule changes in Unicode 15)
+- update CLDR GraphemeBreakTest.txt
+ cd ~/unitools/mine/Generated
+ cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
+ cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
+ cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
+- Andy helps with RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools,
+ and a tool-tailored version goes into CLDR, see
+ https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- generate data files, as above (generate.sh), now to pick up new collation data
+- update CollationFCD.java:
+ copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+- rebuild ICU4C (make clean, make check, as usual)
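+A hedged helper for the copy & paste step above (the initializer pattern is
+an assumption about collationfcd.cpp's formatting):

```shell
# Sketch: print the lcccIndex[] initializer (up to its closing "};") from
# the C++ source so it can be pasted into CollationFCD.java.
sed -n '/lcccIndex\[/,/};/p' \
  "$ICU_SRC/icu4c/source/i18n/collationfcd.cpp"
```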
+
+* Unihan collators
+ https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
+- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
+ check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
+- generate ICU zh collation data
+ instructions inspired by
+ https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
+ https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
+ + setup:
+ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
+ (didn't work without setting JAVA_HOME,
+ nor with the Google default of /usr/local/buildtools/java/jdk
+ [Google security limitations in the XML parser])
+ export TOOLS_ROOT=~/icu/uni/src/tools
+ export CLDR_DIR=~/cldr/uni/src
+ export CLDR_DATA_DIR=~/cldr/uni/src
+ (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
+ cd "$TOOLS_ROOT/cldr/lib"
+ ./install-cldr-jars.sh "$CLDR_DIR"
+ + generate the files we need
+ cd "$TOOLS_ROOT/cldr/cldr-to-icu"
+ ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
+ + diff
+ cd $ICU_SRC
+ meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
+ meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
+ + copy into the source tree
+ cd $ICU_SRC
+ cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
+ cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
+ you need to reconfigure with unicore data; see the "configure" line above.
+ output:
+ ...
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** CLDR numbering systems
+- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
+ for example:
+ ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
+ ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
+ ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
+ -->
+ +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
+ +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
+ or:
+ ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
+ -->
+ +11F50..11F59 ; Nd # [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
+ +1E4F0..1E4F9 ; Nd # [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
+ Unicode 15:
+ kawi 11F50..11F59 Kawi
+ nagm 1E4F0..1E4F9 Nag Mundari
+ https://github.com/unicode-org/cldr/pull/2041
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- if there is a merge conflict in icudata.jar, here is one way to deal with it:
+ + remove icudata.jar from the commit so that rebasing is trivial
+ + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
+ + ~/icu/uni/src$ git commit -a --amend
+ + switch to main, pull updates, switch back to the dev branch
+ + ~/icu/uni/src$ git rebase main
+ + rebuild icudata.jar
+ + ~/icu/uni/src$ git commit -a --amend
+ + ~/icu/uni/src$ git push -f
+- make sure that changes to Unicode tools are checked in:
+ https://github.com/unicode-org/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Unicode 14.0 update for ICU 70
+
+https://www.unicode.org/versions/Unicode14.0.0/
+https://www.unicode.org/versions/beta-14.0.0.html
+https://www.unicode.org/Public/14.0.0/ucd/
+https://www.unicode.org/reports/uax-proposed-updates.html
+https://www.unicode.org/reports/tr44/tr44-27.html
+
+https://unicode-org.atlassian.net/browse/CLDR-14801
+https://unicode-org.atlassian.net/browse/ICU-21635
+
+* Command-line environment setup
+
+export UNICODE_DATA=~/unidata/uni14/20210903
+export CLDR_SRC=~/cldr/uni/src
+export ICU_ROOT=~/icu/uni
+export ICU_SRC=$ICU_ROOT/src
+export ICUDT=icudt70b
+export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+ cd $ICU_ROOT/dbg/icu4c
+ ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
+
+*** data files & enums & parser code
+
+* download files
+- same as for the early Unicode Tools setup and data refresh:
+ https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
+ https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
+- mkdir -p $UNICODE_DATA
+- download Unicode files into $UNICODE_DATA
+ + subfolders: emoji, idna, security, ucd, uca
+ + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+ + split Unihan into single-property files
+ ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
+ + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
+ or from the UCD/cldr/ output folder of the Unicode Tools:
+      Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
+ cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
+ or
+ cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
+
+* for manual diffs and for Unicode Tools input data updates:
+ remove version suffixes from the file names
+ ~$ unidata/desuffixucd.py $UNICODE_DATA
+ (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+
+* new constants for new property values
+- preparseucd.py error:
+ ValueError: missing uchar.h enum constants for some property values:
+ [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
+ (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
+ (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
+ = PropertyValueAliases.txt new property values (diff old & new .txt files)
+ ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
+ +age; 14.0 ; V14_0
+ +blk; Arabic_Ext_B ; Arabic_Extended_B
+ +blk; Cypro_Minoan ; Cypro_Minoan
+ +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B
+ +blk; Kana_Ext_B ; Kana_Extended_B
+ +blk; Latin_Ext_F ; Latin_Extended_F
+ +blk; Latin_Ext_G ; Latin_Extended_G
+ +blk; Old_Uyghur ; Old_Uyghur
+ +blk; Tangsa ; Tangsa
+ +blk; Toto ; Toto
+ +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
+ +blk; Vithkuqi ; Vithkuqi
+ +blk; Znamenny_Music ; Znamenny_Musical_Notation
+ +jg ; Thin_Yeh ; Thin_Yeh
+ +jg ; Vertical_Tail ; Vertical_Tail
+ +sc ; Cpmn ; Cypro_Minoan
+ +sc ; Ougr ; Old_Uyghur
+ +sc ; Tnsa ; Tangsa
+ +sc ; Toto ; Toto
+ +sc ; Vith ; Vithkuqi
+ -> add new blocks to uchar.h before UBLOCK_COUNT
+ use long property names for enum constants,
+ for the trailing comment get the block start code point: diff old & new Blocks.txt
+ ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
+ +0870..089F; Arabic Extended-B
+ +10570..105BF; Vithkuqi
+ +10780..107BF; Latin Extended-F
+ +10F70..10FAF; Old Uyghur
+ -11700..1173F; Ahom
+ +11700..1174F; Ahom
+ +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
+ +12F90..12FFF; Cypro-Minoan
+ +16A70..16ACF; Tangsa
+ -18D00..18D8F; Tangut Supplement
+ +18D00..18D7F; Tangut Supplement
+ +1AFF0..1AFFF; Kana Extended-B
+ +1CF00..1CFCF; Znamenny Musical Notation
+ +1DF00..1DFFF; Latin Extended-G
+ +1E290..1E2BF; Toto
+ +1E7E0..1E7FF; Ethiopic Extended-B
+ (ignore blocks whose end code point changed)
+ -> add new blocks to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add new blocks to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+ -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
+ Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
+ replace public static final int \1 = \2; \3
+ -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+ -> add new joining groups to uchar.h & UCharacter.JoiningGroup
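The Eclipse find/replace patterns above are ordinary regular expressions;
for reference, the same two transforms via Python's re.sub on a hypothetical
uchar.h enum line (the numeric value is illustrative, not the real one):

```python
import re

# A uchar.h-style enum line (sample value is illustrative only).
line = "UBLOCK_TANGSA = 372, /*[16A70]*/"

# -> UCharacter.UnicodeBlock ID constant
id_const = re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \1_ID = \2; \3", line)

# -> UCharacter.UnicodeBlock object
block_obj = re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                   r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                   line)

print(id_const)   # public static final int TANGSA_ID = 372; /*[16A70]*/
print(block_obj)  # public static final UnicodeBlock TANGSA = new UnicodeBlock("TANGSA", TANGSA_ID); /*[16A70]*/
```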
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
+
+* build ICU
+ to make sure that there are no syntax errors
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
+
+* update spoof checker UnicodeSet initializers:
+ inclusionPat & recommendedPat in i18n/uspoof.cpp
+ INCLUSION & RECOMMENDED in SpoofChecker.java
+- make sure that the Unicode Tools tree contains the latest security data files
+- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
+- run the tool (no special environment variables needed)
+- copy & paste from the Console output into the .cpp & .java files
+
+* Bazel build process
+
+See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
+for an overview and for setup instructions.
+
+Consider running `bazelisk --version` outside of the $ICU_SRC folder
+to find out the latest `bazel` version, and
+copying that version number into the $ICU_SRC/.bazeliskrc config file.
+(Revert if you find incompatibilities, or, better, update our build & config files.)
+
+* generate data files
+
+- remember to define the environment variables
+ (see the start of the section for this Unicode version)
+- cd $ICU_SRC
+- optional, usually not necessary:
+ bazelisk clean
+- build/bootstrap/generate new files:
+ icu4c/source/data/unidata/generate.sh
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..14.0: U+2260, U+226E, U+226F
+- nothing new in this Unicode version, no test file to update
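A sketch of that grep as a Python filter, assuming IdnaMappingTable.txt's
"cp[..cp] ; status [; mapping] # comment" line layout (the sample lines below
are illustrative):

```python
def non_ascii_std3_valid(lines):
    # Yield IdnaMappingTable.txt lines whose status is disallowed_STD3_valid
    # and whose (first) code point is outside ASCII.
    for line in lines:
        if "disallowed_STD3_valid" not in line:
            continue
        first = line.split(";")[0].strip().split("..")[0]
        try:
            if int(first, 16) > 0x7F:
                yield line
        except ValueError:
            pass

# Illustrative sample lines:
sample = [
    "003D          ; disallowed_STD3_valid                  # EQUALS SIGN",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "0041          ; mapped                 ; 0061          # LATIN CAPITAL LETTER A",
]
hits = list(non_ascii_std3_valid(sample))  # only the U+2260 line survives
```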
+
+* run & fix ICU4C tests
+- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
+- update CLDR GraphemeBreakTest.txt
+ cd ~/unitools/mine/Generated
+ cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
+ cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
+ cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
+- Andy helps with RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools,
+ and a tool-tailored version goes into CLDR, see
+ https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- generate data files, as above (generate.sh), now to pick up new collation data
+- update CollationFCD.java:
+ copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+- rebuild ICU4C (make clean, make check, as usual)
+
+* Unihan collators
+ https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
+- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
+ check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
+- generate ICU zh collation data
+ instructions inspired by
+ https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
+ https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
+ + setup:
+ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
+ (didn't work without setting JAVA_HOME,
+ nor with the Google default of /usr/local/buildtools/java/jdk
+ [Google security limitations in the XML parser])
+ export TOOLS_ROOT=~/icu/uni/src/tools
+ export CLDR_DIR=~/cldr/uni/src
+ export CLDR_DATA_DIR=~/cldr/uni/src
+      (pointing to the "raw" data, not cldr-staging/.../production, should be ok for the relevant files)
+ cd "$TOOLS_ROOT/cldr/lib"
+ ./install-cldr-jars.sh "$CLDR_DIR"
+ + generate the files we need
+ cd "$TOOLS_ROOT/cldr/cldr-to-icu"
+ ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
+ + diff
+ cd $ICU_SRC
+ meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
+ meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
+ + copy into the source tree
+ cd $ICU_SRC
+ cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
+ cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
+ you need to reconfigure with unicore data; see the "configure" line above.
+ output:
+ ...
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** CLDR numbering systems
+- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
+ for example:
+ ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
+ ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
+ ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
+ -->
+ +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
+ Unicode 14:
+ tnsa 16AC0..16AC9 Tangsa
+ https://github.com/unicode-org/cldr/pull/1326
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools are checked in:
+ https://github.com/unicode-org/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Unicode 13.0 update for ICU 66
+
+https://www.unicode.org/versions/Unicode13.0.0/
+https://www.unicode.org/versions/beta-13.0.0.html
+https://www.unicode.org/Public/13.0.0/ucd/
+https://www.unicode.org/reports/uax-proposed-updates.html
+https://www.unicode.org/reports/tr44/tr44-25.html
+
+https://unicode-org.atlassian.net/browse/CLDR-13387
+https://unicode-org.atlassian.net/browse/ICU-20893
+
+* Command-line environment setup
+
+UNICODE_DATA=~/unidata/uni13/20200212
+CLDR_SRC=~/cldr/uni/src
+ICU_ROOT=~/icu/uni
+ICU_SRC=$ICU_ROOT/src
+ICUDT=icudt66b
+ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+ cd $ICU_ROOT/dbg/icu4c
+ ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
+
+*** data files & enums & parser code
+
+* download files
+- mkdir -p $UNICODE_DATA
+- download Unicode files into $UNICODE_DATA
+ + subfolders: emoji, idna, security, ucd, uca
+ + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+ + split Unihan into single-property files
+ ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
+ + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
+ or from the ucd/cldr/ output folder of the Unicode Tools:
+    Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
+ cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
+
+* for manual diffs and for Unicode Tools input data updates:
+ remove version suffixes from the file names
+ ~$ unidata/desuffixucd.py $UNICODE_DATA
+ (see https://sites.google.com/site/unicodetools/inputdata)
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+
+* new constants for new property values
+- preparseucd.py error:
+ ValueError: missing uchar.h enum constants for some property values:
+ [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
+ u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
+ (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
+ (u'InPC', set([u'Top_And_Bottom_And_Left']))]
+ = PropertyValueAliases.txt new property values (diff old & new .txt files)
+ blk; Chorasmian ; Chorasmian
+ blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G
+ blk; Dives_Akuru ; Dives_Akuru
+ blk; Khitan_Small_Script ; Khitan_Small_Script
+ blk; Lisu_Sup ; Lisu_Supplement
+ blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
+ blk; Tangut_Sup ; Tangut_Supplement
+ blk; Yezidi ; Yezidi
+ -> add to uchar.h before UBLOCK_COUNT
+ use long property names for enum constants,
+ for the trailing comment get the block start code point: diff old & new Blocks.txt
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+
+ sc ; Chrs ; Chorasmian
+ sc ; Diak ; Dives_Akuru
+ sc ; Kits ; Khitan_Small_Script
+ sc ; Yezi ; Yezidi
+ -> uscript.h & com.ibm.icu.lang.UScript
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+
+ InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left
+ -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
+
+* build ICU (make install)
+ to make sure that there are no syntax errors, and
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
+
+* update spoof checker UnicodeSet initializers:
+ inclusionPat & recommendedPat in i18n/uspoof.cpp
+ INCLUSION & RECOMMENDED in SpoofChecker.java
+- make sure that the Unicode Tools tree contains the latest security data files
+- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
+- update the hardcoded version number there in the DIRECTORY path
+- run the tool (no special environment variables needed)
+- copy & paste from the Console output into the .cpp & .java files
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg/icu4c
+ bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
+
+* build Unicode tools using CMake+make
+
+$ICU_SRC/tools/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
+# Location of the ICU4C source tree.
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
+
+ $ICU_ROOT/dbg$
+ mkdir -p tools/unicode/c
+ cd tools/unicode/c
+
+ $ICU_ROOT/dbg/tools/unicode/c$
+ cmake ../../../../src/tools/unicode/c
+ make
+
+* generate core properties data files
+ $ICU_ROOT/dbg/tools/unicode/c$
+ genprops/genprops $ICU_SRC/icu4c
+- tool failure:
+ genprops: Script_Extensions indexes overflow bit field
+ genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
+ -> uprops.icu data file format :
+ add two more bits to store a script code or Script_Extensions index
+ -> generator code, C++ & Java runtime, uprops.icu format version 7.7
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..13.0: U+2260, U+226E, U+226F
+- nothing new in this Unicode version, no test file to update
+
+* run & fix ICU4C tests
+- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
+- Andy helps with RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+ diff the main mapping file, look for bad changes
+ (for example, more bytes per weight for common characters)
+ ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
+ ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt
+
+- CLDR root data files are checked into $CLDR_SRC/common/uca/
+ cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- run genuca
+ $ICU_ROOT/dbg/tools/unicode/c$
+ genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
+ genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
+- rebuild ICU4C
+
+* Unihan collators
+ https://sites.google.com/site/unicodetools/unihan
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollators
+ with VM arguments
+ -ea
+ -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
+ -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
+ -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
+ -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
+ -DUVERSION=13.0.0
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollatorFiles
+ with the same arguments
+- check CLDR diffs
+ cd $CLDR_SRC
+ meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
+ meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
+- copy to CLDR
+ cd $CLDR_SRC
+ cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
+ cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
+- run CLDR unit tests, commit to CLDR
+- generate ICU zh collation data: run CLDR
+ org.unicode.cldr.icu.NewLdml2IcuConverter
+ with program arguments
+ -t collation
+ -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
+ -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
+ -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
+ -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
+ zh
+ and VM arguments
+ -ea
+ -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** CLDR numbering systems
+- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
+ for example, look for
+ ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
+ in new blocks (Blocks.txt)
+ Unicode 13:
+ diak 11950..11959 Dives_Akuru
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools are checked in:
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Unicode 12.1 update for ICU 64.2
+
+** This is an abbreviated update with one new character for the new
+** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
+https://en.wikipedia.org/wiki/Reiwa_period
+
+http://www.unicode.org/versions/Unicode12.1.0/
+
+ICU-20497 Unicode 12.1
+
+cldrbug 11978: Unicode 12.1
+
+* Command-line environment setup
+
+UNICODE_DATA=~/unidata/uni121/20190403
+CLDR_SRC=~/svn.cldr/uni
+ICU_ROOT=~/icu/uni
+ICU_SRC=$ICU_ROOT/src
+ICUDT=icudt64b
+ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+ cd $ICU_ROOT/dbg/icu4c
+ ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
+
+*** data files & enums & parser code
+
+* download files
+- mkdir -p $UNICODE_DATA
+- download Unicode files into $UNICODE_DATA
+ + subfolders: emoji, idna, security, ucd, uca
+ + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+
+* for manual diffs and for Unicode Tools input data updates:
+ remove version suffixes from the file names
+ ~$ unidata/desuffixucd.py $UNICODE_DATA
+ (see https://sites.google.com/site/unicodetools/inputdata)
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
+
+* update spoof checker UnicodeSet initializers:
+ inclusionPat & recommendedPat in uspoof.cpp
+ INCLUSION & RECOMMENDED in SpoofChecker.java
+- make sure that the Unicode Tools tree contains the latest security data files
+- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
+- update the hardcoded version number there in the DIRECTORY path
+- run the tool (no special environment variables needed)
+- copy & paste from the Console output into the .cpp & .java files
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg/icu4c
+ bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
+
+* build Unicode tools using CMake+make
+
+$ICU_SRC/tools/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
+# Location of the ICU4C source tree.
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
+
+ $ICU_ROOT/dbg$
+ mkdir -p tools/unicode/c
+ cd tools/unicode/c
+
+ $ICU_ROOT/dbg/tools/unicode/c$
+ cmake ../../../../src/tools/unicode/c
+ make
+
+* generate core properties data files
+ $ICU_ROOT/dbg/tools/unicode/c$
+ genprops/genprops $ICU_SRC/icu4c
+ genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
+ genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..12.1: U+2260, U+226E, U+226F
+- nothing new in this Unicode version, no test file to update
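The grep step above can be sketched as follows; the line format assumed here is the UTS #46 IdnaMappingTable.txt style (code point, then status), demonstrated on inline sample lines rather than the real file:

```shell
# Keep disallowed_STD3_valid lines whose code point is outside ASCII
# (0000..007F); the second grep drops ASCII code points.
printf '003D ; disallowed_STD3_valid\n2260 ; disallowed_STD3_valid\n' |
  grep 'disallowed_STD3_valid' | grep -Ev '^00[0-7][0-9A-F]'
```

On the real files, replace the printf with the path to IdnaMappingTable.txt or uts46.txt.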
+
+* run & fix ICU4C tests
+- Andy handles RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+ diff the main mapping file, look for bad changes
+ (for example, more bytes per weight for common characters)
+ ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
+ ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt
+
+- CLDR root data files are checked into $CLDR_SRC/common/uca/
+ cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+    (note: the underscore before "Rules" is removed in the ICU file name)
+ cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- run genuca, see command line above
+- rebuild ICU4C
+
+* Unihan collators
+ https://sites.google.com/site/unicodetools/unihan
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollators
+ with VM arguments
+ -ea
+ -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
+ -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
+ -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
+ -DUVERSION=12.1.0
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollatorFiles
+ with the same arguments
+- check CLDR diffs
+ cd $CLDR_SRC
+ meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
+ meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
+- copy to CLDR
+ cd $CLDR_SRC
+ cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
+ cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
+- run CLDR unit tests, commit to CLDR
+- generate ICU zh collation data: run CLDR
+ org.unicode.cldr.icu.NewLdml2IcuConverter
+ with program arguments
+ -t collation
+ -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
+ -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
+ -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
+ -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
+ zh
+ and VM arguments
+ -ea
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh only the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
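The copy & paste can be semi-automated. A minimal sketch that extracts an array initializer by name, demonstrated on an inline snippet (on the real file, run the same sed over collationfcd.cpp; the array names are assumed to be unchanged):

```shell
snippet='static const uint8_t lcccIndex[]={
0,0,1,2
};
// unrelated code'
# print from the line containing "lcccIndex[]" through the closing "};"
printf '%s\n' "$snippet" | sed -n '/lcccIndex\[\]/,/};/p'
```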
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** CLDR numbering systems
+- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
+ for example, look for
+ ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
+ in new blocks (Blocks.txt)
+ Unicode 12: using Unicode 12 CLDR ticket #11478
+ hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
+ wcho 1E2F0..1E2F9 Wancho
+ Unicode 11: using Unicode 11 CLDR ticket #10978
+ rohg 10D30..10D39 Hanifi_Rohingya
+ gong 11DA0..11DA9 Gunjala_Gondi
+ Earlier: CLDR tickets specific to adding new numbering systems.
+ Unicode 10: http://unicode.org/cldr/trac/ticket/10219
+ Unicode 9: http://unicode.org/cldr/trac/ticket/9692
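As a quick arithmetic check on a candidate range, the nv=4 character is the block start plus 4 (start value taken from the hmnp range above):

```shell
start=1E140   # Nyiakeng_Puachue_Hmong digits start
printf 'U+%04X\n' $(( 0x$start + 4 ))   # the digit four of that range
```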
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools are checked in:
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Unicode 12.0 update for ICU 64
+
+http://www.unicode.org/versions/Unicode12.0.0/
+http://unicode.org/versions/beta-12.0.0.html
+https://www.unicode.org/review/pri389/
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://www.unicode.org/reports/tr44/tr44-23.html
+
+ICU-20203 Unicode 12
+
+ICU-20111 move text layout properties data into a data file
+
+cldrbug 11478: Unicode 12
+Accidentally used ^/trunk instead of ^/branches/markus/uni12
+
+* Command-line environment setup
+
+UNICODE_DATA=~/unidata/uni12/20190309
+CLDR_SRC=~/svn.cldr/uni
+ICU_ROOT=~/icu/uni
+ICU_SRC=$ICU_ROOT/src
+ICUDT=icudt63b
+ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
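A small, optional sanity check (not part of the original process) that the variables above are all set before running the later commands:

```shell
# warn about any unset/empty variable from the setup block above
for v in UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA; do
  eval "val=\${$v:-}"
  [ -n "$val" ] || echo "warning: $v is not set"
done
```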
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* download files
+- mkdir -p $UNICODE_DATA
+- download Unicode files into $UNICODE_DATA
+ + subfolders: emoji, idna, security, ucd, uca
+ + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+
+* for manual diffs and for Unicode Tools input data updates:
+ remove version suffixes from the file names
+ ~$ unidata/desuffixucd.py $UNICODE_DATA
+ (see https://sites.google.com/site/unicodetools/inputdata)
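What "remove version suffixes" means, as a hand-rolled sed sketch; desuffixucd.py is the authoritative tool, and the suffix pattern here is an assumption based on typical beta file names:

```shell
# strip a trailing "-<version>" from a UCD file name, e.g.
#   UnicodeData-12.1.0d3.txt -> UnicodeData.txt
f=UnicodeData-12.1.0d3.txt
echo "$f" | sed -E 's/-[0-9][0-9a-zA-Z.]*\.txt$/.txt/'
```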
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
+
+* new constants for new property values
+- preparseucd.py error:
+ ValueError: missing uchar.h enum constants for some property values:
+ [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
+ u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
+ u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
+ (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
+ = PropertyValueAliases.txt new property values (diff old & new .txt files)
+ blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
+ blk; Elymaic ; Elymaic
+ blk; Nandinagari ; Nandinagari
+ blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong
+ blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers
+ blk; Small_Kana_Ext ; Small_Kana_Extension
+ blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A
+ blk; Tamil_Sup ; Tamil_Supplement
+ blk; Wancho ; Wancho
+ -> add to uchar.h
+ use long property names for enum constants,
+ for the trailing comment get the block start code point: diff old & new Blocks.txt
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+      replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
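The two Eclipse find/replace passes can be dry-run with sed -E on a single line; Elymaic is one of the new blocks above, but the enum value 298 and the trailing block comment are hypothetical:

```shell
line='UBLOCK_ELYMAIC = 298, /*[10FE0]*/'
# pass 1: the UnicodeBlock ID constant
echo "$line" | sed -E 's%UBLOCK_([^ ]+) = ([0-9]+), (/.+)%public static final int \1_ID = \2; \3%'
# pass 2: the UnicodeBlock object (applied to the original uchar.h line)
echo "$line" | sed -E 's%UBLOCK_([^ ]+) = [0-9]+, (/.+)%public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2%'
```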
+
+ sc ; Elym ; Elymaic
+ sc ; Hmnp ; Nyiakeng_Puachue_Hmong
+ sc ; Nand ; Nandinagari
+ sc ; Wcho ; Wancho
+ -> uscript.h & com.ibm.icu.lang.UScript
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
+
+* update spoof checker UnicodeSet initializers:
+ inclusionPat & recommendedPat in uspoof.cpp
+ INCLUSION & RECOMMENDED in SpoofChecker.java
+- make sure that the Unicode Tools tree contains the latest security data files
+- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
+- update the hardcoded version number there in the DIRECTORY path
+- run the tool (no special environment variables needed)
+- copy & paste from the Console output into the .cpp & .java files
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg/icu4c
+ bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
+
+* build Unicode tools using CMake+make
+
+$ICU_SRC/tools/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
+# Location of the ICU4C source tree.
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
+
+ $ICU_ROOT/dbg$
+ mkdir -p tools/unicode/c
+ cd tools/unicode/c
+
+ $ICU_ROOT/dbg/tools/unicode/c$
+ cmake ../../../../src/tools/unicode/c
+ make
+
+* generate core properties data files
+ $ICU_ROOT/dbg/tools/unicode/c$
+ genprops/genprops $ICU_SRC/icu4c
+ genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
+ genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..12.0: U+2260, U+226E, U+226F
+- nothing new in this Unicode version, no test file to update
+
+* run & fix ICU4C tests
+- update test of default bidi classes:
+ Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
+ see diffs in DerivedBidiClass.txt
+ + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
+ + UCharacterTest.java TestIteration() defaultBidi[]
+- Andy handles RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+ diff the main mapping file, look for bad changes
+ (for example, more bytes per weight for common characters)
+ ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
+ ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt
+
+- CLDR root data files are checked into $CLDR_SRC/common/uca/
+ cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+    (note: the underscore before "Rules" is removed in the ICU file name)
+ cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- run genuca, see command line above;
+ deal with
+ Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
+ FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)
+ (add the character to genuca.cpp sampleCharsToScripts[])
+ + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
+ and cache its values.
+ Works as long as the script metadata is updated before the collation data.
+- rebuild ICU4C
+
+* Unihan collators
+ https://sites.google.com/site/unicodetools/unihan
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollators
+ with VM arguments
+ -ea
+ -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
+ -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
+ -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
+ -DUVERSION=12.0.0
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollatorFiles
+ with the same arguments
+- check CLDR diffs
+ cd $CLDR_SRC
+ meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
+ meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
+- copy to CLDR
+ cd $CLDR_SRC
+ cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
+ cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
+- run CLDR unit tests, commit to CLDR
+- generate ICU zh collation data: run CLDR
+ org.unicode.cldr.icu.NewLdml2IcuConverter
+ with program arguments
+ -t collation
+ -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
+ -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
+ -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
+ -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
+ zh
+ and VM arguments
+ -ea
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh only the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt63l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** CLDR numbering systems
+- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
+ for example, look for
+ ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
+ in new blocks (Blocks.txt)
+ Unicode 12: using Unicode 12 CLDR ticket #11478
+ hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
+ wcho 1E2F0..1E2F9 Wancho
+ Unicode 11: using Unicode 11 CLDR ticket #10978
+ rohg 10D30..10D39 Hanifi_Rohingya
+ gong 11DA0..11DA9 Gunjala_Gondi
+ Earlier: CLDR tickets specific to adding new numbering systems.
+ Unicode 10: http://unicode.org/cldr/trac/ticket/10219
+ Unicode 9: http://unicode.org/cldr/trac/ticket/9692
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools are checked in:
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+ICU 63 addition of ICU support of text layout properties InPC, InSC, vo
+
+* Command-line environment setup
+
+UNICODE_DATA=~/unidata/uni11/20180609
+CLDR_SRC=~/svn.cldr/uni
+ICU_ROOT=~/icu/mine
+ICU_SRC=$ICU_ROOT/src
+ICUDT=icudt62b
+ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** Links
+
+https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
+https://unicode-org.atlassian.net/browse/ICU-12850 vo
+
+*** data files & enums & parser code
+
+* API additions
+- for each of the three new enumerated properties
+ + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
+ + uchar.h: update UCHAR_INT_LIMIT
+ + uchar.h: add the enum U<long prop name>
+ with constants U_<short prop name>_<long value name>
+ + UProperty.java: add the constant <long prop name>
+ + UProperty.java: update INT_LIMIT
+ + UCharacter.java: add the interface <long prop name>
+ with constants <long value name>
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
+ names and aliases.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+* preparseucd.py changes
+- add new property short names (uppercase) to _prop_and_value_re
+ so that ParseUCharHeader() parses the new enum constants
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
+
+* build Unicode tools using CMake+make
+
+$ICU_SRC/tools/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
+# Location of the ICU4C source tree.
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)
+
+ $ICU_ROOT/dbg$
+ mkdir -p tools/unicode/c
+ cd tools/unicode/c
+
+ $ICU_ROOT/dbg/tools/unicode/c$
+    cmake ../../../../src/tools/unicode/c
+ make
+
+* generate core properties data files
+ $ICU_ROOT/dbg/tools/unicode/c$
+ genprops/genprops $ICU_SRC/icu4c
+- rebuild ICU (make install) & tools
+
+* write data for runtime, hardcoded for now
+- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
+- generate new icu4c/source/common/ulayout_props_data.h
+- for each of the three new enumerated properties
+ + int property max value
+ + small, 8-bit UCPTrie
+ (A small 16-bit trie with bit fields for these three properties
+ is very nearly the same size as the sum of the three.)
+
+* wire into C++
+- uprops.cpp: #include ulayout_props_data.h
+- uprops.cpp: add getInPC() etc. functions
+- uprops.cpp: add lines to intProps[], include max values
+- uprops.h: add UPropertySource constants
+- uprops.cpp: add uprops_addPropertyStarts(src)
+- uniset_props.cpp: add to UnicodeSet_initInclusion()
+- intltest/ucdtest.cpp: write unit tests
+
+* update Java data files
+- refresh only the pnames.icu file with the new property [value] names, just to be safe
+- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* wire into Java
+- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
+- UCharacterProperty.java: for each new property
+ + create a nested class to hold its CodePointTrie
+ + initialize it from a string literal
+ + paste in the initializer printed by genprops
+ + add a new IntProperty object to the intProps[] array
+ + use the correct max int value for each property, also printed by genprops
+- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
+- UnicodeSet.java: add to getInclusions()
+- UCharacterTest.java: write unit tests
+
+---------------------------------------------------------------------------- ***
+
+Unicode 11.0 update for ICU 62
+
+http://www.unicode.org/versions/Unicode11.0.0/
+http://unicode.org/versions/beta-11.0.0.html
+https://www.unicode.org/review/pri372/
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://www.unicode.org/reports/tr44/tr44-21.html
+
+* Command-line environment setup
+
+UNICODE_DATA=~/unidata/uni11/20180521
+CLDR_SRC=~/svn.cldr/uni
+ICU_ROOT=~/svn.icu/uni
+ICU_SRC=$ICU_ROOT/src
+ICUDT=icudt61b
+ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** ICU Trac
+
+- ticket:13630: Unicode 11
+- ^/branches/markus/uni11
+
+*** CLDR Trac
+
+- cldrbug 10978: Unicode 11
+- ^/branches/markus/uni11
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* download files
+- mkdir -p $UNICODE_DATA
+- download Unicode files into $UNICODE_DATA
+ + subfolders: emoji, idna, security, ucd, uca
+ + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+
+* for manual diffs and for Unicode Tools input data updates:
+ remove version suffixes from the file names
+ ~$ unidata/desuffixucd.py $UNICODE_DATA
+ (see https://sites.google.com/site/unicodetools/inputdata)
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
+
+* preparseucd.py changes
+- fix other errors
+ NameError: unknown property Extended_Pictographic
+ -> add Extended_Pictographic binary property
+ -> add new short names for all Emoji properties
+
+* new constants for new property values
+- preparseucd.py error:
+ ValueError: missing uchar.h enum constants for some property values:
+ [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
+ u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
+ u'Indic_Siyaq_Numbers'])),
+ (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
+ (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
+ (u'GCB', set([u'LinkC', u'Virama'])),
+ (u'WB', set([u'WSegSpace']))]
+ = PropertyValueAliases.txt new property values (diff old & new .txt files)
+ blk; Chess_Symbols ; Chess_Symbols
+ blk; Dogra ; Dogra
+ blk; Georgian_Ext ; Georgian_Extended
+ blk; Gunjala_Gondi ; Gunjala_Gondi
+ blk; Hanifi_Rohingya ; Hanifi_Rohingya
+ blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
+ blk; Makasar ; Makasar
+ blk; Mayan_Numerals ; Mayan_Numerals
+ blk; Medefaidrin ; Medefaidrin
+ blk; Old_Sogdian ; Old_Sogdian
+ blk; Sogdian ; Sogdian
+ -> add to uchar.h
+ use long property names for enum constants,
+ for the trailing comment get the block start code point: diff old & new Blocks.txt
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
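+  The two Eclipse find/replace passes above can also be scripted; a sketch
+  with Python's re module, run on a hypothetical uchar.h line (the numeric
+  value 292 is illustrative, not the actual enum constant):

```python
import re

line = "UBLOCK_CHESS_SYMBOLS = 292, /*[1FA00]*/"

# Pass 1: uchar.h enum constant -> UCharacter.UnicodeBlock ID
id_line = re.sub(r'UBLOCK_([^ ]+) = ([0-9]+), (/.+)',
                 r'public static final int \1_ID = \2; \3', line)

# Pass 2: uchar.h enum constant -> UCharacter.UnicodeBlock object
obj_line = re.sub(r'UBLOCK_([^ ]+) = [0-9]+, (/.+)',
                  r'public static final UnicodeBlock \1 = '
                  r'new UnicodeBlock("\1", \1_ID); \2', line)
```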
+
+ GCB; LinkC ; LinkingConsonant
+ GCB; Virama ; Virama
+ -> uchar.h & UCharacter.GraphemeClusterBreak
+  -> these two were later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
+
+ InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
+ -> ignore: ICU does not yet support this property
+
+ jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
+ jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
+ -> uchar.h & UCharacter.JoiningGroup
+
+ sc ; Dogr ; Dogra
+ sc ; Gong ; Gunjala_Gondi
+ sc ; Maka ; Makasar
+ sc ; Medf ; Medefaidrin
+ sc ; Rohg ; Hanifi_Rohingya
+ sc ; Sogd ; Sogdian
+ sc ; Sogo ; Old_Sogdian
+ -> uscript.h & com.ibm.icu.lang.UScript
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+
+ WB ; WSegSpace ; WSegSpace
+ -> uchar.h & UCharacter.WordBreak
+
+* New short names for emoji properties
+- see UTS #51
+- short names set in preparseucd.py
+
+* New properties
+- boolean emoji property Extended_Pictographic
+ -> added in preparseucd.py
+ -> uchar.h & UProperty.java
+- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
+ as shown in PropertyValueAliases.txt
+ -> ignore for now
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
+
+* update spoof checker UnicodeSet initializers:
+ inclusionPat & recommendedPat in uspoof.cpp
+ INCLUSION & RECOMMENDED in SpoofChecker.java
+- make sure that the Unicode Tools tree contains the latest security data files
+- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
+- update the hardcoded version number there in the DIRECTORY path
+- run the tool (no special environment variables needed)
+- copy & paste from the Console output into the .cpp & .java files
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg/icu4c
+ bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
+
+* build Unicode tools using CMake+make
+
+$ICU_SRC/tools/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
+# Location of the ICU4C source tree.
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
+
+ $ICU_ROOT/dbg$
+ mkdir -p tools/unicode/c
+ cd tools/unicode/c
+
+ $ICU_ROOT/dbg/tools/unicode/c$
+ cmake ../../../../src/tools/unicode/c
+ make
+
+* generate core properties data files
+ $ICU_ROOT/dbg/tools/unicode/c$
+ genprops/genprops $ICU_SRC/icu4c
+ genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
+ genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
+- rebuild ICU (make install) & tools
+
+* Fix case props
+ genprops error: casepropsbuilder: too many exceptions words
+ genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
+- With the addition of Georgian Mtavruli capital letters,
+  there are now too many simple case mappings with big mapping deltas
+  that yield incompressible exceptions.
+- Changed the data structure (now formatVersion 4):
+  added one bit for no-simple-case-folding (for Cherokee) and
+  one optional slot for a big delta (for most faraway mappings),
+  together with another bit for whether that delta is negative.
+  This makes most Cherokee & Georgian etc. case mappings compressible,
+  reducing the number of exceptions words.
+- Made further changes to gain one more bit for the exceptions index,
+  for future growth. For details, see casepropsbuilder.cpp.
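+  The Mtavruli problem is visible from the raw deltas alone; an
+  illustrative comparison (this shows only the magnitude of the
+  simple-case-mapping offsets, not ICU's actual encoding):

```python
# Simple lowercase deltas: ASCII stays tiny, while Georgian Mtavruli
# (U+1C90 GEORGIAN MTAVRULI CAPITAL AN -> U+10D0 GEORGIAN LETTER AN)
# jumps by thousands of code points, overflowing a small delta field.
delta_ascii = ord('a') - ord('A')     # 32
delta_mtavruli = 0x10D0 - 0x1C90      # -3008
```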
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..11.0: U+2260, U+226E, U+226F
+- nothing new in this Unicode version, no test file to update
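+  The grep step amounts to a small filter over the data lines, assuming
+  the UTS #46 field layout (code point or range ; status ; ...):

```python
def non_ascii_std3_lines(lines):
    """Yield data lines whose status field is disallowed_STD3_valid and
    whose (first) code point lies outside ASCII."""
    for line in lines:
        data = line.split('#', 1)[0]          # drop trailing comment
        fields = [f.strip() for f in data.split(';')]
        if len(fields) >= 2 and fields[1] == 'disallowed_STD3_valid':
            start = int(fields[0].split('..')[0], 16)
            if start > 0x7F:
                yield line.rstrip()
```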
+
+* run & fix ICU4C tests
+- Andy handles RBBI & spoof check test failures
+
+- Errors in char.txt, word.txt, word_POSIX.txt like
+ createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
+ because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
+ -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
+ not empty, just to get ICU building.
+ -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
+ and properties together with the rules that used them (GB 10, WB 14).
+ -> Andy adjusts the rule sets further to sync with
+ Unicode 11 grapheme, word, and line break spec changes.
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+ diff the main mapping file, look for bad changes
+ (for example, more bytes per weight for common characters)
+ ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
+ ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
+
+- CLDR root data files are checked into $CLDR_SRC/common/uca/
+ cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- run genuca, see command line above;
+ deal with
+ Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
+ FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
+ (add the character to genuca.cpp sampleCharsToScripts[])
+ + look up the USCRIPT_ code for the new sample characters
+ (should be obvious from the comment in the error output)
+ + *add* mappings to sampleCharsToScripts[], do not replace them
+ (in case the script sample characters flip-flop)
+ + insert new scripts in DUCET script order, see the top_byte table
+ at the beginning of FractionalUCA.txt
+- rebuild ICU4C
+
+* Unihan collators
+ https://sites.google.com/site/unicodetools/unihan
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollators
+ with VM arguments
+ -ea
+ -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
+ -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
+ -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
+ -DUVERSION=11.0.0
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollatorFiles
+ with the same arguments
+- check CLDR diffs
+ cd $CLDR_SRC
+ meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
+ meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
+- copy to CLDR
+ cd $CLDR_SRC
+ cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
+ cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
+- run CLDR unit tests, commit to CLDR
+- generate ICU zh collation data: run CLDR
+ org.unicode.cldr.icu.NewLdml2IcuConverter
+ with program arguments
+ -t collation
+ -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
+ -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
+ -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
+ -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
+ zh
+ and VM arguments
+ -ea
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt61l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** CLDR numbering systems
+- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
+  Unicode 11: CLDR ticket #10978
+ rohg 10D30..10D39 Hanifi_Rohingya
+ gong 11DA0..11DA9 Gunjala_Gondi
+ Earlier: CLDR tickets specific to adding new numbering systems.
+ Unicode 10: http://unicode.org/cldr/trac/ticket/10219
+ Unicode 9: http://unicode.org/cldr/trac/ticket/9692
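+  The gc=Nd & nv=4 query above can be approximated with Python's stdlib
+  unicodedata (limited to the Unicode version bundled with the running
+  interpreter; the UCD files remain authoritative):

```python
import unicodedata

def digit_four_starts():
    """Code points with general category Nd and decimal value 4.

    By UCD convention decimal digits come in contiguous 0..9 runs, so
    each hit marks a digit set whose zero is four code points earlier.
    """
    hits = []
    for cp in range(0x110000):
        ch = chr(cp)
        if (unicodedata.category(ch) == 'Nd'
                and unicodedata.decimal(ch, None) == 4):
            hits.append(cp)
    return hits
```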
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools are checked in:
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Unicode 10.0 update for ICU 60
+
+http://www.unicode.org/versions/Unicode10.0.0/
+http://www.unicode.org/versions/beta-10.0.0.html
+http://blog.unicode.org/2017/03/unicode-100-beta-review.html
+http://www.unicode.org/review/pri350/
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://www.unicode.org/reports/tr44/tr44-19.html
+
+* Command-line environment setup
+
+UNICODE_DATA=~/unidata/uni10/20170605
+CLDR_SRC=~/svn.cldr/uni10
+ICU_ROOT=~/svn.icu/uni10
+ICU_SRC=$ICU_ROOT/src
+ICUDT=icudt60b
+ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
+ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
+
+*** ICU Trac
+
+- ticket:12985: Unicode 10
+- ticket:13061: undo hacks from emoji 5.0 update
+- ticket:13062: add Emoji_Component property
+- ^/branches/markus/uni10
+
+*** CLDR Trac
+
+- cldrbug 10055: Unicode 10
+- cldrbug 9882: Unicode 10 script metadata
+- cldrbug 10219: numbering systems for Unicode 10
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* download files
+- mkdir -p $UNICODE_DATA
+- download Unicode 10.0 files into $UNICODE_DATA
+ + subfolders: ucd, uca, idna, security
+ + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+- download emoji 5.0 files into $UNICODE_DATA/emoji
+
+* for manual diffs: remove version suffixes from the file names
+ ~$ unidata/desuffixucd.py $UNICODE_DATA
+ (see https://sites.google.com/site/unicodetools/inputdata)
+
+* process and/or copy files
+- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+ + For debugging, and tweaking how ppucd.txt is written,
+ the tool has an --only_ppucd option:
+ py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
+
+- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
+
+* preparseucd.py changes
+- remove or add new Unicode scripts from/to the
+ only-in-ISO-15924 list according to the error messages:
+ ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
+ -> adjust _scripts_only_in_iso15924 as indicated
+- fix other errors
+ Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
+ -> add vo=Vertical_Orientation to _ignored_properties
+  -> later removed again; we now parse the file even though we do not yet store its data for runtime use
+
+* new constants for new property values
+- preparseucd.py error:
+ ValueError: missing uchar.h enum constants for some property values:
+ [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
+ u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
+ (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
+ u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
+ u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
+ (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
+ = PropertyValueAliases.txt new property values (diff old & new .txt files)
+ blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F
+ blk; Kana_Ext_A ; Kana_Extended_A
+ blk; Masaram_Gondi ; Masaram_Gondi
+ blk; Nushu ; Nushu
+ blk; Soyombo ; Soyombo
+ blk; Syriac_Sup ; Syriac_Supplement
+ blk; Zanabazar_Square ; Zanabazar_Square
+ -> add to uchar.h
+ use long property names for enum constants,
+ for the trailing comment get the block start code point: diff old & new Blocks.txt
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+
+ jg ; Malayalam_Bha ; Malayalam_Bha
+ jg ; Malayalam_Ja ; Malayalam_Ja
+ jg ; Malayalam_Lla ; Malayalam_Lla
+ jg ; Malayalam_Llla ; Malayalam_Llla
+ jg ; Malayalam_Nga ; Malayalam_Nga
+ jg ; Malayalam_Nna ; Malayalam_Nna
+ jg ; Malayalam_Nnna ; Malayalam_Nnna
+ jg ; Malayalam_Nya ; Malayalam_Nya
+ jg ; Malayalam_Ra ; Malayalam_Ra
+ jg ; Malayalam_Ssa ; Malayalam_Ssa
+ jg ; Malayalam_Tta ; Malayalam_Tta
+ -> uchar.h & UCharacter.JoiningGroup
+
+ sc ; Gonm ; Masaram_Gondi
+ sc ; Nshu ; Nushu
+ sc ; Soyo ; Soyombo
+ sc ; Zanb ; Zanabazar_Square
+ -> uscript.h & com.ibm.icu.lang.UScript
+ -> Nushu had been added already
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+
+* New properties as shown in PropertyValueAliases.txt changes
+- boolean Emoji_Component from emoji 5
+ -> uchar.h & UProperty.java
+- boolean
+ # Regional_Indicator (RI)
+
+ RI ; N ; No ; F ; False
+ RI ; Y ; Yes ; T ; True
+ -> uchar.h & UProperty.java
+ -> single immutable range, to be hardcoded
+- boolean
+ # Prepended_Concatenation_Mark (PCM)
+
+ PCM; N ; No ; F ; False
+ PCM; Y ; Yes ; T ; True
+ -> was new in Unicode 9
+ -> uchar.h & UProperty.java
+- enumerated
+ # Vertical_Orientation (vo)
+
+ vo ; R ; Rotated
+ vo ; Tr ; Transformed_Rotated
+ vo ; Tu ; Transformed_Upright
+ vo ; U ; Upright
+ -> only pre-parsed for now, but not yet stored for runtime use
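+  For the Regional_Indicator bullet above, the single immutable range is
+  U+1F1E6..U+1F1FF (the 26 regional indicator symbol letters), which is
+  why a hardcoded range check suffices instead of stored property data:

```python
# Regional_Indicator occupies exactly one fixed range, U+1F1E6..U+1F1FF.
RI_START, RI_END = 0x1F1E6, 0x1F1FF

def is_regional_indicator(cp):
    return RI_START <= cp <= RI_END
```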
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg/icu4c
+ bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
+
+* build Unicode tools using CMake+make
+
+$ICU_SRC/tools/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
+# Location of the ICU4C source tree.
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
+
+ $ICU_ROOT/dbg/tools/unicode/c$
+ cmake ../../../../src/tools/unicode/c
+ make
+
+* generate core properties data files
+ $ICU_ROOT/dbg/tools/unicode/c$
+ genprops/genprops $ICU_SRC/icu4c
+ genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
+ genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..10.0: U+2260, U+226E, U+226F
+- nothing new in this Unicode version, no test file to update
+
+* run & fix ICU4C tests
+- Andy handles RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+- CLDR root data files are checked into $CLDR_SRC/common/uca/
+ cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
+
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
+
+- run genuca, see command line above;
+ deal with
+ Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
+ FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
+ (add the character to genuca.cpp sampleCharsToScripts[])
+ + look up the USCRIPT_ code for the new sample characters
+ (should be obvious from the comment in the error output)
+ + *add* mappings to sampleCharsToScripts[], do not replace them
+ (in case the script sample characters flip-flop)
+ + insert new scripts in DUCET script order, see the top_byte table
+ at the beginning of FractionalUCA.txt
+- rebuild ICU4C
+
+* Unihan collators
+ https://sites.google.com/site/unicodetools/unihan
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollators
+ with VM arguments
+ -ea
+ -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
+ -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
+ -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
+ -DUVERSION=10.0.0
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollatorFiles
+ with the same arguments
+- check CLDR diffs
+ cd $CLDR_SRC
+ meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
+ meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
+- copy to CLDR
+ cd $CLDR_SRC
+ cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
+ cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
+- run CLDR unit tests, commit to CLDR
+- generate ICU zh collation data: run CLDR
+ org.unicode.cldr.icu.NewLdml2IcuConverter
+ with program arguments
+ -t collation
+ -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
+ -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
+ -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
+ -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
+ zh
+ and VM arguments
+ -ea
+ -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt60l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
+or
+- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC/icu4c/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** CLDR numbering systems
+- look for new sets of decimal digits (gc=Nd & nv=4) and submit a CLDR ticket
+ Unicode 10: http://unicode.org/cldr/trac/ticket/10219
+ Unicode 9: http://unicode.org/cldr/trac/ticket/9692
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools are checked in:
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+
+---------------------------------------------------------------------------- ***
+
+Emoji 5.0 update for ICU 59
+- ICU 59 mostly remains on Unicode 9.0,
+  except that it updates bidi and segmentation data to the Unicode 10 beta
+
+First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
+
+* Command-line environment setup
+
+ICU_ROOT=~/svn.icu/trunk
+ICU_SRC_DIR=$ICU_ROOT/src
+ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
+ICUDT=icudt59b
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
+UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
+
+*** ICU Trac
+
+- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
+- changes directly on trunk
+
+*** data files & enums & parser code
+
+* download files
+
+- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
+- download emoji 5.0 beta files into the same uni90e50 folder
+- download Unicode 10.0 beta files: ucd
+ + copy Unicode 10 bidi files to the uni90e50/ucd folder:
+ BidiBrackets.txt
+ BidiCharacterTest.txt
+ BidiMirroring.txt
+ BidiTest.txt
+ extracted/DerivedBidiClass.txt
+ + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
+ LineBreak.txt
+ auxiliary/*
+
+* preparseucd.py changes
+- adjust for combined trunks
+- write new copyright lines
+- ignore new Emoji_Component property for now
+
+* process and/or copy files
+- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
+ + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+
+- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
+
+* build Unicode tools using CMake+make
+
+~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
+# Location of the ICU4C source tree.
+set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
+
+ ~/svn.icu/trunk/dbg/tools/unicode/c$
+ cmake ../../../../src/tools/unicode/c
+ make
+
+* generate core properties data files
+ ~/svn.icu/trunk/dbg/tools/unicode/c$
+ genprops/genprops $ICU4C_SRC_DIR
+- rebuild ICU (make install) & tools
+
+* run & fix ICU4C tests
+- Andy handles RBBI & spoof check test failures
+
+* update Java data files
+- refresh only the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt59l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
+or
+- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU4C_SRC_DIR/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+---------------------------------------------------------------------------- ***
+
+Unicode 9.0 update for ICU 58
+
+* Command-line environment setup
+
+ICU_ROOT=~/svn.icu/trunk
+ICU_SRC_DIR=$ICU_ROOT/src
+ICUDT=icudt58b
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
+UNIDATA=$ICU_SRC_DIR/source/data/unidata
+
+http://www.unicode.org/review/pri323/ -- beta review
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://www.unicode.org/versions/beta-9.0.0.html
+http://www.unicode.org/versions/Unicode9.0.0/
+http://www.unicode.org/reports/tr44/tr44-17.html
+
+*** ICU Trac
+
+- ticket:12526: integrate Unicode 9
+- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
+- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
+
+*** CLDR Trac
+
+- cldrbug 9414: UCA 9
+- ^/branches/markus/uni90 at r11518 from trunk at r11517
+
+- cldrbug 8745: Unicode 9.0 script metadata
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* file preparation
+
+- download UCD & IDNA files
+- make sure that the Unicode data folder passed into preparseucd.py
+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
+- only for manual diffs: remove version suffixes from the file names
+ ~/unidata/uni70/20140403$ ../../desuffixucd.py .
+ (see https://sites.google.com/site/unicodetools/inputdata)
+- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+
+- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
+ and copy to $UNIDATA
+ cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
+
+* preparseucd.py changes
+- remove or add new Unicode scripts from/to the
+ only-in-ISO-15924 list according to the error messages:
+ ValueError: remove ['Tang'] from _scripts_only_in_iso15924
+ ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
+ ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
+ ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+- DerivedNumericValues.txt new numeric values
+ 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
+ 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
+ 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
+ 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
+ 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
+ -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
+ uchar.c, UCharacterProperty.java
+ to support a new series of values
+- adjust preparseucd.py for Tangut algorithmic names
+ in ppucd.txt:
+ algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
+ ->
+ algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
+- avoid block-compressing most String/Miscellaneous property values,
+ triggered by genprops not coping with a multi-code point Case_Folding on
+ block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
+ keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
+
+* PropertyAliases.txt changes
+- 1 new property PCM=Prepended_Concatenation_Mark
+ Ignore: Only useful for layout engines.
+ Ok to list in ppucd.txt.
+
+* PropertyValueAliases.txt new property values
+ blk; Adlam ; Adlam
+ blk; Bhaiksuki ; Bhaiksuki
+ blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
+ blk; Glagolitic_Sup ; Glagolitic_Supplement
+ blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
+ blk; Marchen ; Marchen
+ blk; Mongolian_Sup ; Mongolian_Supplement
+ blk; Newa ; Newa
+ blk; Osage ; Osage
+ blk; Tangut ; Tangut
+ blk; Tangut_Components ; Tangut_Components
+ -> add to uchar.h
+ use long property names for enum constants
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
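The Eclipse find/replace above can also be scripted; a sketch with Python's re module, applied to a hypothetical uchar.h enum line (the constant value 262 is illustrative only):

```python
import re

# Hypothetical uchar.h enum line
line = "    UBLOCK_TANGUT = 262, /*[17000]*/"

# -> UCharacter.UnicodeBlock ID constant
id_line = re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                 r"public static final int \1_ID = \2; \3", line)
# -> UCharacter.UnicodeBlock object
obj_line = re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                  r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                  line)
print(id_line)
print(obj_line)
```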
+
+ GCB; EB ; E_Base
+ GCB; EBG ; E_Base_GAZ
+ GCB; EM ; E_Modifier
+ GCB; GAZ ; Glue_After_Zwj
+ GCB; ZWJ ; ZWJ
+ -> uchar.h & UCharacter.GraphemeClusterBreak
+
+ jg ; African_Feh ; African_Feh
+ jg ; African_Noon ; African_Noon
+ jg ; African_Qaf ; African_Qaf
+ -> uchar.h & UCharacter.JoiningGroup
+
+ lb ; EB ; E_Base
+ lb ; EM ; E_Modifier
+ lb ; ZWJ ; ZWJ
+ -> uchar.h & UCharacter.LineBreak
+
+ sc ; Adlm ; Adlam
+ sc ; Bhks ; Bhaiksuki
+ sc ; Marc ; Marchen
+ sc ; Newa ; Newa
+ sc ; Osge ; Osage
+ sc ; Tang ; Tangut
+ -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
+
+ WB ; EB ; E_Base
+ WB ; EBG ; E_Base_GAZ
+ WB ; EM ; E_Modifier
+ WB ; GAZ ; Glue_After_Zwj
+ WB ; ZWJ ; ZWJ
+ -> uchar.h & UCharacter.WordBreak
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg
+ bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
+
+* build Unicode tools using CMake+make
+
+~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
+
+ # Location (--prefix) of where ICU was installed.
+ set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
+ # Location of the ICU source tree.
+ set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
+
+ ~/svn.icutools/trunk/dbg/unicode/c$
+ cmake ../../../src/unicode/c
+ make
+
+* generate core properties data files
+ ~/svn.icutools/trunk/dbg/unicode/c$
+ genprops/genprops $ICU_SRC_DIR
+ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
+ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..9.0: U+2260, U+226E, U+226F
+- nothing new in 9.0, no test file to update
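The grep step can be approximated in Python; this sketch runs on a small hypothetical excerpt of IdnaMappingTable.txt (a real run would read the actual file) and keeps only non-ASCII entries:

```python
import re

# Hypothetical excerpt of IdnaMappingTable.txt
sample = """\
003D          ; disallowed_STD3_valid                  # 1.1  EQUALS SIGN
2260          ; disallowed_STD3_valid                  # 1.1  NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid                  # 1.1  NOT LESS-THAN..NOT GREATER-THAN
"""

def non_ascii_std3(lines):
    """Start code points of disallowed_STD3_valid entries above U+007F."""
    hits = []
    for line in lines.splitlines():
        m = re.match(r"([0-9A-F]{4,6})(?:\.\.[0-9A-F]{4,6})?\s*;\s*disallowed_STD3_valid", line)
        if m and int(m.group(1), 16) > 0x7F:
            hits.append(m.group(1))
    return hits

print(non_ascii_std3(sample))  # ['2260', '226E']
```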
+
+* run & fix ICU4C tests
+- Andy handles RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
+ cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
+
+- cd (CLDR UCA branch)/common/uca/
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
+
+- run genuca, see command line above;
+ deal with
+ Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
+ FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
+ (add the character to genuca.cpp sampleCharsToScripts[])
+ + look up the USCRIPT_ code for the new sample characters
+ (should be obvious from the comment in the error output)
+ + *add* mappings to sampleCharsToScripts[], do not replace them
+ (in case the script sample characters flip-flop)
+ + insert new scripts in DUCET script order, see the top_byte table
+ at the beginning of FractionalUCA.txt
+- rebuild ICU4C
+
+* Unihan collators
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollators
+ with VM arguments
+ -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
+ -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
+ -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
+ -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
+ -DUVERSION=9.0.0
+ -ea
+- run Unicode Tools
+ org.unicode.draft.GenerateUnihanCollatorFiles
+ with the same arguments
+- check CLDR diffs
+ cd ~/svn.cldr/trunk
+ meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
+ meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
+- copy to CLDR
+ cd ~/svn.cldr/trunk
+ cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
+ cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
+- commit to CLDR
+- generate ICU zh collation data: run CLDR
+ org.unicode.cldr.icu.NewLdml2IcuConverter
+ with program arguments
+ -t collation
+ -s /home/mscherer/svn.cldr/trunk/common/collation
+ -m /home/mscherer/svn.cldr/trunk/common/supplemental
+ -d /home/mscherer/svn.icu/trunk/src/source/data/coll
+ -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
+ zh
+ and VM arguments
+ -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* update Java data files
+- refresh only the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt58l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd ~/svn.icu/trunk/dbg/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC_DIR/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** LayoutEngine script information
+
+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
+ in the working directory.
+
+ (It also generates ScriptRunData.cpp, which is no longer needed.)
+
+ It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
+ (a plain text file)
+ which maps ICU versions to the numbers of script/language constants
+ that were added then.
+ (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
+
+ The generated files have a current copyright date and "@deprecated" statement.
+
+* Review changes, fix Java tool if necessary, and copy to ICU4C
+ cd ~/svn.icu4j/trunk/src
+ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools & ICU tools are checked in
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+ http://bugs.icu-project.org/trac/log/tools/trunk
+
+---------------------------------------------------------------------------- ***
+
+New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764
+
+Adding
+- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
+- new combination/alias codes: Hanb, Jamo
+ - used in CLDR 29 and in spoof checker
+- new Z* code: Zsye
+
+Add new codes to uscript.h & UScript.java, see Unicode update logs.
+ -> com.ibm.icu.lang.UScript
+ find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
+ replace public static final int \1 = \2; \3
+
+Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
+add new script codes.
+Use "long" script names only where they are established in Unicode 9's PropertyValueAliases.txt.
+
+Note: If we have to run preparseucd.py again before the Unicode 9 update,
+then we need to manually keep/restore the new script codes.
+
+ICU_ROOT=~/svn.icu/trunk
+ICU_SRC_DIR=$ICU_ROOT/src
+ICUDT=icudt57b
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
+UNIDATA=$ICU_SRC_DIR/source/data/unidata
+
+Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
+see https://unicode-org.atlassian.net/browse/ICU-12141
+
+make install, then icutools cmake & make, then
+~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
+
+Generate Java data as usual, only update pnames.icu & uprops.icu.
+
+*** LayoutEngine script information
+
+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
+ in the working directory.
+
+ (It also generates ScriptRunData.cpp, which is no longer needed.)
+
+ It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
+ (a plain text file)
+ which maps ICU versions to the numbers of script/language constants
+ that were added then.
+ (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
+
+ The generated files have a current copyright date and "@deprecated" statement.
+
+* Review changes, fix Java tool if necessary, and copy to ICU4C
+ cd ~/svn.icu4j/trunk/src
+ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
+
+---------------------------------------------------------------------------- ***
+
+Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802
+
+Edit preparseucd.py to add & parse new properties.
+They share the UCD property namespace but are not listed in PropertyAliases.txt.
+
+Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
+Initial data from emoji/2.0/
+
+ICU_ROOT=~/svn.icu/trunk
+ICU_SRC_DIR=$ICU_ROOT/src
+ICUDT=icudt56b
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
+UNIDATA=$ICU_SRC_DIR/source/data/unidata
+
+Add binary-property constants to uchar.h enum UProperty & UProperty.java.
+
+~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
+(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
+
+Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
+
+make install, then icutools cmake & make, then
+~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
+
+Generate Java data as usual, only update pnames.icu & uprops.icu.
+
+---------------------------------------------------------------------------- ***
+
+Unicode 8.0 update for ICU 56
+
+* Command-line environment setup
+
+ICU_ROOT=~/svn.icu/trunk
+ICU_SRC_DIR=$ICU_ROOT/src
+ICUDT=icudt56b
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
+UNIDATA=$ICU_SRC_DIR/source/data/unidata
+
+http://www.unicode.org/review/pri297/ -- beta review
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://unicode.org/versions/beta-8.0.0.html
+http://www.unicode.org/versions/Unicode8.0.0/
+http://www.unicode.org/reports/tr44/tr44-15.html
+
+*** ICU Trac
+
+- ticket:11574: Unicode 8
+- C++ branches/markus/uni80 at r37351 from trunk at r37343
+- Java branches/markus/uni80 at r37352 from trunk at r37338
+
+*** CLDR Trac
+
+- cldrbug 8311: UCA 8
+- branches/markus/uni80 at r11518 from trunk at r11517
+
+- cldrbug 8109: Unicode 8.0 script metadata
+- cldrbug 8418: Updated segmentation for Unicode 8.0
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* file preparation
+
+- download UCD & IDNA files
+- make sure that the Unicode data folder passed into preparseucd.py
+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
+- only for manual diffs: remove version suffixes from the file names
+ ~/unidata/uni70/20140403$ ../../desuffixucd.py .
+ (see https://sites.google.com/site/unicodetools/inputdata)
+- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+
+- also: from http://unicode.org/Public/security/8.0.0/ download new
+ confusables.txt & confusablesWholeScript.txt
+ and copy to $UNIDATA
+ ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
+ ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
+
+* initial preparseucd.py changes
+- remove new Unicode scripts from the
+ only-in-ISO-15924 list according to the error message:
+ ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
+ from _scripts_only_in_iso15924
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+- property and file name change:
+ IndicMatraCategory -> IndicPositionalCategory
+- UnicodeData.txt unusual numeric values (unreduced fractions)
+ 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
+ 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
+ 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
+ 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
+ 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
+ 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
+ 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
+ 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
+ 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
+ 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
+ -> change preparseucd.py to map them to proper fractions (e.g., 1/6)
+ which are listed in DerivedNumericValues.txt;
+ keeps storage in data file simple
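The reduction itself is plain fraction arithmetic; a sketch of the mapping (unreduced UnicodeData.txt value to the reduced DerivedNumericValues.txt form) in Python:

```python
from fractions import Fraction

def reduce_fraction(nv):
    """Reduce an 'n/m' numeric value string, e.g. '2/12' -> '1/6'."""
    num, den = map(int, nv.split("/"))
    f = Fraction(num, den)  # Fraction normalizes to lowest terms
    return f"{f.numerator}/{f.denominator}"

print(reduce_fraction("2/12"))   # 1/6
print(reduce_fraction("10/12"))  # 5/6
```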
+
+* PropertyValueAliases.txt changes
+- 10 new Block (blk) values:
+ blk; Ahom ; Ahom
+ blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
+ blk; Cherokee_Sup ; Cherokee_Supplement
+ blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
+ blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
+ blk; Hatran ; Hatran
+ blk; Multani ; Multani
+ blk; Old_Hungarian ; Old_Hungarian
+ blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
+ blk; Sutton_SignWriting ; Sutton_SignWriting
+ -> add to uchar.h
+ use long property names for enum constants
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+- 6 new Script (sc) values:
+ sc ; Ahom ; Ahom
+ sc ; Hatr ; Hatran
+ sc ; Hluw ; Anatolian_Hieroglyphs
+ sc ; Hung ; Old_Hungarian
+ sc ; Mult ; Multani
+ sc ; Sgnw ; SignWriting
+ -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg
+ bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
+
+* build Unicode tools using CMake+make
+
+~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
+
+ # Location (--prefix) of where ICU was installed.
+ set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
+ # Location of the ICU source tree.
+ set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
+
+ ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
+ ~/svn.icutools/trunk/dbg/unicode/c$ make
+
+* generate core properties data files
+- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
+- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
+- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
+- rebuild ICU (make install) & tools
+- run genuca again (see step above) so that it picks up the new nfc.nrm
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..8.0: U+2260, U+226E, U+226F
+- nothing new in 8.0, no test file to update
+
+* run & fix ICU4C tests
+- bad Cherokee case folding due to difference in fallbacks:
+ UCD case folding falls back to no mapping,
+ ICU runtime case folding falls back to lowercasing;
+ fixed casepropsbuilder.cpp to generate scf mappings to self
+ when there is an slc mapping but no scf
+- Andy handles RBBI & spoof check test failures
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
+- cd (CLDR UCA branch)/common/uca/
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
+- run genuca, see command line above;
+ deal with
+ Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
+ (add the character to genuca.cpp sampleCharsToScripts[])
+ + look up the script for the new sample characters
+ (e.g., in FractionalUCA.txt)
+ + *add* mappings to sampleCharsToScripts[], do not replace them
+ (in case the script sample characters flip-flop)
+ + insert new scripts in DUCET script order, see the top_byte table
+ at the beginning of FractionalUCA.txt
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+- fixed bug in CollationWeights::getWeightRanges()
+ exposed by new data and CollationTest::TestRootElements
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt56l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd ~/svn.icu/trunk/dbg/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC_DIR/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** LayoutEngine script information
+
+* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
+ because the layout engine was deprecated in ICU 54.
+ Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
+ to write lines that we used to add manually.
+
+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
+ in the working directory.
+
+ (It also generates ScriptRunData.cpp, which is no longer needed.)
+
+  In addition, it reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
+ (a plain text file)
+ which maps ICU versions to the numbers of script/language constants
+ that were added then.
+ (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
+
+ The generated files have a current copyright date and "@deprecated" statement.
+
+* Review changes, fix Java tool if necessary, and copy to ICU4C
+ cd ~/svn.icu4j/trunk/src
+ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools & ICU tools are checked in
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+ http://bugs.icu-project.org/trac/log/tools/trunk
+
+---------------------------------------------------------------------------- ***
+
+Unicode 7.0 update for ICU 54
+
+http://www.unicode.org/review/pri271/ -- beta review
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
+http://www.unicode.org/reports/tr44/tr44-13.html
+
+*** ICU Trac
+
+- ticket 10821: Unicode 7.0, UCA 7.0
+- C++ branches/markus/uni70 at r35584 from trunk at r35580
+- Java branches/markus/uni70 at r35587 from trunk at r35545
+
+*** CLDR Trac
+
+- ticket 7195: UCA 7.0 CLDR root collation
+- branches/markus/uni70 at r10062 from trunk at r10061
+
+- ticket 6762: script metadata for Unicode 7.0 new scripts
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* file preparation
+
+- download UCD & IDNA files
+- make sure that the Unicode data folder passed into preparseucd.py
+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
+- only for manual diffs: remove version suffixes from the file names
+ ~/unidata/uni70/20140403$ ../../desuffixucd.py .
+ (see https://sites.google.com/site/unicodetools/inputdata)
+- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+- Restore TODO diffs in source/data/unidata/UCARules.txt
+ cd $ICU_SRC_DIR
+ meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
+- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
+
+- also: from http://unicode.org/Public/security/7.0.0/ download new
+ confusables.txt & confusablesWholeScript.txt
+ and copy to $ICU_ROOT/src/source/data/unidata/
+
+* initial preparseucd.py changes
+- remove new Unicode scripts from the
+ only-in-ISO-15924 list according to the error message:
+ ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
+ 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
+ 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
+ from _scripts_only_in_iso15924
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+- NamesList.txt now has a heading with a non-ASCII character
+ + keep ppucd.txt in platform charset, rather than changing tool/test parsers
+ + escape non-ASCII characters in heading comments
+- preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, which is currently still at 2013
+ + get the copyright from the first file whose copyright line contains the current year
+
+* PropertyValueAliases.txt changes
+- 32 new Block (blk) values:
+ blk; Bassa_Vah ; Bassa_Vah
+ blk; Caucasian_Albanian ; Caucasian_Albanian
+ blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
+ blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
+ blk; Duployan ; Duployan
+ blk; Elbasan ; Elbasan
+ blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
+ blk; Grantha ; Grantha
+ blk; Khojki ; Khojki
+ blk; Khudawadi ; Khudawadi
+ blk; Latin_Ext_E ; Latin_Extended_E
+ blk; Linear_A ; Linear_A
+ blk; Mahajani ; Mahajani
+ blk; Manichaean ; Manichaean
+ blk; Mende_Kikakui ; Mende_Kikakui
+ blk; Modi ; Modi
+ blk; Mro ; Mro
+ blk; Myanmar_Ext_B ; Myanmar_Extended_B
+ blk; Nabataean ; Nabataean
+ blk; Old_North_Arabian ; Old_North_Arabian
+ blk; Old_Permic ; Old_Permic
+ blk; Ornamental_Dingbats ; Ornamental_Dingbats
+ blk; Pahawh_Hmong ; Pahawh_Hmong
+ blk; Palmyrene ; Palmyrene
+ blk; Pau_Cin_Hau ; Pau_Cin_Hau
+ blk; Psalter_Pahlavi ; Psalter_Pahlavi
+ blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
+ blk; Siddham ; Siddham
+ blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
+ blk; Sup_Arrows_C ; Supplemental_Arrows_C
+ blk; Tirhuta ; Tirhuta
+ blk; Warang_Citi ; Warang_Citi
+ -> add to uchar.h
+ use long property names for enum constants
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+- 28 new Joining_Group (jg) values:
+ jg ; Manichaean_Aleph ; Manichaean_Aleph
+ jg ; Manichaean_Ayin ; Manichaean_Ayin
+ jg ; Manichaean_Beth ; Manichaean_Beth
+ jg ; Manichaean_Daleth ; Manichaean_Daleth
+ jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
+ jg ; Manichaean_Five ; Manichaean_Five
+ jg ; Manichaean_Gimel ; Manichaean_Gimel
+ jg ; Manichaean_Heth ; Manichaean_Heth
+ jg ; Manichaean_Hundred ; Manichaean_Hundred
+ jg ; Manichaean_Kaph ; Manichaean_Kaph
+ jg ; Manichaean_Lamedh ; Manichaean_Lamedh
+ jg ; Manichaean_Mem ; Manichaean_Mem
+ jg ; Manichaean_Nun ; Manichaean_Nun
+ jg ; Manichaean_One ; Manichaean_One
+ jg ; Manichaean_Pe ; Manichaean_Pe
+ jg ; Manichaean_Qoph ; Manichaean_Qoph
+ jg ; Manichaean_Resh ; Manichaean_Resh
+ jg ; Manichaean_Sadhe ; Manichaean_Sadhe
+ jg ; Manichaean_Samekh ; Manichaean_Samekh
+ jg ; Manichaean_Taw ; Manichaean_Taw
+ jg ; Manichaean_Ten ; Manichaean_Ten
+ jg ; Manichaean_Teth ; Manichaean_Teth
+ jg ; Manichaean_Thamedh ; Manichaean_Thamedh
+ jg ; Manichaean_Twenty ; Manichaean_Twenty
+ jg ; Manichaean_Waw ; Manichaean_Waw
+ jg ; Manichaean_Yodh ; Manichaean_Yodh
+ jg ; Manichaean_Zayin ; Manichaean_Zayin
+ jg ; Straight_Waw ; Straight_Waw
+ -> uchar.h & UCharacter.JoiningGroup
+- 23 new Script (sc) values:
+ sc ; Aghb ; Caucasian_Albanian
+ sc ; Bass ; Bassa_Vah
+ sc ; Dupl ; Duployan
+ sc ; Elba ; Elbasan
+ sc ; Gran ; Grantha
+ sc ; Hmng ; Pahawh_Hmong
+ sc ; Khoj ; Khojki
+ sc ; Lina ; Linear_A
+ sc ; Mahj ; Mahajani
+ sc ; Mani ; Manichaean
+ sc ; Mend ; Mende_Kikakui
+ sc ; Modi ; Modi
+ sc ; Mroo ; Mro
+ sc ; Narb ; Old_North_Arabian
+ sc ; Nbat ; Nabataean
+ sc ; Palm ; Palmyrene
+ sc ; Pauc ; Pau_Cin_Hau
+ sc ; Perm ; Old_Permic
+ sc ; Phlp ; Psalter_Pahlavi
+ sc ; Sidd ; Siddham
+ sc ; Sind ; Khudawadi
+ sc ; Tirh ; Tirhuta
+ sc ; Wara ; Warang_Citi
+ -> uscript.h (many were added before)
+ comment "Mende Kikakui" for USCRIPT_MENDE
+ add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
+ -> com.ibm.icu.lang.UScript
+ find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
+ replace public static final int \1 = \2; \3
+- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
+ (added 2012-11-01)
+ Ahom 338 Ahom
+ Hatr 127 Hatran
+ Mult 323 Multani
+ (added 2013-10-12)
+ Modi 324 Modi
+ Pauc 263 Pau Cin Hau
+ Sidd 302 Siddham
+ -> uscript.h (some overlap with additions from Unicode)
+ -> com.ibm.icu.lang.UScript
+ find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
+ replace public static final int \1 = \2; \3
+ -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
+ -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+
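The Eclipse find/replace steps above are plain regex rewrites and can be scripted; a sketch in Python (the sample uchar.h/uscript.h lines and constant values here are illustrative, not real ICU source):

```python
import re

# uchar.h UBlockCode line -> UCharacter.UnicodeBlock int constant
line = "UBLOCK_PAU_CIN_HAU = 256, /*[11AC0]*/"
id_const = re.sub(r'UBLOCK_([^ ]+) = ([0-9]+), (/.+)',
                  r'public static final int \1_ID = \2; \3', line)

# ... and the matching UnicodeBlock object
obj = re.sub(r'UBLOCK_([^ ]+) = [0-9]+, (/.+)',
             r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
             line)

# uscript.h UScriptCode line -> UScript int constant
sline = "USCRIPT_PAU_CIN_HAU = 165, /* Pauc */"
sc_const = re.sub(r'USCRIPT_([^ ]+) *= ([0-9]+),(.+)',
                  r'public static final int \1 = \2; \3', sline)
```

The patterns are the same ones given above for Eclipse; running them over the whole copied enum block produces the Java constant lists in one pass.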
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
+
+* generate normalization data files
+- cd $ICU_ROOT/dbg
+- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
+- UNIDATA=$ICU_SRC_DIR/source/data/unidata
+- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
+- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
+- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
+- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
+
+* build Unicode tools using CMake+make
+
+~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
+# Location of the ICU source tree.
+set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
+
+~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
+~/svn.icutools/trunk/dbg/unicode/c$ make
+
+* genprops work
+- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
+ + add second array of Joining_Group values for at most 10800..10FFF
+ icutools: unicode/c/genprops/bidipropsbuilder.cpp
+ icu: source/common/ubidi_props.h/.c/_data.h
+ icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
+
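The "second array" above means the Joining_Group lookup checks two dense code point ranges instead of one span from 0600 all the way to 10FFF; a minimal sketch of that shape (the ranges and constant values are placeholders, not the real ubidi_props data):

```python
# Placeholder ranges and values; the real ubidi_props layout differs --
# this only illustrates the two-dense-arrays lookup shape.
RANGE1_START, RANGE1_LIMIT = 0x0600, 0x0900      # original Arabic-script range
RANGE2_START, RANGE2_LIMIT = 0x10800, 0x11000    # new supplementary range
jg1 = [0] * (RANGE1_LIMIT - RANGE1_START)
jg2 = [0] * (RANGE2_LIMIT - RANGE2_START)

JG_MANICHAEAN_ALEPH = 58  # hypothetical constant
jg2[0x10AC0 - RANGE2_START] = JG_MANICHAEAN_ALEPH

def joining_group(c):
    if RANGE1_START <= c < RANGE1_LIMIT:
        return jg1[c - RANGE1_START]
    if RANGE2_START <= c < RANGE2_LIMIT:
        return jg2[c - RANGE2_START]
    return 0  # No_Joining_Group
```

Two small arrays keep the data dense; a single array covering 0600..10FFF would be mostly zeros.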
+* generate core properties data files
+- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
+- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
+- rebuild ICU (make install) & tools
+- run genuca again (see step above) so that it picks up the new nfc.nrm
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..7.0: U+2260, U+226E, U+226F
+- nothing new in 7.0, no test file to update
+
+* run & fix ICU4C tests
+
+* update Java data files
+- refresh just the UCD-related files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt53l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
+ echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files
+ ICUDT=icudt54b
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cd ~/svn.icu/uni70/dbg/data/out/icu4j
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+- refresh ICU4J
+ ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC_DIR/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* UCA
+
+- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
+- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
+- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
+- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
+- output files are in ~/svn.unitools/Generated/uca/7.0.0/
+- review data; compare files, use blankweights.sed or similar
+ ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
+- cd ~/svn.unitools/Generated/uca/7.0.0/
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ (note removing the underscore before "Rules")
+ cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
+ cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
+- run genuca, see command line above
+- rebuild ICU4C
+- refresh ICU4J collation data:
+ (subset of instructions above for properties data refresh, except copies all coll/*)
+ ICUDT=icudt54b
+ ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
+- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
+ ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
+
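The blankweights.sed pass above blanks the actual weight bytes so that a diff of old vs. new FractionalUCA.txt shows only structural changes; a rough Python equivalent (the bracketed hex-weight syntax used here is an assumption about the file format):

```python
import re

def blank_weights(line):
    """Blank the hex weight bytes inside [...] so that diffing two
    FractionalUCA.txt versions shows structural changes only.
    The "[29 32, 05, 05]" syntax is an assumption about the format."""
    return re.sub(r'\[([0-9A-F, ]+)\]',
                  lambda m: '[' + re.sub(r'[0-9A-F]+', 'xx', m.group(1)) + ']',
                  line)
```

Each run of hex digits becomes "xx" while brackets and commas stay put, so the diff still lines up character for character.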
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+* run & fix ICU4J tests
+
+*** LayoutEngine script information
+
+(For details see the Unicode 5.2 change log below.)
+
+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
+ in the working directory.
+ (It also generates ScriptRunData.cpp, which is no longer needed.)
+
+ The generated files have a current copyright date and "@stable" statement.
+ ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
+ for "born stable" Unicode API constants, and to stop parsing ICU version numbers
+ which may not contain dots any more.
+
+- diff current <icu>/source/layout files vs. generated ones
+ ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
+ review and manually merge desired changes;
+ fix gratuitous changes, incorrect @draft/@stable and missing aliases;
+ Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
+- if you just copy the above files, then
+ fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
+ manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+
+---------------------------------------------------------------------------- ***
+
+Unicode 6.3 update
+
+http://www.unicode.org/review/pri249/ -- beta review
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
+http://www.unicode.org/reports/tr44/tr44-11.html
+
+*** ICU Trac
+
+- ticket 10128: update ICU to Unicode 6.3 beta
+- ticket 10168: update ICU to Unicode 6.3 final
+- C++ branches/markus/uni63 at r33552 from trunk at r33551
+- Java branches/markus/uni63 at r33550 from trunk at r33553
+
+- ticket 10142: implement Unicode 6.3 bidi algorithm additions
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+ (configure.in & configure have been modified to extract the version from uchar.h)
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* file preparation
+
+- download UCD, UCA & IDNA files
+- make sure that the Unicode data folder passed into preparseucd.py
+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
+- modify preparseucd.py:
+ parse new file BidiBrackets.txt
+ with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
+- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+- Check test file diffs for previously commented-out, known-failing data lines;
+ probably need to keep those commented out.
+
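BidiBrackets.txt follows the usual UCD semicolon-separated layout; a parsing sketch of the kind preparseucd.py needs for the new bpb/bpt properties (the sample lines follow the published format):

```python
def parse_bidi_brackets(lines):
    """Parse BidiBrackets.txt-style lines:
    <code point>; <Bidi_Paired_Bracket>; <Bidi_Paired_Bracket_Type o/c/n>"""
    pairs = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        code, paired, bpt = (f.strip() for f in line.split(';'))
        pairs[int(code, 16)] = (int(paired, 16), bpt)
    return pairs

sample = [
    "# Bidi_Paired_Bracket, Bidi_Paired_Bracket_Type",
    "0028; 0029; o # LEFT PARENTHESIS",
    "0029; 0028; c # RIGHT PARENTHESIS",
]
```

Each bracket maps to its paired bracket plus an Open/Close/None type, which is what ends up encoded in ubidi.icu.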
+* PropertyAliases.txt changes
+- 1 new Enumerated Property
+ bpt ; Bidi_Paired_Bracket_Type
+ -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
+ -> ubidi_props.h & .c & UBiDiProps.java
+ -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
+ -> uprops.cpp
+ -> change ubidi.icu format version from 2.0 to 2.1
+- 1 new Miscellaneous Property
+ bpb ; Bidi_Paired_Bracket
+ -> uchar.h & UProperty.java
+ -> ppucd.h & .cpp
+
+* PropertyValueAliases.txt changes
+- 3 Bidi_Paired_Bracket_Type (bpt) values:
+ bpt; c ; Close
+ bpt; n ; None
+ bpt; o ; Open
+ -> uchar.h & UCharacter.BidiPairedBracketType
+ -> ubidi_props.h & .c & UBiDiProps.java
+ -> change ubidi.icu format version from 2.0 to 2.1
+- 4 new Bidi_Class (bc) values:
+ bc ; FSI ; First_Strong_Isolate
+ bc ; LRI ; Left_To_Right_Isolate
+ bc ; RLI ; Right_To_Left_Isolate
+ bc ; PDI ; Pop_Directional_Isolate
+ -> uchar.h & UCharacterEnums.ECharacterDirection
+ -> until the bidi code gets updated,
+ Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
+- 3 new Word_Break (WB) values:
+ WB ; HL ; Hebrew_Letter
+ WB ; SQ ; Single_Quote
+ WB ; DQ ; Double_Quote
+ -> uchar.h & UCharacter.WordBreak
+ -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
+- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
+ (added 2012-10-16)
+ Aghb 239 Caucasian Albanian
+ Mahj 314 Mahajani
+ -> uscript.h
+ -> com.ibm.icu.lang.UScript
+ find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
+ replace public static final int \1 = \2;\3
+ -> preparseucd.py _scripts_only_in_iso15924
+ -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+ -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+
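The remark above that Word_Break constants now exceed 4 bits (so the packed-properties bit field must widen) checks out arithmetically:

```python
# The previous 14 Word_Break values fit in 4 bits; adding HL, SQ, DQ
# makes 17, which needs 5 bits in the packed properties word.
num_values = 17
bits_needed = (num_values - 1).bit_length()
```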
+* generate normalization data files
+- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
+- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
+- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
+- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
+- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
+- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
+
+* build Unicode tools using CMake+make
+
+~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
+
+# Location (--prefix) of where ICU was installed.
+set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
+# Location of the ICU source tree.
+set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
+
+~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
+~/svn.icutools/trunk/dbg/unicode/c$ make
+
+* generate core properties data files
+- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
+- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
+- rebuild ICU (make install) & tools
+- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..6.3: U+2260, U+226E, U+226F
+- nothing new in 6.3, no test file to update
+
+* update Java data files
+- refresh just the UCD-related files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt52l
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
+ echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
+- refresh ICU4J
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
+
+- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
+- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ (note removing the underscore before "Rules")
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
+- check test file diffs for previously commented-out, known-failing data lines;
+ probably need to keep those commented out
+- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
+- run genuca, see command line above
+- rebuild ICU4C
+- refresh ICU4J collation data:
+ (subset of instructions above for properties data refresh, except copies all coll/*)
+ ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
+ ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
+- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* test ICU, fix test code where necessary
+
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+*** LayoutEngine script information
+- skipped for Unicode 6.3: no new scripts
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+
+---------------------------------------------------------------------------- ***
+
+Unicode 6.2 update
+
+http://www.unicode.org/review/pri230/
+http://www.unicode.org/versions/beta-6.2.0.html
+http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
+http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
+http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
+http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
+http://www.unicode.org/reports/tr46/tr46-8.html IDNA
+http://unicode.org/Public/idna/6.2.0/
+
+*** ICU Trac
+
+- ticket 9515: Unicode 6.2: final ICU update
+
+- ticket 9514: UCA 6.2: fix UCARules.txt
+
+- ticket 9437: update ICU to Unicode 6.2
+- C++ branches/markus/uni62 at r32050 from trunk at r32041
+- Java branches/markus/uni62 at r32068 from trunk at r32066
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+ (configure.in & configure: have been modified to extract the version from uchar.h)
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+*** data files & enums & parser code
+
+* file preparation
+
+- download UCD, UCA & IDNA files
+- make sure that the Unicode data folder passed into preparseucd.py
+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
+- modify preparseucd.py: NamesList.txt is now in UTF-8
+- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+- Check test file diffs for previously commented-out, known-failing data lines;
+ probably need to keep those commented out.
+
+* PropertyValueAliases.txt changes
+- 1 new Line_Break (lb) value:
+ lb ; RI ; Regional_Indicator
+ -> uchar.h & UCharacter.LineBreak
+- 1 new Word_Break (WB) value:
+ WB ; RI ; Regional_Indicator
+ -> uchar.h & UCharacter.WordBreak
+- 1 new Grapheme_Cluster_Break (GCB) value:
+ GCB; RI ; Regional_Indicator
+ -> uchar.h & UCharacter.GraphemeClusterBreak
+
+* 3 new numeric values
+  The new value -1 (really intended to be NaN, but that would have required
+  new UnicodeData.txt syntax) can already be represented as a "fraction" of -1/1,
+  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
+ cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
+ cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
+ The two new values 216000 and 432000 require an addition to the encoding of numeric values.
+ cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
+ cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
+ -> uprops.h, uchar.c & UCharacterProperty.java
+ -> cucdtst.c & UCharacterTest.java
+
+* generate normalization data files
+- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
+- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
+- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
+- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
+- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
+- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+* build Unicode tools using CMake+make
+
+* generate core properties data files
+- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
+- in initial bootstrapping, change the UCA version
+ in source/data/unidata/FractionalUCA.txt to match the new Unicode version
+- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
+- rebuild ICU (make install) & tools
+  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
+ check if the UCA version in FractionalUCA.txt matches the new Unicode version
+ (see step above)
+- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..6.2: U+2260, U+226E, U+226F
+- nothing new in 6.2, no test file to update
+
+* update Java data files
+- refresh just the UCD-related files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt50l
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
+ echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
+ ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
+ ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
+ ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
+ ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
+ ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
+- refresh ICU4J
+ ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* UCA
+
+- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
+- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ (note removing the underscore before "Rules")
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
+- check test file diffs for previously commented-out, known-failing data lines;
+ probably need to keep those commented out
+- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
+- run genuca, see command line above
+- rebuild ICU4C
+- refresh ICU4J collation data:
+ (subset of instructions above for properties data refresh, except copies all coll/*)
+ ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
+ ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
+ ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
+- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* test ICU, fix test code where necessary
+
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+*** LayoutEngine script information
+- skipped for Unicode 6.2: no new scripts
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+
+---------------------------------------------------------------------------- ***
+
+Future Unicode update
+
+Tools simplified since the Unicode 6.1 update. See
+- https://icu.unicode.org/design/props/ppucd
+- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
+
+* Unicode version numbers
+- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
+
+* file preparation
+- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
+- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+- Check test file diffs for previously commented-out, known-failing data lines;
+ probably need to keep those commented out.
+
+* PropertyValueAliases.txt changes
+- Script codes that are in ISO 15924 but not in Unicode are now listed in
+ preparseucd.py, in the _scripts_only_in_iso15924 variable.
+ If there are new ISO codes, then add them.
+ If Unicode adds some of them, then remove them from the .py variable.
+
+* UnicodeData.txt changes
+- No more manual changes for CJK ranges for algorithmic names;
+ those are now written to ppucd.txt and genprops reads them from there.
+
+* generate core properties data files (makeprops.sh was deleted)
+- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
+
+* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
+- it is now generated by preparseucd.py
+
+* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
+- it is now generated by preparseucd.py
+- make sure that the Unicode data folder passed into preparseucd.py
+ includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
+ (can be in some subfolder)
+
+* generate normalization data files
+- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
+- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
+- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
+- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
+- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
+- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+* build Unicode tools using CMake+make
+
+* new way to call genuca (makeuca.sh was deleted)
+- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
+
+---------------------------------------------------------------------------- ***
+
+Unicode 6.1 update
+
+*** ICU Trac
+
+- ticket 8995 final update to Unicode 6.1
+- ticket 8994 regenerate source/layout/CanonData.cpp
+
+- ticket 8961 support Unicode "Age" value *names*
+- ticket 8963 support multiple character name aliases & types
+
+- ticket 8827 "update ICU to Unicode 6.1"
+- C++ branches/markus/uni61 at r30864 from trunk at r30843
+- Java branches/markus/uni61 at r30865 from trunk at r30863
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+ (configure.in & configure: have been modified to extract the version from uchar.h)
+- com.ibm.icu.util.VersionInfo
+- icutools/unicode/makedefs.sh
+ + also review & update other definitions in that file,
+ e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
+
+*** data files & enums & parser code
+
+* file preparation
+
+~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
+- This prepares both unidata and testdata files in respective output subfolders.
+- Check test file diffs for previously commented-out, known-failing data lines;
+ probably need to keep those commented out.
+
+* PropertyValueAliases.txt changes
+- 11 new block names:
+ Arabic_Extended_A
+ Arabic_Mathematical_Alphabetic_Symbols
+ Chakma
+ Meetei_Mayek_Extensions
+ Meroitic_Cursive
+ Meroitic_Hieroglyphs
+ Miao
+ Sharada
+ Sora_Sompeng
+ Sundanese_Supplement
+ Takri
+ -> add to uchar.h
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+- 1 new Joining_Group (jg) value:
+ Rohingya_Yeh
+ -> uchar.h & UCharacter.JoiningGroup
+- 2 new Line_Break (lb) values:
+ CJ=Conditional_Japanese_Starter
+ HL=Hebrew_Letter
+ -> uchar.h & UCharacter.LineBreak
+- 7 new scripts:
+ sc ; Cakm ; Chakma
+ sc ; Merc ; Meroitic_Cursive
+ sc ; Mero ; Meroitic_Hieroglyphs
+ sc ; Plrd ; Miao
+ sc ; Shrd ; Sharada
+ sc ; Sora ; Sora_Sompeng
+ sc ; Takr ; Takri
+ -> remove these from SyntheticPropertyValueAliases.txt
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
+ (added 2011-06-21)
+ Khoj 322 Khojki
+ Tirh 326 Tirhuta
+ and another one added 2011-12-09
+ Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
+ -> uscript.h
+ -> com.ibm.icu.lang.UScript
+ find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
+ replace public static final int \1 = \2;\3
+ -> SyntheticPropertyValueAliases.txt
+ -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
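The Eclipse find/replace steps above are plain regex substitutions; a Python sketch of the uscript.h-to-UScript conversion (the enum line and its numeric value are illustrative, not a verified excerpt):

```python
import re

# Sketch of the Eclipse find/replace above: convert a uscript.h enum line
# into the corresponding com.ibm.icu.lang.UScript constant. The input line
# is a hypothetical example; the value 157 is not verified against uscript.h.
line = "USCRIPT_KHOJKI = 157, /* Khoj */"
java = re.sub(r'USCRIPT_([^ ]+) *= ([0-9]+),(.+)',
              r'public static final int \1 = \2;\3', line)
print(java)  # public static final int KHOJKI = 157; /* Khoj */
```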
+
+* UnicodeData.txt changes

+- the last Unihan code point changes from U+9FCB to U+9FCC
+ search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
+ + do change gennames.c
+ + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
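The case-insensitive regex search above can be sketched as follows; the source lines are hypothetical stand-ins for ucol.cpp / gennames.c content:

```python
import re

# Sketch of the search step: scan source lines for the old Unihan end
# (9FCB) and new limit (9FCC) using the regex from the notes above.
# The lines below are hypothetical, not real excerpts from the ICU sources.
pattern = re.compile(r'9FC[BC]', re.IGNORECASE)
lines = [
    "    if (cp > 0x9fcb) { return FALSE; }   // last Unihan code point",
    "    static const UChar32 CJK_LIMIT = 0x9FCC;",
    "    // unrelated line",
]
matches = [line for line in lines if pattern.search(line)]
print(len(matches))  # 2
```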
+
+* DerivedBidiClass.txt changes
+- 2 new default-AL blocks:
+# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
+# Arabic Mathematical Alphabetic Symbols:
+# U+1EE00 - U+1EEFF (was default-R)
+- 2 new default-R blocks:
+# Meroitic Hieroglyphs:
+# U+10980 - U+1099F
+# Meroitic Cursive: U+109A0 - U+109FF
+ -> should be picked up by the explicit data in the file
+
+* NameAliases.txt changes
+- from
+ # Each line has two fields
+ # First field: Code point
+ # Second field: Alias
+- to
+ # Each line has three fields, as described here:
+ #
+ # First field: Code point
+ # Second field: Alias
+ # Third field: Type
+- Also, while the file format previously allowed multiple aliases per code point,
+  it only now actually provides multiple, even multiple of the same type. For example,
+ FEFF;BYTE ORDER MARK;alternate
+ FEFF;BOM;abbreviation
+ FEFF;ZWNBSP;abbreviation
+- This breaks our gennames parser, unames.icu data structure, and API.
+ Fix gennames to only pick up "correction" aliases.
+ New ticket #8963 for further changes.
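The "only pick up correction aliases" fix can be sketched as a small parser for the new three-field format (this is an illustrative sketch, not the actual gennames code; the FEFF lines are the example above, and the 01A2 line is a correction-type entry from NameAliases.txt):

```python
# Minimal sketch (not the actual gennames code) of parsing three-field
# NameAliases.txt lines while keeping only "correction" aliases.
def correction_aliases(lines):
    out = {}
    for line in lines:
        line = line.split('#')[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cp, alias, alias_type = (f.strip() for f in line.split(';'))
        if alias_type == 'correction':
            out[int(cp, 16)] = alias
    return out

sample = [
    "# comment line",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
]
print(correction_aliases(sample))  # {418: 'LATIN CAPITAL LETTER GHA'}
```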
+
+* run genpname/preparse.pl (on Linux)
+ + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
+ + make sure that data.h is writable
+ + perl preparse.pl ~/svn.icu/trunk/src > out.txt
+ + preparse.pl shows no errors, out.txt Info and Warning lines look ok
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+* build Unicode tools (at least genpname) using CMake+make
+
+* run genpname
+ (builds both pnames.icu and propname_data.h)
+- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
+- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
+
+* build ICU (make install)
+* build Unicode tools using CMake+make
+
+* update source/data/unidata/norm2/nfkc_cf.txt
+- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
+
+* update source/data/unidata/norm2/uts46.txt
+- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
+ to ~/svn.icu/tools/trunk/src/unicode/py
+- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
+- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
+- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..6.1: U+2260, U+226E, U+226F
+- nothing new in 6.1, no test file to update
+
+* generate core properties data files
+- in initial bootstrapping, change the UCA version
+ in source/data/unidata/FractionalUCA.txt to match the new Unicode version
+- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
+- rebuild ICU & tools
+  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
+ check if the UCA version in FractionalUCA.txt matches the new Unicode version
+ (see step above)
+- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
+ ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
+- rebuild ICU & tools
+
+* update Java data files
+- refresh just the UCD-related files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt49l
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
+ echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
+ ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
+- refresh ICU4J
+ ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* test ICU so far, fix test code where necessary
+- temporarily ignore collation issues that look like UCA/UCD mismatches,
+ until UCA data is updated
+
+* UCA
+
+- get output from Mark's tools; look in
+ http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ (note removing the underscore before "Rules")
+- update (ICU)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
+- check test file diffs for previously commented-out, known-failing data lines;
+ probably need to keep those commented out
+- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
+- run makeuca.sh:
+ ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
+- rebuild ICU4C
+- refresh ICU4J collation data:
+ (subset of instructions above for properties data refresh, except copies all coll/*)
+ ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
+ ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
+- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+*** LayoutEngine script information
+
+(For details see the Unicode 5.2 change log below.)
+
+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
+ in the working directory.
+ (It also generates ScriptRunData.cpp, which is no longer needed.)
+
+ The generated files have a current copyright date and "@draft" statement.
+
+- diff current <icu>/source/layout files vs. generated ones
+ ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
+ review and manually merge desired changes;
+ fix gratuitous changes, incorrect @draft and missing aliases;
+ Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
+- if you just copy the above files, then
+ fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
+ manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+
+---------------------------------------------------------------------------- ***
+
+ICU 4.8 (no Unicode update, just new script codes)
+
+* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
+ (added 2010-12-21)
+ Afak 439 Afaka
+ Jurc 510 Jurchen
+ Mroo 199 Mro, Mru
+ Nshu 499 Nüshu
+ Shrd 319 Sharada, Śāradā
+ Sora 398 Sora Sompeng
+ Takr 321 Takri, Ṭākrī, Ṭāṅkrī
+ Tang 520 Tangut
+ Wole 480 Woleai
+ -> uscript.h
+ -> com.ibm.icu.lang.UScript
+ find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
+ replace public static final int \1 = \2;\3
+ -> genpname/SyntheticPropertyValueAliases.txt
+ -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+
+* run genpname/preparse.pl (on Linux)
+ + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
+ + make sure that data.h is writable
+ + perl preparse.pl ~/svn.icu/trunk/src > out.txt
+ + preparse.pl shows no errors, out.txt Info and Warning lines look ok
+
+* rebuild Unicode tools (at least genpname) using make
+- You might first need to "make install" ICU so that the tools build can pick
+ up the new definitions from the installed header files.
+
+* run genpname
+ (builds both pnames.icu and propname_data.h)
+- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
+- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
+- rebuild ICU & tools
+
+* run genprops
+- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
+- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
+- rebuild ICU & tools
+
+* update Java data files
+- refresh just the UCD-related files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
+ ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
+ ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
+- refresh ICU4J
+ ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
+
+* should have updated the layout engine script codes but forgot
+
+---------------------------------------------------------------------------- ***
+
+Unicode 6.0 update
+
+*** related ICU Trac tickets
+
+7264 Unicode 6.0 Update
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+ (configure.in & configure: have been modified to extract the version from uchar.h)
+- com.ibm.icu.util.VersionInfo
+
+*** data files & enums & parser code
+
+* file preparation
+
+~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
+- This now prepares both unidata and testdata files in respective output subfolders.
+
+* PropertyAliases.txt changes
+- new Script_Extensions property defined in the new ScriptExtensions.txt file
+ but not listed in PropertyAliases.txt; reported to unicode.org;
+ -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
+ scx; Script_Extensions
+ -> uchar.h with new UProperty section
+ -> com.ibm.icu.lang.UProperty, parallel with uchar.h
+
+* PropertyValueAliases.txt changes
+- 12 new block names:
+ Alchemical_Symbols
+ Bamum_Supplement
+ Batak
+ Brahmi
+ CJK_Unified_Ideographs_Extension_D
+ Emoticons
+ Ethiopic_Extended_A
+ Kana_Supplement
+ Mandaic
+ Miscellaneous_Symbols_And_Pictographs
+ Playing_Cards
+ Transport_And_Map_Symbols
+ -> add to uchar.h
+ -> add to UCharacter.UnicodeBlock
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+- Joining_Group (jg) values:
+    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal, which becomes an alias
+ -> uchar.h & UCharacter.JoiningGroup
+- 3 new scripts:
+ sc ; Batk ; Batak
+ sc ; Brah ; Brahmi
+ sc ; Mand ; Mandaic
+ -> remove these from SyntheticPropertyValueAliases.txt
+ -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
+ (added 2009-11-11..2010-07-18)
+ Bass 259 Bassa Vah
+    Dupl 755 Duployan shorthand
+ Elba 226 Elbasan
+ Gran 343 Grantha
+ Kpel 436 Kpelle
+ Loma 437 Loma
+ Mend 438 Mende
+ Merc 101 Meroitic Cursive
+ Narb 106 Old North Arabian
+ Nbat 159 Nabataean
+ Palm 126 Palmyrene
+ Sind 318 Sindhi
+ Wara 262 Warang Citi
+ -> uscript.h
+ -> com.ibm.icu.lang.UScript
+ find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
+ replace public static final int \1 = \2;\3
+ -> SyntheticPropertyValueAliases.txt
+ -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+- ISO 15924 name change
+ Mero 100 Meroitic Hieroglyphs (was Meroitic)
+ -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
+- property value alias added for Cham, which was already moved out of SyntheticPropertyValueAliases.txt
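The Eclipse find/replace used above for the new UnicodeBlock constants can be sketched with Python's re module (group syntax adjusted; the sample enum line and its numeric value are made up for illustration):

```python
import re

# Sketch of the Eclipse find/replace above: turn a uchar.h UBLOCK_...
# enum line into the corresponding UCharacter.UnicodeBlock constant.
_UBLOCK_LINE = re.compile(r'UBLOCK_([^ ]+) = [0-9]+, (/.+)')

def to_unicode_block_constant(line):
    return _UBLOCK_LINE.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line)
```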
+
+* UnicodeData.txt changes
+- new CJK block:
+ 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
+ 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
+ -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
+
+* build Unicode tools using CMake+make
+
+* run genpname/preparse.pl (on Linux)
+ + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
+ + make sure that data.h is writable
+ + perl preparse.pl ~/svn.icu/trunk/src > out.txt
+ + preparse.pl shows no errors, out.txt Info and Warning lines look ok
+
+* rebuild Unicode tools (at least genpname) using make
+- You might first need to "make install" ICU so that the tools build can pick
+ up the new definitions from the installed header files.
+
+* run genpname
+- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
+- rebuild ICU & tools
+
+* update source/data/unidata/norm2/nfkc_cf.txt
+- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
+
+* update source/data/unidata/norm2/uts46.txt
+- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
+ to ~/svn.icu/tools/trunk/src/unicode/py
+- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
+- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
+- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0: U+2260, U+226E, U+226F
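The grep step above can be sketched as a small filter (illustrative only; real IdnaMappingTable.txt lines may carry more fields and comments):

```python
# Sketch of the grep above: list non-ASCII entries whose status in an
# IdnaMappingTable.txt-style line is disallowed_STD3_valid.
def find_non_ascii_std3_valid(lines):
    hits = []
    for line in lines:
        data = line.split('#', 1)[0]
        fields = [f.strip() for f in data.split(';')]
        if len(fields) >= 2 and fields[1] == 'disallowed_STD3_valid':
            start = fields[0].split('..')[0]
            if int(start, 16) > 0x7F:  # non-ASCII only
                hits.append(fields[0])
    return hits
```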
+
+* generate core properties data files
+- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
+- rebuild ICU & tools
+- run makeuca.sh so that genuca picks up the new nfc.nrm:
+ ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
+- rebuild ICU & tools
+
+* implement new Script_Extensions property (provisional)
+- parser & generator: genprops & uprops.icu
+- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
+- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
+
+* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
+- (one-time change)
+- genbidi/gencase/genprops tools changes
+- re-run makeprops.sh (see above)
+- UCharacterProperty.java, UCharacterTypeIterator.java,
+ UBiDiProps.java, UCaseProps.java, and several others with minor changes;
+ UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
+
+* update Java data files
+- refresh just the UCD-related files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt45l
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
+ echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
+ ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
+- refresh ICU4J
+ ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* un-hardcode normalization skippable (NF*_Inert) test data
+- removes one manual step from the Unicode upgrade and removes a dependency on one of Mark's tools
+
+* copy updated break iterator test files
+- now handled by early ucdcopy.py and
+ copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
+ (old instructions:
+ copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
+ to ~/svn.icu/trunk/src/source/test/testdata)
+- they are not used in ICU4J
+
+* UCA
+
+- get output from Mark's tools; look in
+ http://www.unicode.org/~book/incoming/mark/uca6.0.0/
+ http://www.macchiato.com/unicode/utc/additional-uca-files
+ http://www.unicode.org/Public/UCA/6.0.0/
+ http://www.unicode.org/~mdavis/uca/
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+- update Han-implicit ranges for new CJK extensions:
+ swapCJK() in ucol.cpp & ImplicitCEGenerator.java
+- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
+ do not add it into invuca so that tailoring primary-after an ignorable works
+- genuca: permit space between [variable top] bytes
+- ucol.cpp: treat noncharacters like unassigned rather than ignorable
+- run makeuca.sh:
+ ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
+- rebuild ICU4C
+- refresh ICU4J collation data:
+ (subset of instructions above for properties data refresh, except copies all coll/*)
+ ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
+ ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
+ ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
+- update (ICU)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ with output from Mark's Unicode tools
+- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+*** LayoutEngine script information
+
+(For details see the Unicode 5.2 change log below.)
+
+* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
+ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
+ScriptRunData.cpp, which is no longer needed.)
+
+The generated files have a current copyright date and "@draft" statement.
+
+* copy the above files into <icu>/source/layout, replacing the old files.
+* fix mixed line endings
+* review the diffs and fix incorrect @draft and missing aliases;
+ Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
+* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
+
+---------------------------------------------------------------------------- ***
+
+Unicode 5.2 update
+
+*** related ICU Trac tickets
+
+7084 Unicode 5.2
+
+7167 verify collation bytes
+7235 Java test NAME_ALIAS
+7236 Java DerivedCoreProperties.txt test
+7237 Java BidiTest.txt
+7238 UTrie2 in core unidata
+7239 test for tailoring gaps
+7240 Java fix CollationMiscTest
+7243 update layout engine for Unicode 5.2
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- configure.in & configure
+- update ucdVersion in gennames.c if an algorithmic range changes
+
+*** data files & enums & parser code
+
+* file preparation
+
+python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
+- includes finding files regardless of version numbers,
+ copying them, and performing the equivalent processing of the
+ ucdstrip and ucdmerge tools on the desired set of files
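As a rough illustration (this is a sketch, not the actual ICU tools), the ucdstrip and ucdmerge processing can be expressed as:

```python
# Illustrative sketch only -- not the actual ICU ucdstrip/ucdmerge tools.
# ucdstrip removes comments; ucdmerge merges adjacent lines that carry
# identical property data into first..last range lines.

def ucdstrip(lines):
    """Strip '#' comments and trailing whitespace; drop empty lines."""
    for line in lines:
        data = line.split('#', 1)[0].rstrip()
        if data:
            yield data

def _format_range(first, last, data):
    if first == last:
        return "%04X;%s" % (first, data)
    return "%04X..%04X;%s" % (first, last, data)

def ucdmerge(lines):
    """Merge adjacent code points with identical data into ranges."""
    pending = None  # (first, last, data)
    for line in lines:
        code, _, data = (part.strip() for part in line.partition(';'))
        start, _, end = code.partition('..')
        first, last = int(start, 16), int(end or start, 16)
        if pending and pending[2] == data and pending[1] + 1 == first:
            pending = (pending[0], last, data)
        else:
            if pending:
                yield _format_range(*pending)
            pending = (first, last, data)
    if pending:
        yield _format_range(*pending)
```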
+
+* notes on changes
+- PropertyAliases.txt
+ moved from numeric to enumerated:
+ ccc ; Canonical_Combining_Class
+ new string properties:
+ NFKC_CF ; NFKC_Casefold
+ Name_Alias; Name_Alias
+ new binary properties:
+ Cased ; Cased
+ CI ; Case_Ignorable
+ CWCF ; Changes_When_Casefolded
+ CWCM ; Changes_When_Casemapped
+ CWKCF ; Changes_When_NFKC_Casefolded
+ CWL ; Changes_When_Lowercased
+ CWT ; Changes_When_Titlecased
+ CWU ; Changes_When_Uppercased
+ new CJK Unihan properties (not supported by ICU)
+- PropertyValueAliases.txt
+ new block names
+ new scripts
+ one script code change:
+ sc ; Qaai ; Inherited
+ ->
+ sc ; Zinh ; Inherited ; Qaai
+ new Line_Break (lb) value:
+ lb ; CP ; Close_Parenthesis
+ new Joining_Group (jg) values: Farsi_Yeh, Nya
+ other new values:
+ ccc; 214; ATA ; Attached_Above
+- DerivedBidiClass.txt
+ new default-R range: U+1E800 - U+1EFFF
+- UnicodeData.txt
+ all of the ISO comments are gone
+ new CJK block end:
+ 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
+ new CJK block:
+ 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
+ 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
+
+* genpname
+- run preparse.pl
+ + cd \svn\icuproj\icu\trunk\source\tools\genpname
+ + make sure that data.h is writable
+ + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
+ + preparse.pl complains with errors like the following:
+ Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
+ This is because ICU 4.0 had scripts from ISO 15924 which are now
+ added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
+ and PropertyValueAliases.txt.
+ -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
+ Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
+ + preparse.pl complains with errors about block names missing from uchar.h; add them
+
+* uchar.h & uscript.h & uprops.h & uprops.c & genprops
+- new block & script values
+ + 26 new blocks
+ copy new blocks from Blocks.txt
+ MS VC++ 2008 regular expression:
+ find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
+ replace with " UBLOCK_\3 = 172, /*[\1]*/"
+ + several new script values already added in ICU 4.0 for ISO 15924 coverage
+ (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
+ + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
+ + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
+ (added to SyntheticPropertyValueAliases.txt)
+- new Joining Group (JG) values: Farsi_Yeh, Nya
+- new Line_Break (lb) value:
+ lb ; CP ; Close_Parenthesis
+
+* hardcoded Unihan range end/limit
+- Unihan range end moves from 9FC3 to 9FCB
+ search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
+ + do change gennames.c
+
+* Compare definitions of new binary properties with what we used to use
+ in algorithms, to see if the definitions changed.
+- Verified that definitions for Cased and Case_Ignorable are unchanged.
+ The gencase tool now parses the newly public Case_Ignorable values
+ in case the definition changes in the future.
+
+* uchar.c & uprops.h & uprops.c & genprops
+- new numeric values that didn't exist in Unicode data before:
+ 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
+ the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
+ therefore redesign the encoding of numeric types and values for formatVersion 6;
+ design for simple numbers up to at least 144 ("one gross"),
+ large values up to at least 10^20,
+ and fractions with numerators -1..17 and denominators 1..16
+ to cover current and expected future values
+ (e.g., more Han numeric values, Meroitic twelfths)
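The design goals above can be illustrated with an encoding sketch. This is NOT the actual uprops.icu formatVersion 6 layout (the base offsets and code ranges below are invented); it only demonstrates that the stated ranges fit into small codes:

```python
from fractions import Fraction

# Purely illustrative -- NOT the real formatVersion 6 encoding.
# Simple integers 0..255, fractions with numerator -1..17 and
# denominator 1..16, and round values mantissa*10^exponent up to 10^20.
INT_BASE = 0      # codes 0..255: the integer itself
FRAC_BASE = 256   # codes 256..559: 19 numerators x 16 denominators
LARGE_BASE = 560  # codes 560..730: mantissa 1..9, exponent 2..20

def encode_numeric(value):
    if isinstance(value, Fraction) and value.denominator > 1:
        num, den = value.numerator, value.denominator
        if -1 <= num <= 17 and 1 <= den <= 16:
            return FRAC_BASE + (num + 1) * 16 + (den - 1)
    else:
        n = int(value)
        if 0 <= n <= 255:
            return INT_BASE + n
        mantissa, exponent = n, 0
        while mantissa and mantissa % 10 == 0:
            mantissa //= 10
            exponent += 1
        if 1 <= mantissa <= 9 and 2 <= exponent <= 20:
            return LARGE_BASE + (mantissa - 1) * 19 + (exponent - 2)
    raise ValueError("not encodable in this sketch: %r" % (value,))

def decode_numeric(code):
    if code < FRAC_BASE:
        return code - INT_BASE
    if code < LARGE_BASE:
        rel = code - FRAC_BASE
        return Fraction(rel // 16 - 1, rel % 16 + 1)
    rel = code - LARGE_BASE
    return (rel // 19 + 1) * 10 ** (rel % 19 + 2)
```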
+
+* reimplement Hangul_Syllable_Type for new Jamo characters
+- the old code assumed that all Jamo characters are in the 11xx block
+- Unicode 5.2 fills holes there and adds new Jamo characters in
+ A960..A97F; Hangul Jamo Extended-A
+ and in
+ D7B0..D7FF; Hangul Jamo Extended-B
+- Hangul_Syllable_Type can be trivially derived from a subset of
+ Grapheme_Cluster_Break values
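The trivial derivation can be sketched as follows: the Grapheme_Cluster_Break values L, V, T, LV, and LVT are defined in terms of Hangul_Syllable_Type, so the mapping is the identity on those five values and Not_Applicable otherwise.

```python
# Hangul_Syllable_Type derived from Grapheme_Cluster_Break: the GCB
# values L, V, T, LV, LVT correspond one-to-one to the HST values;
# every other GCB value maps to NA (Not_Applicable).
_HANGUL_GCB_VALUES = frozenset(("L", "V", "T", "LV", "LVT"))

def hangul_syllable_type_from_gcb(gcb_value):
    return gcb_value if gcb_value in _HANGUL_GCB_VALUES else "NA"
```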
+
+* build Unicode data source code for hardcoding core data
+C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
+
+ICU data make path is \svn\icuproj\icu\trunk\source\data\
+ICU root path is \svn\icuproj\icu\trunk
+Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
+Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
+Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
+Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
+Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
+Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
+Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
+Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
+Creating data file for Unicode Property Names
+Creating data file for Unicode Character Properties
+Creating data file for Unicode Case Mapping Properties
+Creating data file for Unicode BiDi/Shaping Properties
+Creating data file for Unicode Normalization
+Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
+Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
+
+- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
+ and rebuild the common library
+
+*** UCA
+
+- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
+- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
+[ Begin obsolete instructions:
+  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files, not the *_STUB.txt files.
+ - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
+ on Windows:
+ python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
+ python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
+ End obsolete instructions]
+- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
+ not just the *_STUB.txt files
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+
+*** Implement Cased & Case_Ignorable properties
+- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
+- Problem: These properties should be disjoint, but aren't
+- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
+- change ucase.icu to be able to store any combination of Cased and Case_Ignorable
+
+*** Implement Changes_When_Xyz properties
+- without stored data
+
+*** Implement Name_Alias property
+- add it as another name field in unames.icu
+- make it available via u_charName() and UCharNameChoice and
+- consider it in u_charFromName()
+
+*** Break iterators
+
+* Update break iterator rules to new UAX versions and new property values
+* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
+
+*** new BidiTest file
+- review format and data
+- copy BidiTest.txt to source/test/testdata
+- write test code using this data
+- fix ICU code where it fails the conformance test
+
+*** Java
+- generally, find and update code corresponding to C/C++
+- UCharacter.UnicodeBlock constants:
+ a) add an _ID integer per new block, update COUNT
+ b) add a class instance per new block
+ Visual Studio regex:
+ find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
+ replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
+
+- port test changes to Java
+
+*** LayoutEngine script information
+
+(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
+
+* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
+ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
+ScriptRunData.cpp, which is no longer needed.)
+
+The generated files have a current copyright date and "@draft" statement.
+
+-> Eric Mader wrote in email on 20090930:
+ "I think the tool has been modified to update @draft to @stable for
+ older scripts and to add @draft for new scripts.
+ (I worked with an intern on this last year.)
+ You should check the output after you run it."
+
+* copy the above files into <icu>/source/layout, replacing the old files.
+* fix mixed line endings
+* review the diffs and fix incorrect @draft and missing aliases
+* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
+
+Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
+and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
+
+-> Eric Mader wrote in email on 20090930:
+ "This is just a matter of making sure that all the per-script tables have
+ entries for any new scripts that were added.
+ If any new Indic characters were added, then the class tables in
+ IndicClassTables.cpp should be updated to reflect this.
+ John Emmons should know how to do this if it's required."
+
+* rebuild the layout and layoutex libraries.
+
+*** Documentation
+- Update User Guide
+ + Jamo_Short_Name, sfc->scf, binary property value aliases
+
+---------------------------------------------------------------------------- ***
+
+Unicode 5.1 update
+
+*** related ICU Trac tickets
+
+5696 Update to Unicode 5.1
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- configure.in & configure
+- update ucdVersion in gennames.c if an algorithmic range changes
+
+*** data files & enums & parser code
+
+* file preparation
+- ucdstrip:
+ DerivedCoreProperties.txt
+ DerivedNormalizationProps.txt
+ NormalizationTest.txt
+ PropList.txt
+ Scripts.txt
+ GraphemeBreakProperty.txt
+ SentenceBreakProperty.txt
+ WordBreakProperty.txt
+- ucdstrip and ucdmerge:
+ EastAsianWidth.txt
+ LineBreak.txt
+
+* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
+copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
+copy 5.1.0\ucd\Blocks.txt ..\unidata\
+copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
+copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
+copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
+copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
+copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
+copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
+copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
+copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
+copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
+copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
+copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
+
+ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
+ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
+ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
+ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
+ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
+ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
+ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
+ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
+ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
+ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
+
+* genpname
+- run preparse.pl
+ + cd \svn\icuproj\icu\uni51\source\tools\genpname
+ + make sure that data.h is writable
+ + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
+ + preparse.pl complains with errors like the following:
+ Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
+ This is because ICU 3.8 had scripts from ISO 15924 which are now
+ added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
+ and PropertyValueAliases.txt.
+ -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
+ Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
+ + PropertyValueAliases.txt now explicitly contains values for boolean properties:
+ N/Y, No/Yes, F/T, False/True
+ -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
+ It will use further values from the file if present.
+
+* uchar.h & uscript.h & uprops.h & uprops.c & genprops
+- new block & script values
+ + 17 new blocks
+ + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
+ (removed from SyntheticPropertyValueAliases.txt)
+ + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
+ (added to SyntheticPropertyValueAliases.txt)
+- uprops.icu (uprops.h) only provides 7 bits for script codes.
+  As of ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes.
+  None of the codes above 127 is yet the script code of an
+  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
+  script code values greater than 127.
+  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
+  in a parallel bit field, and that now overflows.
+  Also, future values >=128 would be incompatible anyway.
+ uprops.h is modified to move around several of the bit fields
+ in the properties vector words, and now uses 8 bits for the script code.
+ Two other bit fields also grow to accommodate future growth:
+ Block (current count: 172) grows from 8 to 9 bits,
+ and Word_Break grows from 4 to 5 bits.
+- renamed property Simple_Case_Folding (sfc->scf)
+ + nothing to be done: handled as normal alias
+- new property JSN Jamo_Short_Name
+ + no new API: only contributes to the Name property
+- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
+- new Joining Group (JG) value: Burushashki_Yeh_Barree
+- new Sentence_Break (SB) values:
+ SB ; CR ; CR
+ SB ; EX ; Extend
+ SB ; LF ; LF
+ SB ; SC ; SContinue
+- new Word_Break (WB) values:
+ WB ; CR ; CR
+ WB ; Extend ; Extend
+ WB ; LF ; LF
+ WB ; MB ; MidNumLet
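The bit-field rearrangement described above (script grows from 7 to 8 bits, Block from 8 to 9, Word_Break from 4 to 5) can be illustrated with a packing sketch. The shift offsets below are invented, not the real uprops.h layout; only the widened field sizes reflect the change:

```python
# Invented offsets -- only the field widths (8/9/5 bits) reflect the
# change described above; the real uprops.h layout differs.
SCRIPT_SHIFT, SCRIPT_BITS = 0, 8   # grew from 7 bits
BLOCK_SHIFT, BLOCK_BITS = 8, 9     # grew from 8 bits (172 blocks so far)
WB_SHIFT, WB_BITS = 17, 5          # Word_Break grew from 4 bits

def pack_props(script, block, word_break):
    assert script < (1 << SCRIPT_BITS)
    assert block < (1 << BLOCK_BITS)
    assert word_break < (1 << WB_BITS)
    return (script << SCRIPT_SHIFT) | (block << BLOCK_SHIFT) | (word_break << WB_SHIFT)

def unpack_props(word):
    def field(shift, bits):
        return (word >> shift) & ((1 << bits) - 1)
    return (field(SCRIPT_SHIFT, SCRIPT_BITS),
            field(BLOCK_SHIFT, BLOCK_BITS),
            field(WB_SHIFT, WB_BITS))
```

With 8 bits the maximum script value 129 (USCRIPT_CODE_LIMIT-1) now fits, which is exactly what overflowed the old 7-bit field.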
+
+* Further changes in the 2008-02-29 update:
+- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
+ because they should not normally be invisible.
+- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
+- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
+- new Word_Break (WB) value: NL=Newline
+
+* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
+- Unihan range end moves from 9FBB to 9FC3
+ search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
+ + do change gennames.c
+
+* build Unicode data source code for hardcoding core data
+C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
+
+ICU data make path is \svn\icuproj\icu\uni51\source\data\
+ICU root path is \svn\icuproj\icu\uni51
+Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
+Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
+Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
+Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
+Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
+Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
+Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
+Creating data file for Unicode Character Properties
+Creating data file for Unicode Case Mapping Properties
+Creating data file for Unicode BiDi/Shaping Properties
+Creating data file for Unicode Normalization
+Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
+Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
+
+- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
+ and rebuild the common library
+
+*** Break iterators
+
+* Update break iterator rules to new UAX versions and new property values
+
+*** UCA
+
+* update FractionalUCA.txt and UCARules.txt with new canonical closure
+
+*** Test suites
+- Test that APIs using Unicode property value aliases (like UnicodeSet)
+ support all of the boolean values N/Y, No/Yes, F/T, False/True
+ -> TestBinaryValues() tests in both cintltst and intltest
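The accepted spellings can be sketched with UCD-style loose matching (case, whitespace, hyphens, and underscores are ignored when matching property value aliases):

```python
# Sketch of binary property value alias matching: all of N/Y, No/Yes,
# F/T, False/True must be recognized, using UCD loose matching
# (ignore case, whitespace, hyphens, underscores).
def _loose(s):
    return "".join(c for c in s if c not in " \t-_").lower()

_TRUE = frozenset(map(_loose, ("Y", "Yes", "T", "True")))
_FALSE = frozenset(map(_loose, ("N", "No", "F", "False")))

def parse_binary_value(s):
    key = _loose(s)
    if key in _TRUE:
        return True
    if key in _FALSE:
        return False
    raise ValueError("not a binary property value alias: %r" % (s,))
```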
+
+*** LayoutEngine script information
+* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
+ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
+ScriptRunData.cpp, which is no longer needed.)
+
+The generated files have a current copyright date and "@draft" statement.
+
+* copy the above files into <icu>/source/layout, replacing the old files.
+
+Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
+and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
+
+* rebuild the layout and layoutex libraries.
+
+*** Documentation
+- Update User Guide
+ + Jamo_Short_Name, sfc->scf, binary property value aliases
+
+---------------------------------------------------------------------------- ***
+
+Unicode 5.0 update
+
+*** related Jitterbugs
+
+5084 RFE: Update to Unicode 5.0
+
+*** data files & enums & parser code
+
+* file preparation
+- ucdstrip:
+ DerivedCoreProperties.txt
+ DerivedNormalizationProps.txt
+ NormalizationTest.txt
+ PropList.txt
+ Scripts.txt
+ GraphemeBreakProperty.txt
+ SentenceBreakProperty.txt
+ WordBreakProperty.txt
+- ucdstrip and ucdmerge:
+ EastAsianWidth.txt
+ LineBreak.txt
+
+* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
+copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
+copy 5.0.0\ucd\Blocks.txt ..\unidata\
+copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
+copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
+copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
+copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
+copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
+copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
+copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
+copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
+copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
+copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
+copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
+
+ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
+ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
+ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
+ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
+ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
+ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
+ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
+ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
+ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
+ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
+
+* update FractionalUCA.txt and UCARules.txt with new canonical closure
+
+* genpname
+- run preparse.pl
+ + make sure that data.h is writable
+ + perl preparse.pl \cvs\oss\icu > out.txt
+
+* uchar.h & uscript.h & uprops.h & uprops.c & genprops
+- new block & script values
+ + script values already added in ICU 3.6 because all of ISO 15924 is now covered
+
+* build Unicode data source code for hardcoding core data
+C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
+
+ICU data make path is \cvs\oss\icu\source\data\
+ICU root path is \cvs\oss\icu
+Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
+[etc.]
+Creating data file for Unicode Character Properties
+Creating data file for Unicode Case Mapping Properties
+Creating data file for Unicode BiDi/Shaping Properties
+Creating data file for Unicode Normalization
+Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
+Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
+
+- copy the .c source files to C:\cvs\oss\icu\source\common
+ and rebuild the common library
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- configure.in
+
+*** LayoutEngine script information
+* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
+ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
+ScriptRunData.cpp, which is no longer needed.)
+
+The generated files have a current copyright date and "@draft" statement.
+
+* copy the above files into <icu>/source/layout, replacing the old files.
+
+Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
+and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
+
+* rebuild the layout and layoutex libraries.
+
+---------------------------------------------------------------------------- ***
+
+Unicode 4.1 update
+
+*** related Jitterbugs
+
+4332 RFE: Update to Unicode 4.1
+4157 RBBI, TR29 4.1 updates
+
+*** data files & enums & parser code
+
+* file preparation
+- ucdstrip:
+ DerivedCoreProperties.txt
+ DerivedNormalizationProps.txt
+ NormalizationTest.txt
+ GraphemeBreakProperty.txt
+ SentenceBreakProperty.txt
+ WordBreakProperty.txt
+- ucdstrip and ucdmerge:
+ EastAsianWidth.txt
+ LineBreak.txt
+
+* add new files to the repository
+ GraphemeBreakProperty.txt
+ SentenceBreakProperty.txt
+ WordBreakProperty.txt
+
+* update FractionalUCA.txt and UCARules.txt with new canonical closure
+
+* genpname
+- handle new enumerated properties in sub read_uchar
+- run preparse.pl
+
+* uchar.h & uscript.h & uprops.h & uprops.c & genprops
+- new binary properties
+ + Pattern_Syntax
+ + Pattern_White_Space
+- new enumerated properties
+ + Grapheme_Cluster_Break
+ + Sentence_Break
+ + Word_Break
+- new block & script & line break values
+
+* gencase
+- case-ignorable changes
+ see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
+  now (D47a): Word_Break=MidLetter, or General_Category Mn, Me, Cf, Lm, Sk
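
The D47a definition above can be approximated with a small predicate. Note that Python's
unicodedata module has no Word_Break property and reflects a much newer Unicode version than
4.1, so the MidLetter set below is a tiny hand-listed sample and the whole function is only
an illustration, not the gencase logic:

```python
import unicodedata

# A few sample Word_Break=MidLetter characters (apostrophe, middle dot,
# Hebrew gershayim, hyphenation point, colon); NOT the complete 4.1 class.
SAMPLE_MIDLETTER = {"'", "\u00B7", "\u05F4", "\u2027", ":"}

def is_case_ignorable_d47a(ch):
    """Approximate Unicode 4.1 D47a: a character is case-ignorable if
    Word_Break=MidLetter, or its General_Category is Mn, Me, Cf, Lm, Sk."""
    return ch in SAMPLE_MIDLETTER or \
        unicodedata.category(ch) in {"Mn", "Me", "Cf", "Lm", "Sk"}
```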
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- configure.in
+
+*** tests
+- verify that u_charMirror() round-trips
+- test all new properties and some new values of old properties
+
+*** other code
+
+* hardcoded Unihan range end/limit
+- Unihan range end moves from 9FA5 to 9FBB
+ search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
+ + do not modify BOCU/BOCSU code because that would change the encoding
+ and break binary compatibility!
+ + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
+ NamePrepProfile.txt
+ + ignore trietest.c: test data is arbitrary
+ + ignore tstnorm.cpp: test optimization, not important
+ + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
+ + do change line_th.txt and word_th.txt
+ by replacing hardcoded ranges with the new property values
+ + do change gennames.c
+
+source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
+source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
+source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
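
The search described above can be scripted; this hypothetical helper (the function name is
ours) applies the same case-insensitive 9FA[56] regex, which is why it also catches the
lowercase "0x9fa5" in gennames.c:

```python
import re

# Regex from the note above: old Unihan range end (9FA5) and limit (9FA6).
UNIHAN_END = re.compile(r"9FA[56]", re.IGNORECASE)

def scan_for_old_unihan_end(lines):
    """Hypothetical helper: yield (line_number, text) for lines that
    still mention the pre-Unicode-4.1 Unihan range end/limit."""
    for number, text in enumerate(lines, start=1):
        if UNIHAN_END.search(text):
            yield number, text.rstrip()
```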
+
+* case mappings
+- compare new special casing context conditions with previous ones
+ see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
+
+* genpname
+- consider storing only the short name if it is the same as the long name
+
+*** other reviews
+- UAX #29 changes (grapheme/word/sentence breaks)
+- UAX #14 changes (line breaks)
+- Pattern_Syntax & Pattern_White_Space
+
+---------------------------------------------------------------------------- ***
+
+Unicode 4.0.1 update
+
+*** related Jitterbugs
+
+3170 RFE: Update to Unicode 4.0.1
+3171 Add new Unicode 4.0.1 properties
+3520 use Unicode 4.0.1 updates for break iteration
+
+*** data files & enums & parser code
+
+* file preparation
+- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
+- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
+
+* file fixes
+- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
+ according to PRI #26
+ http://www.unicode.org/review/resolved-pri.html#pri26
+- undone again because no corrigendum is in sight;
+  instead, modified the tests not to check this consistency for Unicode 4.0.1
+
+* ucdterms.txt
+- update from http://www.unicode.org/copyright.html
+ formatted for plain text
+
+* uchar.h & uprops.h & uprops.c & genprops
+- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
+- add U_LB_INSEPARABLE due to a spelling fix
+ + put short name comment only on line with new constant
+ for genpname perl script parser
+- new binary properties
+ + STerm
+ + Variation_Selector
+
+* genpname
+- fix genpname perl script so that it doesn't choke on more than 2 names per property value
+- perl script: correctly calculate the maximum number of fields per row
+
+* uscript.h
+- new script code Hrkt=Katakana_Or_Hiragana
+
+* gennorm.c track changes in DerivedNormalizationProps.txt
+- "FNC" -> "FC_NFKC"
+- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
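
The two field layouts can be distinguished when parsing. This is a sketch, not gennorm.c's
actual code: the helper name and the normalization of the old values to single letters are
our choice; only the field shapes come from the note above.

```python
def parse_quick_check(fields):
    """Handle both DerivedNormalizationProps.txt styles noted above:
    the old single-field form ("NFD_NO", "NFKC_MAYBE") and the new
    two-field form ("NFD_QC; N").  Returns (form, value), e.g.
    ("NFD", "N")."""
    fields = [f.strip() for f in fields]
    if len(fields) == 1 and fields[0].endswith(("_NO", "_MAYBE")):
        form, _, value = fields[0].rpartition("_")
        return form, {"NO": "N", "MAYBE": "M"}[value]
    if len(fields) == 2 and fields[0].endswith("_QC"):
        return fields[0][:-3], fields[1]
    raise ValueError("unexpected quick check fields: %r" % (fields,))
```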
+
+* genprops/props2.c track changes in DerivedNumericValues.txt
+- changed from 3 columns to 2, dropping the numeric type
+ + assume that the type is always numeric for Han characters,
+ and that only those are added in addition to what UnicodeData.txt lists
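
A parser tracking this column change might simply branch on the field count. In this sketch
the old field order (code; type; value) is our assumption; per the note above, the dropped
type defaults to numeric in the new two-column format:

```python
def parse_numeric_value(fields):
    """Sketch for the DerivedNumericValues.txt change described above:
    old lines carried three fields, new lines only two (code; value).
    Not genprops/props2.c's actual code."""
    fields = [f.strip() for f in fields]
    code = fields[0]
    if len(fields) == 3:    # old format: explicit numeric type (assumed order)
        numeric_type, value = fields[1], fields[2]
    elif len(fields) == 2:  # new format: type column dropped
        numeric_type, value = "numeric", fields[1]
    else:
        raise ValueError("unexpected field count: %r" % (fields,))
    return code, numeric_type, float(value)
```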
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- configure.in
+
+*** tests
+- update test of default bidi classes according to PRI #28
+ /tsutil/cucdtst/TestUnicodeData
+ http://www.unicode.org/review/resolved-pri.html#pri28
+- bidi tests: change exemplar character for ES depending on Unicode version
+- change hardcoded expected property values where they change
+
+*** other code
+
+* name matching
+- read UCD.html
+
+* scripts
+- use new Hrkt=Katakana_Or_Hiragana
+
+* ZWJ & ZWNJ
+- are now part of combining character sequences
+- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ