1 files changed, 267 insertions, 0 deletions
diff --git a/libraries/liblunicode/ucdata/format.txt b/libraries/liblunicode/ucdata/format.txt
new file mode 100644
index 0000000..e285b39
--- /dev/null
+++ b/libraries/liblunicode/ucdata/format.txt
@@ -0,0 +1,267 @@
+#
+# $Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp $
+#
+
+CHARACTER DATA
+==============
+
+This package generates some data files that contain character properties useful
+for text processing.
+
+CHARACTER PROPERTIES
+====================
+
+The first data file is called "ctype.dat" and contains a compressed form of
+the character properties found in the Unicode Character Database (UCDB).
+Additional properties can be specified in limited UCDB format in another file
+to avoid modifying the original UCDB.
+
+The following is a property name and code table to be used with the character
+data:
+
+NAME CODE DESCRIPTION
+---------------------
+Mn   0    Mark, Non-Spacing
+Mc   1    Mark, Spacing Combining
+Me   2    Mark, Enclosing
+Nd   3    Number, Decimal Digit
+Nl   4    Number, Letter
+No   5    Number, Other
+Zs   6    Separator, Space
+Zl   7    Separator, Line
+Zp   8    Separator, Paragraph
+Cc   9    Other, Control
+Cf   10   Other, Format
+Cs   11   Other, Surrogate
+Co   12   Other, Private Use
+Cn   13   Other, Not Assigned
+Lu   14   Letter, Uppercase
+Ll   15   Letter, Lowercase
+Lt   16   Letter, Titlecase
+Lm   17   Letter, Modifier
+Lo   18   Letter, Other
+Pc   19   Punctuation, Connector
+Pd   20   Punctuation, Dash
+Ps   21   Punctuation, Open
+Pe   22   Punctuation, Close
+Po   23   Punctuation, Other
+Sm   24   Symbol, Math
+Sc   25   Symbol, Currency
+Sk   26   Symbol, Modifier
+So   27   Symbol, Other
+L    28   Left-To-Right
+R    29   Right-To-Left
+EN   30   European Number
+ES   31   European Number Separator
+ET   32   European Number Terminator
+AN   33   Arabic Number
+CS   34   Common Number Separator
+B    35   Block Separator
+S    36   Segment Separator
+WS   37   Whitespace
+ON   38   Other Neutrals
+Pi   47   Punctuation, Initial
+Pf   48   Punctuation, Final
+#
+# Implementation specific properties.
+#
+Cm   39   Composite
+Nb   40   Non-Breaking
+Sy   41   Symmetric (characters which are part of open/close pairs)
+Hd   42   Hex Digit
+Qm   43   Quote Mark
+Mr   44   Mirroring
+Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
+Cp   46   Defined character
+
+The actual binary data is formatted as follows:
+
+  Assumptions: unsigned short is at least 16 bits in size and unsigned long
+               is at least 32 bits in size.
+
+    unsigned short ByteOrderMark
+    unsigned short OffsetArraySize
+    unsigned long  Bytes
+    unsigned short Offsets[OffsetArraySize + 1]
+    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
+
+  The Bytes field provides the total byte count used for the Offsets[] and
+  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
+  there is always one extra node on the end to hold the final index of the
+  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
+  representing a range of Unicode characters.  The pairs are arranged in
+  increasing order by the first character code in the range.
+
+  Determining whether a character has a particular property requires a simple
+  binary search of the ranges listed for that property.
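+
+  The search described above can be sketched as follows.  This is only an
+  illustration: the function name in_ranges() is hypothetical, and it assumes
+  that Offsets[p] and Offsets[p + 1] give the first and one-past-last indexes
+  into Ranges[] for property code p, which the text above implies but does
+  not state outright.
+
```c
#include <stddef.h>

/*
 * Returns 1 if `code` lies inside one of the inclusive [start, end] pairs
 * stored in ranges[lo .. hi - 1], and 0 otherwise.  `lo` and `hi` are even
 * indexes into ranges[] (assumed here to come from Offsets[p] and
 * Offsets[p + 1]).  The pairs must be sorted by their first value.
 */
int
in_ranges(const unsigned long *ranges, size_t lo, size_t hi,
          unsigned long code)
{
    size_t l = lo / 2, r = hi / 2;      /* pair indexes, half-open [l, r) */

    while (l < r) {
        size_t m = l + (r - l) / 2;

        if (code < ranges[2 * m])
            r = m;                      /* below this range: go left */
        else if (code > ranges[2 * m + 1])
            l = m + 1;                  /* above this range: go right */
        else
            return 1;                   /* start <= code <= end */
    }
    return 0;
}
```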
+
+  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
+  machine with a different endian order and the values must be byte-swapped.
+
+  To swap a 16-bit value:
+     c = (c >> 8) | ((c & 0xff) << 8)
+
+  To swap a 32-bit value:
+     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
+         (((c >> 16) & 0xff) << 8) | (c >> 24)
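+
+  As a small sketch, the two swaps can be wrapped in functions.  The names
+  needs_swap(), swap16() and swap32() are hypothetical.  The extra masks are
+  an addition to the expressions above and only guard against platforms where
+  unsigned short or unsigned long is wider than 16 or 32 bits.
+
```c
/* Returns 1 if the ByteOrderMark read from a file calls for byte-swapping. */
int
needs_swap(unsigned short bom)
{
    return bom == 0xFFFE;
}

/* Swaps the two low-order bytes of a 16-bit value. */
unsigned short
swap16(unsigned short c)
{
    return (unsigned short) (((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Reverses the four low-order bytes of a 32-bit value. */
unsigned long
swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}
```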
+
+CASE MAPPINGS
+=============
+
+The next data file is called "case.dat" and contains three case mapping tables
+in the following order: upper, lower, and title case.  Each table is in
+increasing order by character code and each mapping contains 3 unsigned longs
+which represent the possible mappings.
+
+The format for the binary form of these tables is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumMappingNodes, count of all mapping nodes
+  unsigned short CaseTableSizes[2], upper and lower mapping node counts
+  unsigned long  CaseTables[NumMappingNodes * 3]
+
+  The starting indexes of the case tables are calculated as follows:
+
+    UpperIndex = 0;
+    LowerIndex = CaseTableSizes[0] * 3;
+    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
+
+  The order of the fields for the three tables is:
+
+    Upper case
+    ----------
+    unsigned long upper;
+    unsigned long lower;
+    unsigned long title;
+
+    Lower case
+    ----------
+    unsigned long lower;
+    unsigned long upper;
+    unsigned long title;
+
+    Title case
+    ----------
+    unsigned long title;
+    unsigned long upper;
+    unsigned long lower;
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  Because each table is in increasing order by character code, locating a
+  mapping requires a simple binary search on the first of the 3 codes that
+  make up each node.
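+
+  A minimal sketch of such a lookup, assuming CaseTables[] has already been
+  loaded and byte-swapped if necessary.  The names case_lookup() and
+  to_lower() are hypothetical; `first` is one of the starting indexes
+  computed above and `count` is the number of nodes in that table.
+
```c
#include <stddef.h>

/*
 * Searches `count` nodes of 3 unsigned longs, starting at table[first], for
 * the node whose first field equals `code`.  Returns a pointer to the node,
 * or NULL if the character has no mapping in that table.
 */
const unsigned long *
case_lookup(const unsigned long *table, size_t first, size_t count,
            unsigned long code)
{
    size_t l = 0, r = count;            /* node indexes, half-open [l, r) */

    while (l < r) {
        size_t m = l + (r - l) / 2;
        const unsigned long *node = table + first + m * 3;

        if (code < node[0])
            r = m;
        else if (code > node[0])
            l = m + 1;
        else
            return node;
    }
    return NULL;
}

/*
 * Maps an upper case character to lower case.  The upper case table starts
 * at index 0 and its fields are upper, lower, title, so the lower case form
 * is field 1.  Characters without a mapping are returned unchanged.
 */
unsigned long
to_lower(const unsigned long *table, const unsigned short sizes[2],
         unsigned long code)
{
    const unsigned long *node = case_lookup(table, 0, sizes[0], code);

    return node ? node[1] : code;
}
```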
+
+  It is important to note that NumMappingNodes is normally a 16-bit count, so
+  the three tables together can hold at most 65535 mapping nodes, an average
+  of 21845 nodes per table.  An individual table may hold more or fewer than
+  21845 nodes, as long as the total does not exceed 65535.
+
+COMPOSITIONS
+============
+
+The next data file is called "comp.dat" and contains data that tracks character
+pairs that have a single Unicode value representing the combination of the two
+characters.
+
+The format for the binary form of this table is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumCompositionNodes, count of composition nodes
+  unsigned long  Bytes, total number of bytes used for composition nodes
+  unsigned long  CompositionNodes[NumCompositionNodes * 4]
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  The CompositionNodes[] array consists of groups of 4 unsigned longs.  The
+  first of these is the character code representing the combination of two
+  other character codes, the second records the number of character codes that
+  make up the composition (not currently used), and the last two are the pair
+  of character codes whose combination is represented by the character code in
+  the first field.
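+
+  A sketch of how a pair of characters might be composed from this array.
+  The function name compose() is hypothetical.  Since nothing above says how
+  the nodes are ordered, the sketch uses a plain linear scan and does not
+  rely on any ordering.
+
```c
#include <stddef.h>

/*
 * Looks for a node whose last two fields are `first` and `second`.  On
 * success stores the composite character code (field 0) in *composite and
 * returns 1; returns 0 if the pair has no composition.
 */
int
compose(const unsigned long *nodes, size_t num_nodes,
        unsigned long first, unsigned long second,
        unsigned long *composite)
{
    size_t i;

    for (i = 0; i < num_nodes; i++) {
        const unsigned long *n = nodes + i * 4;

        /* n[1] is the (currently unused) count of character codes. */
        if (n[2] == first && n[3] == second) {
            *composite = n[0];
            return 1;
        }
    }
    return 0;
}
```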
+
+DECOMPOSITIONS
+==============
+
+The next data file is called "decomp.dat" and contains the decomposition data
+for all characters whose decompositions contain more than one character and
+are *not* compatibility decompositions.  Compatibility decompositions are
+signaled in the UCDB format by the use of the <compat> tag in the
+decomposition field.  Each list of character codes represents a full
+decomposition of a composite character.  The nodes are arranged in increasing
+order by character code.
+
+The format for the binary form of this table is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumDecompNodes, count of all decomposition nodes
+  unsigned long  Bytes
+  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
+  unsigned long  Decomp[N], N = total length of all decomposition lists
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  The DecompNodes[] array consists of pairs of unsigned longs, the first of
+  which is the character code and the second is the initial index of the list
+  of character codes representing the decomposition.
+
+  Locating the decomposition of a composite character requires a binary search
+  for its character code in the DecompNodes[] array; the index stored with the
+  code locates the start of the decomposition in Decomp[].  The length of the
+  decomposition list is the index stored in the following node of
+  DecompNodes[] minus the current index.
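+
+  The lookup can be sketched as below.  The function name decomp_lookup() is
+  hypothetical.  The sketch also assumes that the one extra element at the
+  end of DecompNodes[] holds the end index for the last node; the layout
+  above suggests this but does not say so explicitly.
+
```c
#include <stddef.h>

/*
 * Binary search for `code` among `num_nodes` (code, index) pairs.  On
 * success stores the start of the decomposition in *list and its length in
 * *len and returns 1; returns 0 if `code` has no decomposition.
 */
int
decomp_lookup(const unsigned long *nodes, size_t num_nodes,
              const unsigned long *decomp, unsigned long code,
              const unsigned long **list, size_t *len)
{
    size_t l = 0, r = num_nodes;        /* pair indexes, half-open [l, r) */

    while (l < r) {
        size_t m = l + (r - l) / 2;
        unsigned long c = nodes[2 * m];

        if (code < c) {
            r = m;
        } else if (code > c) {
            l = m + 1;
        } else {
            unsigned long start = nodes[2 * m + 1];
            /* Assumption: nodes[2 * num_nodes] is the trailing end index. */
            unsigned long end = (m + 1 < num_nodes)
                ? nodes[2 * (m + 1) + 1]
                : nodes[2 * num_nodes];

            *list = decomp + start;
            *len = (size_t) (end - start);
            return 1;
        }
    }
    return 0;
}
```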
+
+COMBINING CLASSES
+=================
+
+The next data file is called "cmbcl.dat" and contains the characters with
+non-zero combining classes.
+
+The format for the binary form of this table is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumCCLNodes
+  unsigned long  Bytes
+  unsigned long  CCLNodes[NumCCLNodes * 3]
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  The CCLNodes[] array consists of groups of three unsigned longs.  The first
+  and second are the beginning and ending of a range and the third is the
+  combining class of that range.
+
+  If a character is not found in this table, then the combining class is
+  assumed to be 0.
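+
+  A sketch of a combining class lookup over CCLNodes[].  The function name
+  combining_class() is hypothetical, and the binary search assumes the ranges
+  are sorted by their first value and do not overlap, which is not stated
+  above.  If that assumption does not hold, a linear scan would be needed.
+
```c
#include <stddef.h>

/*
 * Returns the combining class of `code`, or 0 if `code` does not fall in
 * any of the `num_nodes` (start, end, class) triples.
 */
unsigned long
combining_class(const unsigned long *nodes, size_t num_nodes,
                unsigned long code)
{
    size_t l = 0, r = num_nodes;        /* triple indexes, half-open [l, r) */

    while (l < r) {
        size_t m = l + (r - l) / 2;
        const unsigned long *n = nodes + m * 3;

        if (code < n[0])
            r = m;
        else if (code > n[1])
            l = m + 1;
        else
            return n[2];
    }
    return 0;                           /* not found: class 0 */
}
```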
+
+  It is important to note that at most 65535 ranges, each with its combining
+  class, can be specified because NumCCLNodes is usually a 16-bit number.
+
+NUMBER TABLE
+============
+
+The final data file is called "num.dat" and contains the characters that have
+a numeric value associated with them.
+
+The format for the binary form of the table is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumNumberNodes
+  unsigned long  Bytes
+  unsigned long  NumberNodes[NumNumberNodes]
+  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
+                            / sizeof(short)]
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  The NumberNodes array contains pairs of values, the first of which is the
+  character code and the second an index into the ValueNodes array.  The
+  ValueNodes array contains pairs of integers which represent the numerator
+  and denominator of the numeric value of the character.  If the character's
+  value is an integer, the numerator and denominator stored for it will be
+  the same.
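+
+  A sketch of a numeric value lookup.  The names struct number and
+  number_lookup() are hypothetical.  The sketch assumes that NumberNodes[] is
+  sorted by character code, that NumNumberNodes counts array elements (two
+  per character), and that the stored index counts elements of ValueNodes[]
+  rather than pairs.  None of these details are spelled out above.
+
```c
#include <stddef.h>

struct number {
    unsigned short numerator;
    unsigned short denominator;
};

/*
 * Binary search for `code` among the (code, index) pairs in nodes[].  On
 * success copies the numerator and denominator into *out and returns 1;
 * returns 0 if the character has no numeric value.
 */
int
number_lookup(const unsigned long *nodes, size_t num_elements,
              const unsigned short *values, unsigned long code,
              struct number *out)
{
    size_t l = 0, r = num_elements / 2; /* pair indexes, half-open [l, r) */

    while (l < r) {
        size_t m = l + (r - l) / 2;
        unsigned long c = nodes[2 * m];

        if (code < c) {
            r = m;
        } else if (code > c) {
            l = m + 1;
        } else {
            const unsigned short *vp = values + nodes[2 * m + 1];

            out->numerator = vp[0];
            out->denominator = vp[1];
            return 1;
        }
    }
    return 0;
}
```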