summaryrefslogtreecommitdiffstats
path: root/libraries/liblunicode/ucdata/format.txt
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--libraries/liblunicode/ucdata/format.txt267
1 files changed, 267 insertions, 0 deletions
diff --git a/libraries/liblunicode/ucdata/format.txt b/libraries/liblunicode/ucdata/format.txt
new file mode 100644
index 0000000..e285b39
--- /dev/null
+++ b/libraries/liblunicode/ucdata/format.txt
@@ -0,0 +1,267 @@
+#
+# $Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp $
+#
+
+CHARACTER DATA
+==============
+
+This package generates some data files that contain character properties useful
+for text processing.
+
+CHARACTER PROPERTIES
+====================
+
+The first data file is called "ctype.dat" and contains a compressed form of
+the character properties found in the Unicode Character Database (UCDB).
+Additional properties can be specified in limited UCDB format in another file
+to avoid modifying the original UCDB.
+
+The following is a property name and code table to be used with the character
+data:
+
+NAME CODE DESCRIPTION
+---------------------
+Mn 0 Mark, Non-Spacing
+Mc 1 Mark, Spacing Combining
+Me 2 Mark, Enclosing
+Nd 3 Number, Decimal Digit
+Nl 4 Number, Letter
+No 5 Number, Other
+Zs 6 Separator, Space
+Zl 7 Separator, Line
+Zp 8 Separator, Paragraph
+Cc 9 Other, Control
+Cf 10 Other, Format
+Cs 11 Other, Surrogate
+Co 12 Other, Private Use
+Cn 13 Other, Not Assigned
+Lu 14 Letter, Uppercase
+Ll 15 Letter, Lowercase
+Lt 16 Letter, Titlecase
+Lm 17 Letter, Modifier
+Lo 18 Letter, Other
+Pc 19 Punctuation, Connector
+Pd 20 Punctuation, Dash
+Ps 21 Punctuation, Open
+Pe 22 Punctuation, Close
+Po 23 Punctuation, Other
+Sm 24 Symbol, Math
+Sc 25 Symbol, Currency
+Sk 26 Symbol, Modifier
+So 27 Symbol, Other
+L 28 Left-To-Right
+R 29 Right-To-Left
+EN 30 European Number
+ES 31 European Number Separator
+ET 32 European Number Terminator
+AN 33 Arabic Number
+CS 34 Common Number Separator
+B 35 Block Separator
+S 36 Segment Separator
+WS 37 Whitespace
+ON 38 Other Neutrals
+Pi 47 Punctuation, Initial
+Pf 48 Punctuation, Final
+#
+# Implementation specific properties.
+#
+Cm 39 Composite
+Nb 40 Non-Breaking
+Sy 41 Symmetric (characters which are part of open/close pairs)
+Hd 42 Hex Digit
+Qm 43 Quote Mark
+Mr 44 Mirroring
+Ss 45 Space, Other (controls viewed as spaces in ctype isspace())
+Cp 46 Defined character
+
+The actual binary data is formatted as follows:
+
+ Assumptions: unsigned short is at least 16-bits in size and unsigned long
+ is at least 32-bits in size.
+
+ unsigned short ByteOrderMark
+ unsigned short OffsetArraySize
+ unsigned long Bytes
+ unsigned short Offsets[OffsetArraySize + 1]
+ unsigned long Ranges[N], N = value of Offsets[OffsetArraySize]
+
+ The Bytes field provides the total byte count used for the Offsets[] and
+ Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
+ there is always one extra node on the end to hold the final index of the
+ Ranges[] array. The Ranges[] array contains pairs of 4-byte values
+ representing a range of Unicode characters. The pairs are arranged in
+ increasing order by the first character code in the range.
+
+ Determining if a particular character is in the property list requires a
+ simple binary search to determine if a character is in any of the ranges
+ for the property.
+
+ If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
+ machine with a different endian order and the values must be byte-swapped.
+
+ To swap a 16-bit value:
+ c = (c >> 8) | ((c & 0xff) << 8)
+
+ To swap a 32-bit value:
+ c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
+ (((c >> 16) & 0xff) << 8) | (c >> 24)
+
+CASE MAPPINGS
+=============
+
+The next data file is called "case.dat" and contains three case mapping tables
+in the following order: upper, lower, and title case. Each table is in
+increasing order by character code and each mapping contains 3 unsigned longs
+which represent the possible mappings.
+
+The format for the binary form of these tables is:
+
+ unsigned short ByteOrderMark
+ unsigned short NumMappingNodes, count of all mapping nodes
+ unsigned short CaseTableSizes[2], upper and lower mapping node counts
+ unsigned long CaseTables[NumMappingNodes]
+
+ The starting indexes of the case tables are calculated as following:
+
+ UpperIndex = 0;
+ LowerIndex = CaseTableSizes[0] * 3;
+ TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
+
+ The order of the fields for the three tables are:
+
+ Upper case
+ ----------
+ unsigned long upper;
+ unsigned long lower;
+ unsigned long title;
+
+ Lower case
+ ----------
+ unsigned long lower;
+ unsigned long upper;
+ unsigned long title;
+
+ Title case
+ ----------
+ unsigned long title;
+ unsigned long upper;
+ unsigned long lower;
+
+ If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+ same way as described in the CHARACTER PROPERTIES section.
+
+ Because the tables are in increasing order by character code, locating a
+ mapping requires a simple binary search on one of the 3 codes that make up
+ each node.
+
+ It is important to note that there can only be 65536 mapping nodes which
+ divided into 3 portions allows 21845 nodes for each case mapping table. The
+ distribution of mappings may be more or less than 21845 per table, but only
+ 65536 are allowed.
+
+COMPOSITIONS
+============
+
+This data file is called "comp.dat" and contains data that tracks character
+pairs that have a single Unicode value representing the combination of the two
+characters.
+
+The format for the binary form of this table is:
+
+ unsigned short ByteOrderMark
+ unsigned short NumCompositionNodes, count of composition nodes
+ unsigned long Bytes, total number of bytes used for composition nodes
+ unsigned long CompositionNodes[NumCompositionNodes * 4]
+
+ If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+ same way as described in the CHARACTER PROPERTIES section.
+
+ The CompositionNodes[] array consists of groups of 4 unsigned longs. The
+ first of these is the character code representing the combination of two
+ other character codes, the second records the number of character codes that
+ make up the composition (not currently used), and the last two are the pair
+ of character codes whose combination is represented by the character code in
+ the first field.
+
+DECOMPOSITIONS
+==============
+
+The next data file is called "decomp.dat" and contains the decomposition data
+for all characters with decompositions containing more than one character and
+are *not* compatibility decompositions. Compatibility decompositions are
+signaled in the UCDB format by the use of the <compat> tag in the
+decomposition field. Each list of character codes represents a full
+decomposition of a composite character. The nodes are arranged in increasing
+order by character code.
+
+The format for the binary form of this table is:
+
+ unsigned short ByteOrderMark
+ unsigned short NumDecompNodes, count of all decomposition nodes
+ unsigned long Bytes
+ unsigned long DecompNodes[(NumDecompNodes * 2) + 1]
+ unsigned long Decomp[N], N = sum of all counts in DecompNodes[]
+
+ If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+ same way as described in the CHARACTER PROPERTIES section.
+
+ The DecompNodes[] array consists of pairs of unsigned longs, the first of
+ which is the character code and the second is the initial index of the list
+ of character codes representing the decomposition.
+
+ Locating the decomposition of a composite character requires a binary search
+ for a character code in the DecompNodes[] array and using its index to
+ locate the start of the decomposition. The length of the decomposition list
+ is the index in the following element in DecompNode[] minus the current
+ index.
+
+COMBINING CLASSES
+=================
+
+The fourth data file is called "cmbcl.dat" and contains the characters with
+non-zero combining classes.
+
+The format for the binary form of this table is:
+
+ unsigned short ByteOrderMark
+ unsigned short NumCCLNodes
+ unsigned long Bytes
+ unsigned long CCLNodes[NumCCLNodes * 3]
+
+ If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+ same way as described in the CHARACTER PROPERTIES section.
+
+ The CCLNodes[] array consists of groups of three unsigned longs. The first
+ and second are the beginning and ending of a range and the third is the
+ combining class of that range.
+
+ If a character is not found in this table, then the combining class is
+ assumed to be 0.
+
+ It is important to note that only 65536 distinct ranges plus combining class
+ can be specified because the NumCCLNodes is usually a 16-bit number.
+
+NUMBER TABLE
+============
+
+The final data file is called "num.dat" and contains the characters that have
+a numeric value associated with them.
+
+The format for the binary form of the table is:
+
+ unsigned short ByteOrderMark
+ unsigned short NumNumberNodes
+ unsigned long Bytes
+ unsigned long NumberNodes[NumNumberNodes]
+ unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
+ / sizeof(short)]
+
+ If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+ same way as described in the CHARACTER PROPERTIES section.
+
+ The NumberNodes array contains pairs of values, the first of which is the
+ character code and the second an index into the ValueNodes array. The
+ ValueNodes array contains pairs of integers which represent the numerator
+ and denominator of the numeric value of the character. If the character
+ happens to map to an integer, both the values in ValueNodes will be the
+ same.