From ae5d181b854d3ccb373b6bc01b4869e44ff4d87a Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 7 Apr 2024 18:37:15 +0200
Subject: Adding upstream version 2.9.0dev.12.

Signed-off-by: Daniel Baumann
---
 src/chrtrans/README.format | 138 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)
 create mode 100644 src/chrtrans/README.format

diff --git a/src/chrtrans/README.format b/src/chrtrans/README.format
new file mode 100644
index 0000000..7437b50
--- /dev/null
+++ b/src/chrtrans/README.format
@@ -0,0 +1,138 @@
+Some notes on the format of table files used here.
+(See README.tables for what to do with them.)
+
+The format is derived from stuff in the console driver of the
+Linux kernel (as are the guts of the chartrans machinery).
+THAT DOES NOT MEAN that anything here is Linux specific - it isn't.
+
+[Note that the format may change, this is still somewhat experimental.]
+
+There are four kinds of lines:
+
+Summary example:
+
+   # This line is a comment, the next line is a directive
+   O Brand new Charset!
+   0x41 U+0041 U+0391
+   U+00cd:I'
+
+Description:
+
+a) comment lines start with a '#' character.
+   (trailing comments are allowed on some of the other lines, if in doubt
+   check the examples..)
+
+b) directives:
+   start with a keyword which may be abbreviated to one letter (first
+   letter must be capitalized), followed by space and a value.
+   Currently recognized:
+
+   OptionName
+        The name under which this should appear on the O)ptions screen
+        in the list for Display Character Set
+   MIMEName
+        The name for this charset in MIME syntax (one word with digits
+        and some other non-letters allowed, should be IANA registered)
+   Default
+        If "Y[es]" or "1", this is the default (fallback) translation table,
+        it will be used for Unicode -> 8bit (or 7bit) translation if no
+        translation is found in the specific table.
+   FallBack
+        Whether to use the default table if no translation is found in
+        this table.  Normally fallback is used; "FallBack NO" or "FallBack 0"
+        disables it (in fact, any value other than "FallBack Y[es]" or
+        "FallBack 1" disables it).
+
+   RawOrEnc
+        a number which flags some special property (encoding) for this
+        charset [see utf8_uni.tbl for example, see UCDefs.h for details].
+
+   Codepage number (IBM specific)
+        used by OS/2 font-switching code.
+
+c) character translation definitions:
+   they look like
+
+   0x41 U+0041 U+0391 ...
+
+   and are used for "forward" translation (mapping this charset to Unicode)
+   AS WELL AS "back" translation (mapping Unicodes to an 8-bit
+   [incl. 7-bit ASCII] code).
+
+   For the "forward" direction, only the first Unicode is used; for
+   "back" translation, all listed Unicodes are mapped to the byte (i.e.
+   code point) on the left.
+
+   The above example line would tell the chartrans mechanism:
+   "For this charset, code position 65 [hex 0x41] contains Unicode
+   U+0041 (LATIN CAPITAL LETTER A).  For translation of Unicodes to
+   this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL
+   LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."
+
+   [Note that for bytes in the ASCII range 0x00-0x7F, the forward translations
+   will (probably) not be used by Lynx.  It doesn't hurt to list those,
+   too, for completeness.]
+
+   Some other forms are also accepted:
+
+   * Syntax accepted:
+   *    ...
+   *    ...
+   *    idem
+   *    idem
+   *
+   *
+   * where <range> ::= <fontpos>-<fontpos>
+   * and <unicode> ::= U+<h><h><h><h>
+   * and <h> ::= <hexadecimal digit>
+   *
+
+   [Note that a <fontpos> _without_ <unicode> targets is assumed to be
+   not defined, so tables from ftp.unicode.org need no patching.]
+
+
+d) string replacement definitions:
+
+   They look like
+
+   U+00cd:I'
+
+   which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH
+   ACUTE) with the string (consisting of two characters) I' (if no other
+   translation is available)."  Please note that replacement definitions
+   in a specific charset table will override those from the Default table.
+
+   Note that everything after the ':' is currently taken VERBATIM, so
+   be careful with trailing blanks etc.  Please use the syntax below
+   when you need trailing spaces.
+
+   * Syntax accepted:
+   *    <unicode>        :<str>
+   *    <unicode range>  :<str>
+   *    <unicode>        "<C-str>"
+   *    <unicode range>  "<C-str>"
+   *
+   * where <unicode range> ::= <unicode>-<unicode>
+   * and <unicode> ::= U+<h><h><h><h>
+   * and <h> ::= <hexadecimal digit>
+   * and <str> any string not containing '\n' or '\0', taken verbatim
+   * and <C-str> any string, with backslash having the usual C meaning.
+
+Motivation:
+
+- It is an extension of the format already in use for Linux (kernel,
+  kbd package); those files can be used with some minimal editing.
+
+- It is easy to convert Unicode tables for other charsets, as they
+  are commonly found on ftp sites etc., to this format - the right
+  sed command should do 99% of the work.
+
+- The format is independent of details of other parts of the Lynx code,
+  unlike the "old" LYCharsets.c mechanism.  The tables don't have to
+  be changed in sync when, e.g., new entities are added to entities.h.
+
+
+Note: the Default "7bit approximation" table can be used for
+case-insensitive search for non-ascii letters if no upper/lower case
+information is provided by other means, e.g., locale.  It is assumed that
+upper/lower case letters have their "7bit approximation" images
+in def7_uni.tbl matched case-insensitively.
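
Editorial sketch (not part of the patch): the four kinds of lines described
in the README above can be told apart with a few string checks. The Python
below is a minimal illustration under stated assumptions, not the code Lynx
uses to read its tables. The names parse_table_line and build_maps are
hypothetical. Only the simple forms from the summary example are handled:
no ranges, no "idem", and no quoted C-style strings. Stripping a trailing
'#' comment from translation lines is an assumption, since the README only
says trailing comments are allowed on "some" lines.

```python
import re

# A Unicode value written as U+ followed by four hex digits, as in the README.
UNICODE_RE = re.compile(r"U\+([0-9A-Fa-f]{4})")


def parse_unicode(token):
    """Turn a token such as 'U+0391' into an integer, or raise ValueError."""
    match = UNICODE_RE.fullmatch(token)
    if match is None:
        raise ValueError("not a U+hhhh value: %r" % token)
    return int(match.group(1), 16)


def parse_table_line(line):
    """Classify one table line and return a (kind, payload) tuple."""
    # Drop only the line terminator and leading blanks: the text after ':'
    # in a replacement definition is taken verbatim, trailing blanks included.
    text = line.rstrip("\n").lstrip()
    if not text:
        return ("blank", None)
    if text.startswith("#"):
        return ("comment", text[1:].strip())
    if text.startswith("U+"):
        code, sep, replacement = text.partition(":")
        if not sep:
            raise ValueError("replacement line without ':': %r" % text)
        return ("replacement", (parse_unicode(code.strip()), replacement))
    if text[0].isdigit():
        # Assumption: a trailing '#' comment may follow a translation line.
        fields = text.split("#", 1)[0].split()
        position = int(fields[0], 0)  # base 0 accepts the 0x41 notation
        return ("translation", (position, [parse_unicode(f) for f in fields[1:]]))
    if text[0].isupper():
        keyword, _, value = text.partition(" ")
        return ("directive", (keyword, value.strip()))
    raise ValueError("unrecognized line: %r" % text)


def build_maps(lines):
    """Collect directives, forward/back translations and replacements."""
    directives, forward, back, replacements = {}, {}, {}, {}
    for line in lines:
        kind, payload = parse_table_line(line)
        if kind == "directive":
            directives[payload[0]] = payload[1]
        elif kind == "translation":
            position, unicodes = payload
            if unicodes:
                forward[position] = unicodes[0]  # forward: first Unicode only
            for value in unicodes:
                back[value] = position  # back: every listed Unicode
        elif kind == "replacement":
            replacements[payload[0]] = payload[1]
    return directives, forward, back, replacements


SAMPLE = [
    "# This line is a comment, the next line is a directive\n",
    "O Brand new Charset!\n",
    "0x41 U+0041 U+0391\n",
    "U+00cd:I'\n",
]
DIRECTIVES, FORWARD, BACK, REPLACEMENTS = build_maps(SAMPLE)
```

The sample lines are the ones from the summary example. The forward table
keeps only the first Unicode listed for a byte, while the back table maps
every listed Unicode to that byte, which mirrors the description in section c).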