Some notes on the format of table files used here.
(See README.tables for what to do with them.)

The format is derived from stuff in the console driver of the
Linux kernel (as are the guts of the chartrans machinery).
THAT DOES NOT MEAN that anything here is Linux specific - it isn't.

[Note that the format may change, this is still somewhat experimental.]

There are four kinds of lines:

Summary example:

# This line is a comment, the next line is a directive
O Brand new Charset!
0x41    U+0041 U+0391
U+00cd:I'

Description:

a) comment lines start with a '#' character.
   (trailing comments are allowed on some of the other lines, if in
   doubt check the examples.)

b) directives:
   start with a keyword which may be abbreviated to one letter (first
   letter must be capitalized), followed by space and a value.

   Currently recognized:

   OptionName

        The name under which this should appear on the O)ptions screen in
        the list for Display Character Set

   MIMEName

        The name for this charset in MIME syntax (one word with digits
        and some other non-letters allowed, should be IANA registered)

   Default

        If "Y[es]" or "1", this is the default (fallback) translation table,
        it will be used for Unicode -> 8bit (or 7bit) translation if no
        translation is found in the specific table.

   FallBack

        Whether to use the default table if no translation is found in
        this table.  Normally fallback is used, "FallBack NO" or
        "FallBack 0" disables it (actually, other values than
        "FallBack Y[es]" or "FallBack 1" disable it).

   RawOrEnc

        a number which flags some special property (encoding) for this
        charset [see utf8_uni.tbl for example, see UCDefs.h for details].

        Codepage number (IBM specific) used by OS/2 font-switching code.

c) character translation definitions:
   they look like

     0x41    U+0041 U+0391 ...

   and are used for "forward" translation (mapping this charset to Unicode)
   AS WELL AS "back" translation (mapping Unicodes to an 8-bit
   [incl. 7-bit ASCII] code).

   For the "forward" direction, only the first Unicode is used; for
   "back" translation, all listed Unicodes are mapped to the byte (i.e.
   code point) on the left.

   The above example line would tell the chartrans mechanism:
   "For this charset, code position 65 [hex 0x41] contains Unicode
   U+0041 (LATIN CAPITAL LETTER A).  For translation of Unicodes to
   this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL
   LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."

   [Note that for bytes in the ASCII range 0x00-0x7F, the forward
   translations will (probably) not be used by Lynx.  It doesn't hurt
   to list those, too, for completeness.]

   Some other forms are also accepted:

 * Syntax accepted:
 *      <fontpos>       <unicode> <unicode> ...
 *      <fontpos>       <unicode range> <unicode range> ...
 *      <fontpos>       idem
 *      <range>         idem
 *      <range>         <unicode range>
 *
 * where <range> ::= <fontpos>-<fontpos>
 * and <unicode> ::= U+<h><h><h><h>
 * and <h> ::= <hexadecimal digit>
 * [Note that a <fontpos> _without_ targets is assumed to be not defined,
 * so tables from ftp.unicode.org need no patching.]

d) string replacement definitions:
   They look like

     U+00cd:I'

   which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH
   ACUTE) with the string (consisting of two characters) I' (if no other
   translation is available)."
   Please note that replacement definitions in a specific charset table
   will override the ones from the Default table.

   Note that everything after the ':' is currently taken VERBATIM, so
   be careful with trailing blanks etc.  Please use the quoted syntax
   below when you need trailing spaces.

 * Syntax accepted:
 *      <unicode>       :<str>
 *      <unicode range> :<str>
 *      <unicode>       "<C-str>"
 *      <unicode range> "<C-str>"
 *
 * where <unicode range> ::= <unicode>-<unicode>
 * and <unicode> ::= U+<h><h><h><h>
 * and <h> ::= <hexadecimal digit>
 * and <str> any string not containing '\n' or '\0', taken verbatim
 * and <C-str> any string, with backslash having the usual C meaning.

Motivation:

- It is an extension of the format already in use for Linux (kernel,
  kbd package), those files can be used with some minimal editing.
- It is easy to convert Unicode tables for other charsets, as they are
  commonly found on ftp sites etc., to this format - the right sed
  command should do 99% of the work.
- The format is independent of details of other parts of the Lynx code,
  unlike the "old" LYCharsets.c mechanism.
  The tables don't have to be changed in synch when, e.g., new entities
  are added to entities.h.

Note: the Default "7bit approximation" table can be used for
case-insensitive search for non-ASCII letters if no upper/lower case
information is provided by other means, e.g., locale.  It is assumed that
upper/lower case letters have their "7bit approximation" images in
def7_uni.tbl matched case-insensitively.
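
Illustration: the four kinds of lines described above can be told apart
with very little code.  The following is only a rough sketch in Python,
NOT the parser Lynx actually uses; the names parse_table and SAMPLE are
made up for this example.  It assumes the simple forms only (one code
position per translation line, the ':' form for replacements) and it
assumes that a '#' on a translation line starts a trailing comment.
Ranges, "idem" and the quoted C-string form are not handled.

```python
import re

# Hypothetical helper, for illustration only (not Lynx code).
_UNI = re.compile(r"U\+([0-9A-Fa-f]{4})")
_REPL = re.compile(r"U\+([0-9A-Fa-f]{4}):(.*)$")

SAMPLE = """\
# This line is a comment, the next line is a directive
O Brand new Charset!
0x41    U+0041 U+0391
U+00cd:I'
"""


def parse_table(text):
    """Classify each line and collect what it defines."""
    directives = {}    # first letter of keyword -> value
    forward = {}       # code position -> first listed Unicode
    back = {}          # every listed Unicode -> code position
    replacements = {}  # Unicode -> replacement string, taken verbatim
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # a) comment (or blank) line
        if line.startswith("U+"):         # d) string replacement
            m = _REPL.match(line)
            if m:
                replacements[int(m.group(1), 16)] = m.group(2)
            continue
        if line[0].isdigit():             # c) character translation
            fields = line.split("#", 1)[0].split()  # assumed comment rule
            pos = int(fields[0], 0)       # accepts 0x41 as well as 65
            codes = [int(m.group(1), 16)
                     for m in map(_UNI.fullmatch, fields[1:]) if m]
            if codes:
                forward[pos] = codes[0]   # forward: first Unicode only
            for u in codes:
                back[u] = pos             # back: all listed Unicodes
            continue
        if line[0].isupper():             # b) directive
            keyword, _, value = line.partition(" ")
            directives[keyword[0]] = value.strip()
    return directives, forward, back, replacements


directives, forward, back, replacements = parse_table(SAMPLE)
print(directives["O"])       # Brand new Charset!
print(hex(forward[0x41]))    # 0x41
print(hex(back[0x0391]))     # 0x41
print(replacements[0x00CD])  # I'
```

Note that the replacement check comes before the directive check, since
a "U+..." line also starts with a capital letter.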