path: root/src/chrtrans/README.format
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 20:21:21 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 20:21:21 +0000
commit: 510ed32cfbffa6148018869f5ade416505a450b3
tree: 0aafabcf3dfaab7685fa0fcbaa683dafe287807e /src/chrtrans/README.format
parent: Initial commit.
Adding upstream version 2.9.0rel.0. upstream/2.9.0rel.0
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/chrtrans/README.format')
-rw-r--r-- src/chrtrans/README.format 138
1 files changed, 138 insertions, 0 deletions
diff --git a/src/chrtrans/README.format b/src/chrtrans/README.format
new file mode 100644
index 0000000..7437b50
--- /dev/null
+++ b/src/chrtrans/README.format
@@ -0,0 +1,138 @@
+Some notes on the format of table files used here.
+(See README.tables for what to do with them.)
+
+The format is derived from stuff in the console driver of the
+Linux kernel (as are the guts of the chartrans machinery).
+THAT DOES NOT MEAN that anything here is Linux specific - it isn't.
+
+[Note that the format may change; this is still somewhat experimental.]
+
+There are four kinds of lines:
+
+Summary example:
+
+ # This line is a comment, the next line is a directive
+ O Brand new Charset!
+ 0x41 U+0041 U+0391
+ U+00cd:I'
+
+Description:
+
+a) comment lines start with a '#' character.
+ (trailing comments are allowed on some of the other lines; if in doubt,
+ check the examples.)
+
+b) directives:
+ start with a keyword which may be abbreviated to one letter (first
+ letter must be capitalized), followed by space and a value.
+ Currently recognized:
+
+ OptionName
+ The name under which this should appear on the O)ptions screen
+ in the list for Display Character Set
+ MIMEName
+ The name for this charset in MIME syntax (one word with digits
+ and some other non-letters allowed, should be IANA registered)
+ Default
+ If "Y[es]" or "1", this is the default (fallback) translation table;
+ it will be used for Unicode -> 8bit (or 7bit) translation if no
+ translation is found in the specific table.
+ FallBack
+ Whether to use the default table if no translation is found in
+ this table. Normally fallback is used; "FallBack NO" or "FallBack 0"
+ disables it (actually, any value other than "FallBack Y[es]" or
+ "FallBack 1" disables it).
+
+ RawOrEnc
+ a number which flags some special property (encoding) for this
+ charset [see utf8_uni.tbl for example, see UCDefs.h for details].
+
+ Codepage number (IBM specific)
+ used by OS/2 font-switching code.
+
+c) character translation definitions:
+ they look like
+
+ 0x41 U+0041 U+0391 ...
+
+ and are used for "forward" translation (mapping this charset to Unicode)
+ AS WELL AS "back" translation (mapping Unicodes to an 8-bit
+ [incl. 7-bit ASCII] code).
+
+ For the "forward" direction, only the first Unicode is used; for
+ "back" translation, all listed Unicodes are mapped to the byte (i.e.
+ code point) on the left.
+
+ The above example line would tell the chartrans mechanism:
+ "For this charset, code position 65 [hex 0x41] contains Unicode
+ U+0041 (LATIN CAPITAL LETTER A). For translation of Unicodes to
+ this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL
+ LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."
+
+ [Note that for bytes in the ASCII range 0x00-0x7F, the forward translations
+ will (probably) not be used by Lynx. It doesn't hurt to list those,
+ too, for completeness.]
+
+ Some other forms are also accepted:
+
+ * Syntax accepted:
+ * <fontpos> <unicode> <unicode> ...
+ * <fontpos> <unicode range> <unicode range> ...
+ * <fontpos> idem
+ * <range> idem
+ * <range> <unicode range>
+ *
+ * where <unicode range> ::= <unicode>-<unicode>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+ *
+ [Note that a <fontpos> _without_ targets is assumed to be not defined,
+ so tables from ftp.unicode.org need no patching.]
+
+
+d) string replacement definitions:
+
+ They look like
+
+ U+00cd:I'
+
+ which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH
+ ACUTE) with the string (consisting of two characters) I' (if no other
+ translation is available)." Please note that replacement definitions
+ in a specific charset table will override those from the Default table.
+
+ Note that everything after the ':' is currently taken VERBATIM, so be
+ careful with trailing blanks etc. Please use the <C replace> syntax below
+ when you need trailing spaces.
+
+ * Syntax accepted:
+ * <unicode> :<replace>
+ * <unicode range> :<replace>
+ * <unicode> "<C replace>"
+ * <unicode range> "<C replace>"
+ *
+ * where <unicode range> ::= <unicode>-<unicode>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+ * and <replace> any string not containing '\n' or '\0', taken verbatim
+ * and <C replace> any string, with backslash having the usual C meaning.
+
+Motivation:
+
+- It is an extension of the format already in use for Linux (kernel,
+ kbd package); those files can be used with some minimal editing.
+
+- It is easy to convert Unicode tables for other charsets, as they
+ are commonly found on ftp sites etc., to this format - the right
+ sed command should do 99% of the work.
+
+- The format is independent of details of other parts of the Lynx code,
+ unlike the "old" LYCharsets.c mechanism. The tables don't have to
+ be changed in sync when, e.g., new entities are added to entities.h.
+
+
+Note: the Default "7bit approximation" table can be used for
+case-insensitive search for non-ASCII letters if no upper/lower case
+information is provided by other means, e.g., locale. It is assumed that
+upper/lower case letters have their "7bit approximation" images
+in def7_uni.tbl matched case-insensitively.
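
The four kinds of lines described in README.format can be illustrated with a minimal Python sketch. It is not part of the commit and is not the parser that Lynx uses; the names parse_unicode, parse_line and build_tables are hypothetical. It assumes the simplest forms only (<fontpos> <unicode> ... and <unicode>:<replace>), assumes that a trailing '#' comment may follow a translation line, and raises ValueError for ranges, "idem" and the quoted <C replace> form.

```python
import re

# Illustrative sketch only; all names here are hypothetical.
UNICODE_RE = re.compile(r"U\+([0-9A-Fa-f]{4})$")


def parse_unicode(token):
    """Turn a token such as 'U+00cd' into an integer code point."""
    m = UNICODE_RE.match(token)
    if m is None:
        raise ValueError("unsupported <unicode> token: %r" % token)
    return int(m.group(1), 16)


def parse_line(line):
    """Classify one table line and return a tuple describing it."""
    stripped = line.rstrip("\n").lstrip()
    if not stripped:
        return ("blank",)
    if stripped.startswith("#"):
        return ("comment",)
    if stripped.startswith("U+"):
        # String replacement; everything after the first ':' is verbatim.
        head, sep, replace = stripped.partition(":")
        if not sep:
            raise ValueError("unsupported replacement form: %r" % line)
        return ("replace", parse_unicode(head.strip()), replace)
    if stripped[0].isdigit():
        # Translation line. Dropping a trailing '#' comment is an assumption.
        fields = stripped.split("#", 1)[0].split()
        fontpos = int(fields[0], 0)  # accepts '0x41' as well as '65'
        return ("translate", fontpos, [parse_unicode(t) for t in fields[1:]])
    if stripped[0].isupper():
        # Directive: only the capitalized first letter of the keyword matters.
        keyword, _, value = stripped.partition(" ")
        return ("directive", keyword[0], value.strip())
    raise ValueError("unrecognized line: %r" % line)


def build_tables(lines):
    """Build forward, back and replacement maps from table lines."""
    forward, back, replacements = {}, {}, {}
    for line in lines:
        kind, *rest = parse_line(line)
        if kind == "translate":
            fontpos, unicodes = rest
            if unicodes:
                forward[fontpos] = unicodes[0]  # forward: first Unicode only
            for u in unicodes:
                back[u] = fontpos               # back: every listed Unicode
        elif kind == "replace":
            replacements[rest[0]] = rest[1]
    return forward, back, replacements
```

Fed the four lines of the summary example, build_tables would map byte 0x41 forward to U+0041, map both U+0041 and U+0391 back to 0x41, and record the two-character string I' as the replacement for U+00cd.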