1 files changed, 339 insertions, 0 deletions
diff --git a/src/grep/TODO b/src/grep/TODO
new file mode 100644
index 0000000..5211ac1
--- /dev/null
+++ b/src/grep/TODO
@@ -0,0 +1,339 @@
+Things to do for GNU grep
+
+  Copyright (C) 1992, 1997-2002, 2004-2021 Free Software Foundation, Inc.
+
+  Copying and distribution of this file, with or without modification,
+  are permitted in any medium without royalty provided the copyright
+  notice and this notice are preserved.
+
+===============
+Short term work
+===============
+
+See where we are with UTF-8 performance.
+
+Merge Debian patches that seem relevant.
+
+Go through patches in Savannah.
+
+Fix --directories=read.
+
+Write better Texinfo documentation for grep.  The manual page would be a
+good place to start, but Info documents are also supposed to contain a
+tutorial and examples.
+
+Some tests in tests/spencer2.tests should have failed!  Need to filter out
+some bugs in dfa.[ch]/regex.[ch].
+
+Multithreading?
+
+GNU grep originally did 32-bit arithmetic.  Although it has moved to
+64-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
+this conversion has not been entirely systematic and should be checked.
+
+Lazy dynamic linking of libpcre.  See Debian’s 03-397262-dlopen-pcre.patch.
+
+Check FreeBSD’s integration of zgrep (-Z) and bzgrep (-J) in one
+binary.  Is there a possibility of doing even better by automatically
+checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
+0x9D for compress, and 0x42 0x5A 0x68 for bzip2)?  Once what to do with
+libpcre is decided, do the same for libz and libbz2.
+
+
+===================
+Matching algorithms
+===================
+
+Take a look at these and consider opportunities for merging or cloning:
+
+   -- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
+   -- ja-grep’s mlb2 patch (Japanese grep)
+      <http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
+   -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
+      <http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
+   -- cgrep (Context grep) <https://awgn.github.io/cgrep/>
+      seems like nice work;
+   -- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
+   -- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
+      from glimpse;
+   -- nr-grep (Nondeterministic reverse grep)
+      <https://www.dcc.uchile.cl/~gnavarro/software/>;
+   -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
+   -- freegrep <https://github.com/howardjp/freegrep>;
+
+Check some new algorithms for matching.  See, for example, Faro &
+Lecroq (cited in kwset.c).
+
+Fix the DFA matcher to never use exponential space.  (Fortunately, these
+cases are rare.)
+
+
+============================
+Standards: POSIX and Unicode
+============================
+
+For POSIX compliance issues, see POSIX 1003.1.
+
+Current support for the POSIX [= =] and [. .] constructs is limited to
+platforms whose regular expression matchers are sufficiently
+compatible with the GNU C library so that the --without-included-regex
+option of ‘configure’ is in effect.  Extend this support to non-glibc
+platforms, where --with-included-regex is in effect, by modifying the
+included version of the regex code to defer to the native version when
+handling [= =] and [. .].
+
+For Unicode, interesting things to check include the Unicode Standard
+<https://www.unicode.org/standard/standard.html> and the Unicode Technical
+Standard #18 (<https://www.unicode.org/reports/tr18/> “Unicode Regular
+Expressions”).  Talk to Bruno Haible who’s maintaining GNU libunistring.
+See also Unicode Standard Annex #15 (<https://www.unicode.org/reports/tr15/>
+“Unicode Normalization Forms”), already implemented by GNU libunistring.
+
+In particular, --ignore-case needs to be evaluated against the standards.
+We may want to deviate from POSIX if Unicode provides better or clearer
+semantics.
+
+POSIX and --ignore-case
+-----------------------
+
+For this issue, interesting things to check in POSIX include the
+Open Group Base Specifications, Chapter “Regular Expressions”, in
+particular Section “Regular Expression General Requirements” and its
+paragraph about caseless matching (this may not have been fully
+thought through and that this text may be self-contradicting
+[specifically: “of either data or patterns” versus all the rest]).
+See:
+
+http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02
+
+In particular, consider the following with POSIX’s approach to case
+folding in mind.  Assume a non-Turkic locale with a character
+repertoire reduced to the following various forms of “LATIN LETTER I”:
+
+  0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
+  0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
+  0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
+    LATIN CAPITAL LETTER I DOT;;;0069;
+  0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
+
+UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
+versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1).  This implies that
+whole UTF-8 strings cannot be case-converted in place, using the same
+memory buffer, and that the needed octet-size of the new buffer cannot
+merely be guessed (although there’s a simple upper bound of five times
+the size of the input, as the longest UTF-8 encoding of any character
+is five bytes).
+
+We have
+
+  lc(I) = i, uc(I) = I
+  lc(i) = i, uc(i) = I
+  lc(İ) = i, uc(İ) = İ
+  lc(ı) = ı, uc(ı) = I
+
+where lc() and uc() denote lower-case and upper-case conversions.
+
+There are several candidate --ignore-case logics.  Using the
+
+  if (lc(input_wchar) == lc(pattern_wchar))
+
+logic leads to the following matches:
+
+    \in  I  i  İ  ı
+  pat\   ----------
+   I  |  Y  Y  Y  n
+   i  |  Y  Y  Y  n
+   İ  |  Y  Y  Y  n
+   ı  |  n  n  n  Y
+
+There is a lack of symmetry between CAPITAL and SMALL LETTERs with
+this.  Using the
+
+  if (uc(input_wchar) == uc(pattern_wchar))
+
+logic (which is what GNU grep currently does although this is not
+documented or guaranteed in the future), leads to the following
+matches:
+
+    \in  I  i  İ  ı
+  pat\   ----------
+   I  |  Y  Y  n  Y
+   i  |  Y  Y  n  Y
+   İ  |  n  n  Y  n
+   ı  |  Y  Y  n  Y
+
+There is a lack of symmetry between CAPITAL and SMALL LETTERs with
+this.
+
+Using the
+
+  if (lc(input_wchar) == lc(pattern_wchar)
+      || uc(input_wchar) == uc(pattern_wchar))
+
+logic leads to the following matches:
+
+    \in  I  i  İ  ı
+  pat\   ----------
+   I  |  Y  Y  Y  Y
+   i  |  Y  Y  Y  Y
+   İ  |  Y  Y  Y  n
+   ı  |  Y  Y  n  Y
+
+There is some elegance and symmetry with this.  But there are
+potentially two conversions to be made per input character.  If the
+pattern is pre-converted, two copies of it need to be kept and used in
+a mutually coherent fashion.
+
+Using the
+
+  if (input_wchar  == pattern_wchar
+      || lc(input_wchar) == pattern_wchar
+      || uc(input_wchar) == pattern_wchar)
+
+logic (a plausible interpretation of POSIX) leads to the following
+matches:
+
+    \in  I  i  İ  ı
+  pat\   ----------
+   I  |  Y  Y  n  Y
+   i  |  Y  Y  Y  n
+   İ  |  n  n  Y  n
+   ı  |  n  n  n  Y
+
+There is a different CAPITAL/SMALL symmetry with this.  But there’s
+also a loss of pattern/input symmetry that’s unique to it.  Also there
+are potentially two conversions to be made per input character.
+
+Using the
+
+  if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
+
+logic leads to the following matches:
+
+    \in  I  i  İ  ı
+  pat\   ----------
+   I  |  Y  Y  Y  Y
+   i  |  Y  Y  Y  Y
+   İ  |  Y  Y  Y  Y
+   ı  |  Y  Y  Y  Y
+
+This shows total symmetry and transitivity (at least in this example
+analysis).  There are two conversions to be made per input character,
+but support could be added for having a single straight mapping
+performing a composition of the two conversions.
+
+Any optimization in the implementation of each logic must not change
+its basic semantic.
+
+
+Unicode and --ignore-case
+-------------------------
+
+For this issue, interesting things to check in Unicode include:
+
+  - The Unicode Standard, Chapter 3
+    (<https://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
+    “Conformance”), Section 3.13 (“Default Case Algorithms”) and the
+    toCasefold() case conversion operation.
+
+  - The Unicode Standard, Chapter 4
+    (<https://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
+    “Character Properties”), Section 4.2 (“Case”) and
+    the <https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
+    SpecialCasing.txt and
+    <https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
+    CaseFolding.txt files.
+
+  - The Unicode Standard, Chapter 5
+    (<https://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
+    “Implementation Guidelines”), Section 5.18 (“Case Mappings”),
+    Subsection “Caseless Matching”.
+
+  - The Unicode case charts <https://www.unicode.org/charts/case/>.
+
+Unicode uses the
+
+  if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
+
+logic for caseless matching.  Consider the “LATIN LETTER I” example
+mentioned above.  In a non-Turkic locale, simple case folding yields
+
+  toCasefold_simple(U+0049) = U+0069
+  toCasefold_simple(U+0069) = U+0069
+  toCasefold_simple(U+0130) = U+0130
+  toCasefold_simple(U+0131) = U+0131
+
+which leads to the following matches:
+
+    \in  I  i  İ  ı
+  pat\   ----------
+   I  |  Y  Y  n  n
+   i  |  Y  Y  n  n
+   İ  |  n  n  Y  n
+   ı  |  n  n  n  Y
+
+This is different from anything so far!
+
+In a non-Turkic locale, full case folding yields
+
+  toCasefold_full(U+0049) = U+0069
+  toCasefold_full(U+0069) = U+0069
+  toCasefold_full(U+0130) = <U+0069, U+0307>
+  toCasefold_full(U+0131) = U+0131
+
+with
+
+  0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
+
+which leads to the following matches:
+
+    \in  I  i  İ  ı
+  pat\   ----------
+   I  |  Y  Y  *  n
+   i  |  Y  Y  *  n
+   İ  |  n  n  Y  n
+   ı  |  n  n  n  Y
+
+This is just sad!
+
+Having toCasefold(U+0131), simple or full, map to itself instead of
+U+0069 is in contradiction with the rules of Section 5.18 of the
+Unicode Standard since toUpperCase(U+0131) is U+0049.  Same thing for
+toCasefold_simple(U+0130) since toLowerCase(U+0131) is U+0069.  The
+justification for the weird toCasefold_full(U+0130) mapping is
+unknown; it doesn’t even make sense to add a dot (U+0307) to a letter
+that already has one (U+0069).  It would have been so simple to put
+them all in the same equivalence class!
+
+Otherwise, also consider the following problem with Unicode’s approach
+on case folding in mind.  Assume that we want to perform
+
+  echo 'AßBC' | grep -i 'Sb'
+
+which corresponds to
+
+  input:    U+0041 U+00DF U+0042 U+0043 U+000A
+  pattern:  U+0053 U+0062
+
+Following CaseFolding.txt, applying the toCasefold() transformation to
+these yields
+
+  input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
+  pattern:                U+0073 U+0062
+
+so, according to this approach, the input should match the pattern.
+As long as the original input line is to be reported to the user as a
+whole, there is no problem (from the user’s point-of-view;
+implementation is complicated by this).
+
+However, consider both these GNU extensions:
+
+  echo 'AßBC' | grep -i --only-matching 'Sb'
+  echo 'AßBC' | grep -i --color=always  'Sb'
+
+What is to be reported in these cases, since the match begins in the
+*middle* of the original input character ‘ß’?
+
+Unicode’s toCasefold() cannot be implemented in terms of POSIX’s
+towctrans() since that can only return a single wint_t value per input
+wint_t value.