Merging upstream version 1.67.1+dfsg1.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-17 12:18:32 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-17 12:18:32 +0000
commit: 4547b622d8d29df964fa2914213088b148c498fc (patch)
tree: 9fc6b25f3c3add6b745be9a2400a6e96140046e9 /vendor/bstr/scripts/regex/sentence.sh
parent: Releasing progress-linux version 1.66.0+dfsg1-1~progress7.99u1. (diff)
download: rustc-4547b622d8d29df964fa2914213088b148c498fc.tar.xz
rustc-4547b622d8d29df964fa2914213088b148c498fc.zip
1 files changed, 0 insertions, 176 deletions
diff --git a/vendor/bstr/scripts/regex/sentence.sh b/vendor/bstr/scripts/regex/sentence.sh
deleted file mode 100644
index 689d1849f..000000000
--- a/vendor/bstr/scripts/regex/sentence.sh
+++ /dev/null
@@ -1,176 +0,0 @@
-#!/bin/sh
-
-# vim: indentexpr= nosmartindent autoindent
-# vim: tabstop=2 shiftwidth=2 softtabstop=2
-
-# This is a regex that I reverse engineered from the sentence boundary chain
-# rules in UAX #29. Unlike the grapheme regex, which is essentially provided
-# for us in UAX #29, no such sentence regex exists.
-#
-# I looked into how ICU achieves this, since UAX #29 hints that producing
-# finite state machines for grapheme/sentence/word/line breaking is possible,
-# but only easy to do for graphemes. ICU does this by implementing their own
-# DSL for describing the break algorithms in terms of the chaining rules
-# directly. You can see an example for sentences in
-# icu4c/source/data/brkitr/rules/sent.txt. ICU then builds a finite state
-# machine from those rules in a mostly standard way, but implements the
-# "chaining" aspect of the rules by connecting overlapping end and start
-# states. For example, given SB7:
-#
-#     (Upper | Lower) ATerm x Upper
-#
-# Then the naive way to convert this into a regex would be something like
-#
-#     [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}
-#
-# Unfortunately, this is incorrect. Why? Well, consider an example like so:
-#
-#     U.S.A.
-#
-# A correct implementation of the sentence breaking algorithm should not insert
-# any breaks here, exactly in accordance with repeatedly applying rule SB7 as
-# given above. Our regex fails to do this because it will first match `U.S`
-# without breaking them---which is correct---but will then start looking for
-# its next rule beginning with a full stop (in ATerm) and followed by an
-# uppercase letter (A). This will wind up triggering rule SB11 (without
-# matching `A`), which inserts a break.
-#
-# The reason why this happens is because our initial application of rule SB7
-# "consumes" the next uppercase letter (S), which we want to reuse as a prefix
-# in the next rule application. A natural way to express this would be with
-# look-around, although it's not clear that works in every case since you
-# ultimately might want to consume that ending uppercase letter. In any case,
-# we can't use look-around in our truly regular regexes, so we must fix this.
-# The approach we take is to explicitly repeat rules when a suffix of a rule
-# is a prefix of another rule. In the case of SB7, the end of the rule, an
-# uppercase letter, also happens to match the beginning of the rule. This can
-# in turn be repeated indefinitely. Thus, our actual translation to a regex is:
-#
-#     [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}(\p{sb=ATerm}\p{sb=Upper}*
-#
-# It turns out that this is exactly what ICU does, but in their case, they do
-# it automatically. In our case, we connect the chaining rules manually. It's
-# tedious. With that said, we do no implement Unicode line breaking with this
-# approach, which is a far scarier beast. In that case, it would probably be
-# worth writing the code to do what ICU does.
-#
-# In the case of sentence breaks, there aren't *too* many overlaps of this
-# nature. We list them out exhaustively to make this clear, because it's
-# essentially impossible to easily observe this in the regex. (It took me a
-# full day to figure all of this out.) Rules marked with N/A mean that they
-# specify a break, and this strategy only really applies to stringing together
-# non-breaks.
-#
-#     SB1   - N/A
-#     SB2   - N/A
-#     SB3   - None
-#     SB4   - N/A
-#     SB5   - None
-#     SB6   - None
-#     SB7   - End overlaps with beginning of SB7
-#     SB8   - End overlaps with beginning of SB7
-#     SB8a  - End overlaps with beginning of SB6, SB8, SB8a, SB9, SB10, SB11
-#     SB9   - None
-#     SB10  - None
-#     SB11  - None
-#     SB998 - N/A
-#
-# SB8a is in particular quite tricky to get right without look-ahead, since it
-# allows ping-ponging between match rules SB8a and SB9-11, where SB9-11
-# otherwise indicate that a break has been found. In the regex below, we tackle
-# this by only permitting part of SB8a to match inside our core non-breaking
-# repetition. In particular, we only allow the parts of SB8a to match that
-# permit the non-breaking components to continue. If a part of SB8a matches
-# that guarantees a pop out to SB9-11, (like `STerm STerm`), then we let it
-# happen. This still isn't correct because an SContinue might be seen which
-# would allow moving back into SB998 and thus the non-breaking repetition, so
-# we handle that case as well.
-#
-# Finally, the last complication here is the sprinkling of $Ex* everywhere.
-# This essentially corresponds to the implementation of SB5 by following
-# UAX #29's recommendation in S6.2. Essentially, we use it avoid ever breaking
-# in the middle of a grapheme cluster.
-
-CR="\p{sb=CR}"
-LF="\p{sb=LF}"
-Sep="\p{sb=Sep}"
-Close="\p{sb=Close}"
-Sp="\p{sb=Sp}"
-STerm="\p{sb=STerm}"
-ATerm="\p{sb=ATerm}"
-SContinue="\p{sb=SContinue}"
-Numeric="\p{sb=Numeric}"
-Upper="\p{sb=Upper}"
-Lower="\p{sb=Lower}"
-OLetter="\p{sb=OLetter}"
-
-Ex="[\p{sb=Extend}\p{sb=Format}]"
-ParaSep="[$Sep $CR $LF]"
-SATerm="[$STerm $ATerm]"
-
-LetterSepTerm="[$OLetter $Upper $Lower $ParaSep $SATerm]"
-
-echo "(?x)
-(
-  # SB6
-  $ATerm $Ex*
-    $Numeric
-  |
-  # SB7
-  [$Upper $Lower] $Ex* $ATerm $Ex*
-    $Upper $Ex*
-    # overlap with SB7
-    ($ATerm $Ex* $Upper $Ex*)*
-  |
-  # SB8
-  $ATerm $Ex* $Close* $Ex* $Sp* $Ex*
-    ([^$LetterSepTerm] $Ex*)* $Lower $Ex*
-    # overlap with SB7
-    ($ATerm $Ex* $Upper $Ex*)*
-  |
-  # SB8a
-  $SATerm $Ex* $Close* $Ex* $Sp* $Ex*
-  (
-    $SContinue
-    |
-    $ATerm $Ex*
-      # Permit repetition of SB8a
-      (($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
-      # In order to continue non-breaking matching, we now must observe
-      # a match with a rule that keeps us in SB6-8a. Otherwise, we've entered
-      # one of SB9-11 and know that a break must follow.
-      (
-        # overlap with SB6
-        $Numeric
-        |
-        # overlap with SB8
-        ($Close $Ex*)* ($Sp $Ex*)*
-          ([^$LetterSepTerm] $Ex*)* $Lower $Ex*
-          # overlap with SB7
-          ($ATerm $Ex* $Upper $Ex*)*
-        |
-        # overlap with SB8a
-        ($Close $Ex*)* ($Sp $Ex*)* $SContinue
-      )
-    |
-    $STerm $Ex*
-      # Permit repetition of SB8a
-      (($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
-      # As with ATerm above, in order to continue non-breaking matching, we
-      # must now observe a match with a rule that keeps us out of SB9-11.
-      # For STerm, the only such possibility is to see an SContinue. Anything
-      # else will result in a break.
-      ($Close $Ex*)* ($Sp $Ex*)* $SContinue
-  )
-  |
-  # SB998
-  # The logic behind this catch-all is that if we get to this point and
-  # see a Sep, CR, LF, STerm or ATerm, then it has to fall into one of
-  # SB9, SB10 or SB11. In the cases of SB9-11, we always find a break since
-  # SB11 acts as a catch-all to induce a break following a SATerm that isn't
-  # handled by rules SB6-SB8a.
-  [^$ParaSep $SATerm]
-)*
-# The following collapses rules SB3, SB4, part of SB8a, SB9, SB10 and SB11.
-($SATerm $Ex* ($Close $Ex*)* ($Sp $Ex*)*)* ($CR $LF | $ParaSep)?
-"
author	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-04-17 12:18:32 +0000
committer	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-04-17 12:18:32 +0000
commit	4547b622d8d29df964fa2914213088b148c498fc (patch)
tree	9fc6b25f3c3add6b745be9a2400a6e96140046e9 /vendor/bstr/scripts/regex/sentence.sh
parent	Releasing progress-linux version 1.66.0+dfsg1-1~progress7.99u1. (diff)
download	rustc-4547b622d8d29df964fa2914213088b148c498fc.tar.xz rustc-4547b622d8d29df964fa2914213088b148c498fc.zip