diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-17 12:18:32 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-17 12:18:32 +0000 |
commit | 4547b622d8d29df964fa2914213088b148c498fc (patch) | |
tree | 9fc6b25f3c3add6b745be9a2400a6e96140046e9 /vendor/bstr/scripts/regex/sentence.sh | |
parent | Releasing progress-linux version 1.66.0+dfsg1-1~progress7.99u1. (diff) | |
download | rustc-4547b622d8d29df964fa2914213088b148c498fc.tar.xz rustc-4547b622d8d29df964fa2914213088b148c498fc.zip |
Merging upstream version 1.67.1+dfsg1.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'vendor/bstr/scripts/regex/sentence.sh')
-rw-r--r-- | vendor/bstr/scripts/regex/sentence.sh | 176 |
1 files changed, 0 insertions, 176 deletions
diff --git a/vendor/bstr/scripts/regex/sentence.sh b/vendor/bstr/scripts/regex/sentence.sh deleted file mode 100644 index 689d1849f..000000000 --- a/vendor/bstr/scripts/regex/sentence.sh +++ /dev/null @@ -1,176 +0,0 @@ -#!/bin/sh - -# vim: indentexpr= nosmartindent autoindent -# vim: tabstop=2 shiftwidth=2 softtabstop=2 - -# This is a regex that I reverse engineered from the sentence boundary chain -# rules in UAX #29. Unlike the grapheme regex, which is essentially provided -# for us in UAX #29, no such sentence regex exists. -# -# I looked into how ICU achieves this, since UAX #29 hints that producing -# finite state machines for grapheme/sentence/word/line breaking is possible, -# but only easy to do for graphemes. ICU does this by implementing their own -# DSL for describing the break algorithms in terms of the chaining rules -# directly. You can see an example for sentences in -# icu4c/source/data/brkitr/rules/sent.txt. ICU then builds a finite state -# machine from those rules in a mostly standard way, but implements the -# "chaining" aspect of the rules by connecting overlapping end and start -# states. For example, given SB7: -# -# (Upper | Lower) ATerm x Upper -# -# Then the naive way to convert this into a regex would be something like -# -# [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper} -# -# Unfortunately, this is incorrect. Why? Well, consider an example like so: -# -# U.S.A. -# -# A correct implementation of the sentence breaking algorithm should not insert -# any breaks here, exactly in accordance with repeatedly applying rule SB7 as -# given above. Our regex fails to do this because it will first match `U.S` -# without breaking them---which is correct---but will then start looking for -# its next rule beginning with a full stop (in ATerm) and followed by an -# uppercase letter (A). This will wind up triggering rule SB11 (without -# matching `A`), which inserts a break. -# -# The reason why this happens is because our initial application of rule SB7 -# "consumes" the next uppercase letter (S), which we want to reuse as a prefix -# in the next rule application. A natural way to express this would be with -# look-around, although it's not clear that works in every case since you -# ultimately might want to consume that ending uppercase letter. In any case, -# we can't use look-around in our truly regular regexes, so we must fix this. -# The approach we take is to explicitly repeat rules when a suffix of a rule -# is a prefix of another rule. In the case of SB7, the end of the rule, an -# uppercase letter, also happens to match the beginning of the rule. This can -# in turn be repeated indefinitely. Thus, our actual translation to a regex is: -# -# [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}(\p{sb=ATerm}\p{sb=Upper}* -# -# It turns out that this is exactly what ICU does, but in their case, they do -# it automatically. In our case, we connect the chaining rules manually. It's -# tedious. With that said, we do no implement Unicode line breaking with this -# approach, which is a far scarier beast. In that case, it would probably be -# worth writing the code to do what ICU does. -# -# In the case of sentence breaks, there aren't *too* many overlaps of this -# nature. We list them out exhaustively to make this clear, because it's -# essentially impossible to easily observe this in the regex. (It took me a -# full day to figure all of this out.) Rules marked with N/A mean that they -# specify a break, and this strategy only really applies to stringing together -# non-breaks. -# -# SB1 - N/A -# SB2 - N/A -# SB3 - None -# SB4 - N/A -# SB5 - None -# SB6 - None -# SB7 - End overlaps with beginning of SB7 -# SB8 - End overlaps with beginning of SB7 -# SB8a - End overlaps with beginning of SB6, SB8, SB8a, SB9, SB10, SB11 -# SB9 - None -# SB10 - None -# SB11 - None -# SB998 - N/A -# -# SB8a is in particular quite tricky to get right without look-ahead, since it -# allows ping-ponging between match rules SB8a and SB9-11, where SB9-11 -# otherwise indicate that a break has been found. In the regex below, we tackle -# this by only permitting part of SB8a to match inside our core non-breaking -# repetition. In particular, we only allow the parts of SB8a to match that -# permit the non-breaking components to continue. If a part of SB8a matches -# that guarantees a pop out to SB9-11, (like `STerm STerm`), then we let it -# happen. This still isn't correct because an SContinue might be seen which -# would allow moving back into SB998 and thus the non-breaking repetition, so -# we handle that case as well. -# -# Finally, the last complication here is the sprinkling of $Ex* everywhere. -# This essentially corresponds to the implementation of SB5 by following -# UAX #29's recommendation in S6.2. Essentially, we use it avoid ever breaking -# in the middle of a grapheme cluster. - -CR="\p{sb=CR}" -LF="\p{sb=LF}" -Sep="\p{sb=Sep}" -Close="\p{sb=Close}" -Sp="\p{sb=Sp}" -STerm="\p{sb=STerm}" -ATerm="\p{sb=ATerm}" -SContinue="\p{sb=SContinue}" -Numeric="\p{sb=Numeric}" -Upper="\p{sb=Upper}" -Lower="\p{sb=Lower}" -OLetter="\p{sb=OLetter}" - -Ex="[\p{sb=Extend}\p{sb=Format}]" -ParaSep="[$Sep $CR $LF]" -SATerm="[$STerm $ATerm]" - -LetterSepTerm="[$OLetter $Upper $Lower $ParaSep $SATerm]" - -echo "(?x) -( - # SB6 - $ATerm $Ex* - $Numeric - | - # SB7 - [$Upper $Lower] $Ex* $ATerm $Ex* - $Upper $Ex* - # overlap with SB7 - ($ATerm $Ex* $Upper $Ex*)* - | - # SB8 - $ATerm $Ex* $Close* $Ex* $Sp* $Ex* - ([^$LetterSepTerm] $Ex*)* $Lower $Ex* - # overlap with SB7 - ($ATerm $Ex* $Upper $Ex*)* - | - # SB8a - $SATerm $Ex* $Close* $Ex* $Sp* $Ex* - ( - $SContinue - | - $ATerm $Ex* - # Permit repetition of SB8a - (($Close $Ex*)* ($Sp $Ex*)* $SATerm)* - # In order to continue non-breaking matching, we now must observe - # a match with a rule that keeps us in SB6-8a. Otherwise, we've entered - # one of SB9-11 and know that a break must follow. - ( - # overlap with SB6 - $Numeric - | - # overlap with SB8 - ($Close $Ex*)* ($Sp $Ex*)* - ([^$LetterSepTerm] $Ex*)* $Lower $Ex* - # overlap with SB7 - ($ATerm $Ex* $Upper $Ex*)* - | - # overlap with SB8a - ($Close $Ex*)* ($Sp $Ex*)* $SContinue - ) - | - $STerm $Ex* - # Permit repetition of SB8a - (($Close $Ex*)* ($Sp $Ex*)* $SATerm)* - # As with ATerm above, in order to continue non-breaking matching, we - # must now observe a match with a rule that keeps us out of SB9-11. - # For STerm, the only such possibility is to see an SContinue. Anything - # else will result in a break. - ($Close $Ex*)* ($Sp $Ex*)* $SContinue - ) - | - # SB998 - # The logic behind this catch-all is that if we get to this point and - # see a Sep, CR, LF, STerm or ATerm, then it has to fall into one of - # SB9, SB10 or SB11. In the cases of SB9-11, we always find a break since - # SB11 acts as a catch-all to induce a break following a SATerm that isn't - # handled by rules SB6-SB8a. - [^$ParaSep $SATerm] -)* -# The following collapses rules SB3, SB4, part of SB8a, SB9, SB10 and SB11. -($SATerm $Ex* ($Close $Ex*)* ($Sp $Ex*)*)* ($CR $LF | $ParaSep)? -" |