summaryrefslogtreecommitdiffstats
path: root/vendor/bstr/scripts/regex/sentence.sh
blob: 689d1849fd9c130336584ca7b01096aa8cb99aca (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
#!/bin/sh

# vim: indentexpr= nosmartindent autoindent
# vim: tabstop=2 shiftwidth=2 softtabstop=2

# This is a regex that I reverse engineered from the sentence boundary chain
# rules in UAX #29. Unlike the grapheme regex, which is essentially provided
# for us in UAX #29, no such sentence regex exists.
#
# I looked into how ICU achieves this, since UAX #29 hints that producing
# finite state machines for grapheme/sentence/word/line breaking is possible,
# but only easy to do for graphemes. ICU does this by implementing their own
# DSL for describing the break algorithms in terms of the chaining rules
# directly. You can see an example for sentences in
# icu4c/source/data/brkitr/rules/sent.txt. ICU then builds a finite state
# machine from those rules in a mostly standard way, but implements the
# "chaining" aspect of the rules by connecting overlapping end and start
# states. For example, given SB7:
#
#     (Upper | Lower) ATerm x Upper
#
# Then the naive way to convert this into a regex would be something like
#
#     [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}
#
# Unfortunately, this is incorrect. Why? Well, consider an example like so:
#
#     U.S.A.
#
# A correct implementation of the sentence breaking algorithm should not insert
# any breaks here, exactly in accordance with repeatedly applying rule SB7 as
# given above. Our regex fails to do this because it will first match `U.S`
# without breaking them---which is correct---but will then start looking for
# its next rule beginning with a full stop (in ATerm) and followed by an
# uppercase letter (A). This will wind up triggering rule SB11 (without
# matching `A`), which inserts a break.
#
# The reason why this happens is because our initial application of rule SB7
# "consumes" the next uppercase letter (S), which we want to reuse as a prefix
# in the next rule application. A natural way to express this would be with
# look-around, although it's not clear that works in every case since you
# ultimately might want to consume that ending uppercase letter. In any case,
# we can't use look-around in our truly regular regexes, so we must fix this.
# The approach we take is to explicitly repeat rules when a suffix of a rule
# is a prefix of another rule. In the case of SB7, the end of the rule, an
# uppercase letter, also happens to match the beginning of the rule. This can
# in turn be repeated indefinitely. Thus, our actual translation to a regex is:
#
#     [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}(\p{sb=ATerm}\p{sb=Upper}*
#
# It turns out that this is exactly what ICU does, but in their case, they do
# it automatically. In our case, we connect the chaining rules manually. It's
# tedious. With that said, we do no implement Unicode line breaking with this
# approach, which is a far scarier beast. In that case, it would probably be
# worth writing the code to do what ICU does.
#
# In the case of sentence breaks, there aren't *too* many overlaps of this
# nature. We list them out exhaustively to make this clear, because it's
# essentially impossible to easily observe this in the regex. (It took me a
# full day to figure all of this out.) Rules marked with N/A mean that they
# specify a break, and this strategy only really applies to stringing together
# non-breaks.
#
#     SB1   - N/A
#     SB2   - N/A
#     SB3   - None
#     SB4   - N/A
#     SB5   - None
#     SB6   - None
#     SB7   - End overlaps with beginning of SB7
#     SB8   - End overlaps with beginning of SB7
#     SB8a  - End overlaps with beginning of SB6, SB8, SB8a, SB9, SB10, SB11
#     SB9   - None
#     SB10  - None
#     SB11  - None
#     SB998 - N/A
#
# SB8a is in particular quite tricky to get right without look-ahead, since it
# allows ping-ponging between match rules SB8a and SB9-11, where SB9-11
# otherwise indicate that a break has been found. In the regex below, we tackle
# this by only permitting part of SB8a to match inside our core non-breaking
# repetition. In particular, we only allow the parts of SB8a to match that
# permit the non-breaking components to continue. If a part of SB8a matches
# that guarantees a pop out to SB9-11, (like `STerm STerm`), then we let it
# happen. This still isn't correct because an SContinue might be seen which
# would allow moving back into SB998 and thus the non-breaking repetition, so
# we handle that case as well.
#
# Finally, the last complication here is the sprinkling of $Ex* everywhere.
# This essentially corresponds to the implementation of SB5 by following
# UAX #29's recommendation in S6.2. Essentially, we use it avoid ever breaking
# in the middle of a grapheme cluster.

CR="\p{sb=CR}"
LF="\p{sb=LF}"
Sep="\p{sb=Sep}"
Close="\p{sb=Close}"
Sp="\p{sb=Sp}"
STerm="\p{sb=STerm}"
ATerm="\p{sb=ATerm}"
SContinue="\p{sb=SContinue}"
Numeric="\p{sb=Numeric}"
Upper="\p{sb=Upper}"
Lower="\p{sb=Lower}"
OLetter="\p{sb=OLetter}"

Ex="[\p{sb=Extend}\p{sb=Format}]"
ParaSep="[$Sep $CR $LF]"
SATerm="[$STerm $ATerm]"

LetterSepTerm="[$OLetter $Upper $Lower $ParaSep $SATerm]"

echo "(?x)
(
  # SB6
  $ATerm $Ex*
    $Numeric
  |
  # SB7
  [$Upper $Lower] $Ex* $ATerm $Ex*
    $Upper $Ex*
    # overlap with SB7
    ($ATerm $Ex* $Upper $Ex*)*
  |
  # SB8
  $ATerm $Ex* $Close* $Ex* $Sp* $Ex*
    ([^$LetterSepTerm] $Ex*)* $Lower $Ex*
    # overlap with SB7
    ($ATerm $Ex* $Upper $Ex*)*
  |
  # SB8a
  $SATerm $Ex* $Close* $Ex* $Sp* $Ex*
  (
    $SContinue
    |
    $ATerm $Ex*
      # Permit repetition of SB8a
      (($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
      # In order to continue non-breaking matching, we now must observe
      # a match with a rule that keeps us in SB6-8a. Otherwise, we've entered
      # one of SB9-11 and know that a break must follow.
      (
        # overlap with SB6
        $Numeric
        |
        # overlap with SB8
        ($Close $Ex*)* ($Sp $Ex*)*
          ([^$LetterSepTerm] $Ex*)* $Lower $Ex*
          # overlap with SB7
          ($ATerm $Ex* $Upper $Ex*)*
        |
        # overlap with SB8a
        ($Close $Ex*)* ($Sp $Ex*)* $SContinue
      )
    |
    $STerm $Ex*
      # Permit repetition of SB8a
      (($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
      # As with ATerm above, in order to continue non-breaking matching, we
      # must now observe a match with a rule that keeps us out of SB9-11.
      # For STerm, the only such possibility is to see an SContinue. Anything
      # else will result in a break.
      ($Close $Ex*)* ($Sp $Ex*)* $SContinue
  )
  |
  # SB998
  # The logic behind this catch-all is that if we get to this point and
  # see a Sep, CR, LF, STerm or ATerm, then it has to fall into one of
  # SB9, SB10 or SB11. In the cases of SB9-11, we always find a break since
  # SB11 acts as a catch-all to induce a break following a SATerm that isn't
  # handled by rules SB6-SB8a.
  [^$ParaSep $SATerm]
)*
# The following collapses rules SB3, SB4, part of SB8a, SB9, SB10 and SB11.
($SATerm $Ex* ($Close $Ex*)* ($Sp $Ex*)*)* ($CR $LF | $ParaSep)?
"