Unicode Version: 17.0.0
Date: 2024-11-27, 17:44:59 GMT
This page illustrates the application of the Word_Break specification. The material here is informative, not normative.
The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.
Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The × symbol indicates no break, while the ÷ symbol indicates a break. The cells with × are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.
After the heavy blue line in the table are additional rows, either with different sample characters or for sequences, such as “ALetter MidLetter”.
In the row and column headers of the Table, in the Rules, when hovering over characters in the Samples, and in the comments in the associated list of test cases WordBreakTest.txt:
Note that the resulting partition may be finer than needed for the algorithm.
If your browser handles titles (tooltips), then hovering the mouse over the row header will show a sample character of that type. Hovering over a column header will show the sample character, plus its abbreviated general category and script. Hovering over the intersected cells shows the rule number that produces the break-status. For example, hovering over the cell at the intersection of ExtendNumLet and ALetter shows ×, with the rule 13.2. Checking below the table, rule 13.2 is “ExtendNumLet × (AHLetter | Numeric | Katakana)”, which is the one that applies to that case. Note that a rule is invoked only when no lower-numbered rules have applied.
| | CR | LF | Newline | Extend | Format | Katakana | ALetter_ExtPict | ALettermExtPict | MidLetter | MidNum | MidNumLet | Numeric | ExtendNumLet | RI | Hebrew_Letter | Double_Quote | Single_Quote | ZWJ | ExtPictmALetter | WSegSpace | XXmExtPict |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CR | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
LF | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
Newline | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ |
Extend | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Format | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Katakana | ÷ | ÷ | ÷ | × | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ALetter_ExtPict | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | × | × | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ALettermExtPict | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | × | × | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
MidLetter | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
MidNum | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
MidNumLet | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Numeric | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | × | × | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ExtendNumLet | ÷ | ÷ | ÷ | × | × | × | × | × | ÷ | ÷ | ÷ | × | × | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
RI | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Hebrew_Letter | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | × | × | ÷ | × | ÷ | × | × | ÷ | ÷ | ÷ |
Double_Quote | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Single_Quote | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ZWJ | ÷ | ÷ | ÷ | × | × | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | × | ÷ | ÷ |
ExtPictmALetter | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
WSegSpace | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | × | ÷ |
XXmExtPict | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ALettermExtPict Format | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | × | × | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ALettermExtPict MidLetter | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ALettermExtPict Single_Quote | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ALettermExtPict Single_Quote Format | ÷ | ÷ | ÷ | × | × | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | × | ÷ | ÷ | ÷ |
ALettermExtPict MidNum | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Numeric MidLetter | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Numeric Single_Quote | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Numeric MidNum | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
Numeric MidNumLet Format | ÷ | ÷ | ÷ | × | × | ÷ | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ | ÷ | ÷ | × | ÷ | ÷ | ÷ |
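
The way a single cell of the chart is derived can be sketched in code. The following is a minimal illustration, not the generator used for this page: the names `PAIR_RULES` and `pair_status` are hypothetical, each character is assumed to carry exactly one property name (with `ExtPict` standing in for the Extended_Pictographic samples), and `AHLETTER` follows the macro defined in UAX #29. Only rules that can be decided from the two adjacent characters are listed, so the look-ahead and look-behind rules (6.0, 7.0, 7.2, 7.3, 11.0, 12.0) are omitted, and the RI entry assumes the pair starts the text. The first matching rule wins, which mirrors the statement that a rule is invoked only when no lower-numbered rule has applied.

```python
AHLETTER = {"ALetter", "Hebrew_Letter"}  # macro taken from UAX #29
IGNORED = {"Extend", "Format", "ZWJ"}
HARD = {"CR", "LF", "Newline"}
JOINABLE = AHLETTER | {"Numeric", "Katakana"}

# (rule number, status, predicate on (left, right)), in ascending rule order.
PAIR_RULES = [
    ("3.0", "×", lambda l, r: l == "CR" and r == "LF"),
    ("3.1", "÷", lambda l, r: l in HARD),
    ("3.2", "÷", lambda l, r: r in HARD),
    ("3.3", "×", lambda l, r: l == "ZWJ" and r == "ExtPict"),
    ("3.4", "×", lambda l, r: l == "WSegSpace" and r == "WSegSpace"),
    ("4.0", "×", lambda l, r: r in IGNORED),
    ("5.0", "×", lambda l, r: l in AHLETTER and r in AHLETTER),
    ("7.1", "×", lambda l, r: l == "Hebrew_Letter" and r == "Single_Quote"),
    ("8.0", "×", lambda l, r: l == "Numeric" and r == "Numeric"),
    ("9.0", "×", lambda l, r: l in AHLETTER and r == "Numeric"),
    ("10.0", "×", lambda l, r: l == "Numeric" and r in AHLETTER),
    ("13.0", "×", lambda l, r: l == "Katakana" and r == "Katakana"),
    ("13.1", "×", lambda l, r: l in JOINABLE | {"ExtendNumLet"} and r == "ExtendNumLet"),
    ("13.2", "×", lambda l, r: l == "ExtendNumLet" and r in JOINABLE),
    ("15.0", "×", lambda l, r: l == "RI" and r == "RI"),  # assumes the pair starts the text
    ("999.0", "÷", lambda l, r: True),
]


def pair_status(left, right):
    """Return (status, rule number) for the position between two properties."""
    for number, status, applies in PAIR_RULES:
        if applies(left, right):
            return status, number
    raise AssertionError("unreachable: rule 999.0 always applies")


print(pair_status("CR", "LF"))                 # ('×', '3.0')
print(pair_status("ExtendNumLet", "ALetter"))  # ('×', '13.2')
print(pair_status("ALetter", "MidLetter"))     # ('÷', '999.0')
```

The last call shows why the sequence rows are needed: ALetter followed by MidLetter alone yields a break, and only a following letter (rule 6.0) can suppress it.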
This section shows the rules. They are mechanically modified for programmatic generation of the tables and test code, and thus do not match the UAX rules precisely. In particular:
For the original rules and the macro values they use, see UAX #29.
Rule | Left context | Operator | Right context |
---|---|---|---|
0.2 | sot | ÷ | |
0.3 | | ÷ | eot |
3.0 | CR | × | LF |
3.1 | (Newline \| CR \| LF) | ÷ | |
3.2 | | ÷ | (Newline \| CR \| LF) |
3.3 | ZWJ | × | ExtPict |
3.4 | WSegSpace | × | WSegSpace |
4.0 | (?<X>[^CR LF Newline]) (Extend \| Format \| ZWJ)* | → | {X} |
5.0 | AHLetter | × | AHLetter |
6.0 | AHLetter | × | (MidLetter \| MidNumLetQ) AHLetter |
7.0 | AHLetter (MidLetter \| MidNumLetQ) | × | AHLetter |
7.1 | Hebrew_Letter | × | Single_Quote |
7.2 | Hebrew_Letter | × | Double_Quote Hebrew_Letter |
7.3 | Hebrew_Letter Double_Quote | × | Hebrew_Letter |
8.0 | Numeric | × | Numeric |
9.0 | AHLetter | × | Numeric |
10.0 | Numeric | × | AHLetter |
11.0 | Numeric (MidNum \| MidNumLetQ) | × | Numeric |
12.0 | Numeric | × | (MidNum \| MidNumLetQ) Numeric |
13.0 | Katakana | × | Katakana |
13.1 | (AHLetter \| Numeric \| Katakana \| ExtendNumLet) | × | ExtendNumLet |
13.2 | ExtendNumLet | × | (AHLetter \| Numeric \| Katakana) |
15.0 | ^ (RI RI)* RI | × | RI |
16.0 | [^RI] (RI RI)* RI | × | RI |
999.0 | | ÷ | Any |
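
To show how the rules combine over a whole sequence, here is a minimal sketch that works on a list of property names rather than on real characters. It is an illustration under stated assumptions, not the reference implementation: `word_boundaries` and `joins` are hypothetical names, every element carries a single property (`ExtPict` stands in for Extended_Pictographic), and the macros AHLetter and MidNumLetQ (MidNumLet or Single_Quote) are taken from UAX #29. Rule 4.0 is modelled by folding Extend, Format and ZWJ into the preceding unit unless that unit is a hard line break; rules 15.0 and 16.0 are modelled by counting the run of RI units before the position.

```python
AHLETTER = {"ALetter", "Hebrew_Letter"}                    # macro from UAX #29
MID_LETTER_Q = {"MidLetter", "MidNumLet", "Single_Quote"}  # MidLetter | MidNumLetQ
MID_NUM_Q = {"MidNum", "MidNumLet", "Single_Quote"}        # MidNum | MidNumLetQ
IGNORED = {"Extend", "Format", "ZWJ"}
HARD = {"CR", "LF", "Newline"}
JOINABLE = AHLETTER | {"Numeric", "Katakana"}


def joins(a2, a, b, b2, ri_run):
    """True if one of rules 5.0 to 16.0 forbids a break between units a and b."""
    return (
        (a in AHLETTER and b in AHLETTER)                                    # 5.0
        or (a in AHLETTER and b in MID_LETTER_Q and b2 in AHLETTER)          # 6.0
        or (a2 in AHLETTER and a in MID_LETTER_Q and b in AHLETTER)          # 7.0
        or (a == "Hebrew_Letter" and b == "Single_Quote")                    # 7.1
        or (a == "Hebrew_Letter" and b == "Double_Quote"
            and b2 == "Hebrew_Letter")                                       # 7.2
        or (a2 == "Hebrew_Letter" and a == "Double_Quote"
            and b == "Hebrew_Letter")                                        # 7.3
        or (a == "Numeric" and b == "Numeric")                               # 8.0
        or (a in AHLETTER and b == "Numeric")                                # 9.0
        or (a == "Numeric" and b in AHLETTER)                                # 10.0
        or (a2 == "Numeric" and a in MID_NUM_Q and b == "Numeric")           # 11.0
        or (a == "Numeric" and b in MID_NUM_Q and b2 == "Numeric")           # 12.0
        or (a == "Katakana" and b == "Katakana")                             # 13.0
        or (a in (JOINABLE | {"ExtendNumLet"}) and b == "ExtendNumLet")      # 13.1
        or (a == "ExtendNumLet" and b in JOINABLE)                           # 13.2
        or (a == "RI" and b == "RI" and ri_run % 2 == 1)                     # 15.0, 16.0
    )


def word_boundaries(props):
    """Return break offsets (0..len) for a list of Word_Break property names."""
    n = len(props)
    if n == 0:
        return []
    # Rule 4.0: Extend/Format/ZWJ are absorbed unless they follow a hard break.
    units = [i for i, p in enumerate(props)
             if not (p in IGNORED and i > 0 and props[i - 1] not in HARD)]
    breaks = [0]                                   # 0.2: break at start of text
    ri_run = 0                                     # consecutive RI units before the position
    for k in range(1, len(units)):
        i = units[k]
        left_raw, b = props[i - 1], props[i]       # raw neighbours for rules 3.x
        a = props[units[k - 1]]
        a2 = props[units[k - 2]] if k >= 2 else None
        b2 = props[units[k + 1]] if k + 1 < len(units) else None
        ri_run = ri_run + 1 if a == "RI" else 0
        if left_raw == "CR" and b == "LF":         # 3.0
            continue
        if left_raw in HARD or b in HARD:          # 3.1, 3.2
            breaks.append(i)
            continue
        if left_raw == "ZWJ" and b == "ExtPict":   # 3.3
            continue
        if left_raw == "WSegSpace" and b == "WSegSpace":  # 3.4
            continue
        if not joins(a2, a, b, b2, ri_run):        # otherwise 999.0
            breaks.append(i)
    breaks.append(n)                               # 0.3: break at end of text
    return breaks


print(word_boundaries(["ALetter", "MidLetter", "ALetter"]))  # [0, 3]
print(word_boundaries(["ALetter", "MidLetter", "Numeric"]))  # [0, 1, 2, 3]
print(word_boundaries(["RI", "RI", "RI", "ALetter"]))        # [0, 2, 3, 4]
```

Because every rule from 5.0 to 16.0 only suppresses a break, their relative order does not change the outcome here, which is why `joins` can test them in a single expression. The rule number responsible for each position is not tracked in this sketch.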
The following samples illustrate the application of the rules. The blue lines indicate possible break points. If your browser supports titles (tooltips), then positioning the mouse over each character will show its name, while positioning between characters shows the number of the rule responsible for the break-status.
No. | Sample |
---|---|
1 | □ □ a □ ◌̈ |
2 | a ◌̈ |
3 | □ ن |
4 | ن □ |
5 | ٱ ل ر ◌َ ◌ّ ح ◌ِ ي م ◌ِ □ ١ |
6 | ܡ ܙ ܡ ܘ ܪ ܐ □ ܝ ܗ |
7 | ܬ □ ܫ ܒ ܘ |
8 | A A A |
9 | A : A |
10 | A : : A |
11 | א ' |
12 | א " א |
13 | A 0 0 A |
14 | 0 , 0 |
15 | 0 , , 0 |
16 | 〱 〱 |
17 | A _ 0 _ 〱 _ |
18 | A _ _ A |
19 | 🇦 🇧 🇨 b |
20 | a 🇦 🇧 🇨 b |
21 | a 🇦 🇧 □ 🇨 b |
22 | a 🇦 □ 🇧 🇨 b |
23 | a 🇦 🇧 🇨 🇩 b |
24 | 👶 🏿 👶 |
25 | 🛑 □ 🛑 |
26 | a □ 🛑 |
27 | ✁ □ ✁ |
28 | a □ ✁ |
29 | 👶 🏿 ◌̈ □ 👶 🏿 |
30 | 🛑 🏿 |
31 | □ 🛑 🏿 |
32 | □ 🛑 |
33 | □ 🛑 |
34 | 🛑 🛑 |
35 | a ◌̈ □ ◌̈ b |
36 | a b |
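
Test cases such as those in WordBreakTest.txt can be loaded with a few lines of code. The sketch below assumes the usual layout of the Unicode segmentation test files: each data line alternates ÷ or × marks with hexadecimal code points, and everything after `#` is a comment. The function name `parse_test_line` is hypothetical, and the example line is written by hand in that format for sample 2 (a followed by U+0308) rather than copied from the file.

```python
def parse_test_line(line):
    """Parse one test line into (text, break offsets in code points).

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    tokens = data.split()
    marks = tokens[0::2]        # ÷ or × before, between and after code points
    code_points = tokens[1::2]  # hexadecimal code points
    if len(marks) != len(code_points) + 1:
        raise ValueError(f"malformed test line: {line!r}")
    text = "".join(chr(int(cp, 16)) for cp in code_points)
    breaks = [offset for offset, mark in enumerate(marks) if mark == "÷"]
    return text, breaks


text, breaks = parse_test_line("÷ 0061 × 0308 ÷\t# hand-written example")
print([f"U+{ord(c):04X}" for c in text])  # ['U+0061', 'U+0308']
print(breaks)                             # [0, 2]
```

The offsets count code points, not bytes or UTF-16 code units, so they may need converting before being compared with the output of a segmenter that reports positions in another unit.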