Introduction
Basic Text is a subset of Unicode intending to simplify plain text use cases, so that consumers don't need to worry about control codes, deprecated scalar values, newline conventions, or other complexities.
This book contains the current specification draft and supporting documentation. There is also a GitHub repo containing a prototype implementation.
Background
Plain text is an intuitive concept that plays an important role in computing.
Intuitively, plain text content shouldn't have side effects. It's just text. However, Unicode contains multiple sets of control codes which effectively form a bytecode language, with a variety of loosely-defined and often non-standardized side effects. Historically, text content and in-band control codes were often mixed together in the same encoding standards, and now, Unicode itself must maintain compatibility with those standards. And, many of them continue to be recognized, particularly in virtual terminals, such that one must be careful about even displaying text from untrusted sources, for example in CVE-2017-10906, CVE-2019-8325, and others.
Other than a few control codes for line endings and a few other things, most of these control codes are obsolete and almost never used for plain text use cases. For cases where we want to work with plain text, and be sure it matches our intuitive sense of what text is, it would be useful to have a checkable subset of Unicode, which excludes problematic control characters, and other things that are not necessary for modern practical plain text use cases.
And if we're defining a Unicode subset for plain text, we also have an opportunity to rationalize line endings so that users don't have to think about the old CRLF vs LF problem. And we can restrict scalar values that Unicode itself deprecates or discourages, but which Unicode itself can't drop because of its need for round-trip compatibility with other character sets, so that consumers that only need to work with Unicode have fewer things to think about.
There are existing standard subsets of Unicode doing similar things, such as PRECIS FreeformClass, printable files in POSIX, and others; however, they either restrict scalar values that are frequently used in "plain text" content, or don't restrict deprecated scalar values.
Basic Text is a subset of Unicode aiming to make it as simple as possible (but no simpler) to work with plain text. It is not yet standardized anywhere, and may evolve, but it is usable for many purposes.
Development
Development of these pages, as well as a prototype implementation, is hosted in the GitHub repo.
Restricted Text
Basic Text can still be visually ambiguous. There are numerous ways that two different scalar value sequences can have identical or similar appearances, such as scalar values representing Cyrillic and Latin letters.
There are several techniques for mitigating visual ambiguities, but some of them are too restrictive for general-purpose plain text, and thus too restrictive for Basic Text.
So in addition to Basic Text, this book also describes a hypothetical format called Restricted Text, which collects such mitigation techniques, and could in theory be developed into an actual format in the future.
Formats
This site discusses three formats.
- Unicode: Any sequence of Unicode Scalar Values, encoded with a standard Unicode encoding form.
- Basic Text: This is intended to realize the intuitive phrases "text" or "plain text" which are used in many contexts. It excludes control characters and other content unnecessary or impractical for text.
- Terminal Support: This describes a language supporting ANSI-style terminal codes that can be layered on top of Basic Text.
- Restricted Text: Restricted Text is a hypothetical format at this time. Like Basic Text, but aims to reduce visual ambiguity by trading off some support for historical scripts, multiple-script text, formatting, and symbols.
The background information contains rationale and source information.
Unicode
Here, the Unicode format is just a sequence of Unicode Scalar Values.
Unicode permits control codes and other non-textual content; see Basic Text for a subset focused on textual content.
Currently Basic Text is based on Unicode 15.0.
Definitions
A string is in Unicode form iff:
- it encodes a sequence of Unicode Scalar Values.
A stream is in Unicode form iff:
- it consists entirely of a string in Unicode form
Conversion
From byte sequence to Unicode string
To convert a byte sequence into a Unicode string in a manner that always succeeds but potentially loses information about invalid encodings:
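The individual steps are not reproduced here. As a rough illustration only, and assuming the input is meant to be UTF-8 (an assumption, not something stated above), Python's built-in decoder with errors="replace" always succeeds and substitutes U+FFFD for invalid input. The name lossy_decode is hypothetical.

```python
def lossy_decode(data: bytes) -> str:
    """Decode bytes as UTF-8 (assumed encoding), never raising.

    Invalid byte sequences are replaced with U+FFFD, so information
    about the exact invalid bytes is lost.
    """
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The stray 0xFF byte is not valid UTF-8 and becomes U+FFFD.
    print(repr(lossy_decode(b"ab\xffcd")))
```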
Basic Text
The Basic Text format is a subset of the Unicode format and meant to fulfill common notions of "plain text". It is not yet standardized anywhere, and may evolve, but it is usable for many purposes.
Basic Text permits homoglyphs and other visual ambiguities; see Restricted Text for an alternative which might provide some mitigations.
For rationale and background information, see Background. For a prototype implementation, see the GitHub repo.
Definitions
A string is in Basic Text form iff:
- it is a Unicode string in Stream-Safe NFC form, and
- it doesn't start with a Basic Text non-starter, and
- it doesn't end with a Basic Text non-ender, and
- all scalar values with a General_Category of Unassigned are preceded and followed by U+34F, and
- it doesn't contain any of the sequences listed in the Sequence Table.
A stream is in Basic Text form iff:
- it consists entirely of a string in Basic Text form, and
- it is empty or ends with U+A.
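A partial check of these definitions might look like the sketch below. It covers only NFC form, isolation of unassigned scalar values, and the stream's trailing U+A; the Stream-Safe, non-starter, non-ender, and Sequence Table conditions are left out. The function names are hypothetical, and Python's unicodedata module follows whatever Unicode version the interpreter ships with, which may differ from the version Basic Text targets. Its "Cn" category also includes noncharacters.

```python
import unicodedata

CGJ = "\u034f"  # U+34F COMBINING GRAPHEME JOINER


def unassigned_are_isolated(text: str) -> bool:
    """True if every 'Cn' scalar value has U+34F on both sides."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            before = text[i - 1] if i > 0 else ""
            after = text[i + 1] if i + 1 < len(text) else ""
            if before != CGJ or after != CGJ:
                return False
    return True


def passes_partial_string_checks(text: str) -> bool:
    """NFC and unassigned-isolation checks only (not the full definition)."""
    return unicodedata.is_normalized("NFC", text) and unassigned_are_isolated(text)


def passes_partial_stream_checks(stream: str) -> bool:
    """String checks plus: the stream is empty or ends with U+A."""
    return passes_partial_string_checks(stream) and (
        stream == "" or stream.endswith("\n")
    )
```

A string that passes these checks may still fail the full definition, so this is a pre-filter at best.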
Supplementary definitions
Basic Text non-starter
A Unicode scalar value is a Basic Text non-starter iff:
- it is a normalization-form non-starter, or
- its Grapheme_Cluster_Break is ZWJ, SpacingMark or Extend and it isn't U+34F.
Basic Text non-ender
A Unicode scalar value is a Basic Text non-ender iff:
- its Grapheme_Cluster_Break is ZWJ or Prepend.
Sequence Table
Sequence | aka | Replacement | Error |
---|---|---|---|
U+D U+A | CRLF | U+A | "Use U+A to terminate a line" |
U+D | CR | U+A | "Use U+A to terminate a line" |
[U+C]+ U+D U+A | | U+A | "Control code not valid in text" |
[U+C]+ U+A | | U+A | "Control code not valid in text" |
[U+C]+ U+D | | U+A | "Control code not valid in text" |
[U+C]+ | FF | U+20 | "Control code not valid in text" |
U+1B U+5B [U+20–U+3F]* U+6D | SGR | | "Color escape sequences are not enabled" |
[U+1B]+ U+5B U+5B [U+–U+7F]? | | | "Unrecognized escape sequence" |
[U+1B]+ U+5B [U+20–U+3F]* [U+40–U+7E]? | CSI | | "Unrecognized escape sequence" |
[U+1B]+ U+5D [^U+7,U+18,U+1B]* [U+7,U+18]? | OSC | | "Unrecognized escape sequence" |
[U+1B]+ [U+40–U+7E] | ESC | | "Unrecognized escape sequence" |
[U+1B]+ | ESC | | "Escape code not valid in text" |
[U+0–U+8,U+B,U+E–U+1F] | C0 | U+FFFD | "Control code not valid in text" |
U+7F | DEL | U+FFFD | "Control code not valid in text" |
U+85 | NEL | U+20 | "Control code not valid in text" |
[U+80–U+84,U+86–U+9F] | C1 | U+FFFD | "Control code not valid in text" |
U+149 | ʼn | U+2BC U+6E | "Use U+2BC U+6E instead of U+149" |
U+673 | ا ٟ | U+627 U+65F | "Use U+627 U+65F instead of U+673" |
U+9E4 | । | U+FFFD | "Use U+964 instead of U+9E4" |
U+9E5 | ॥ | U+FFFD | "Use U+965 instead of U+9E5" |
U+A64 | । | U+FFFD | "Use U+964 instead of U+A64" |
U+A65 | ॥ | U+FFFD | "Use U+965 instead of U+A65" |
U+AE4 | । | U+FFFD | "Use U+964 instead of U+AE4" |
U+AE5 | ॥ | U+FFFD | "Use U+965 instead of U+AE5" |
U+B64 | । | U+FFFD | "Use U+964 instead of U+B64" |
U+B65 | ॥ | U+FFFD | "Use U+965 instead of U+B65" |
U+BE4 | । | U+FFFD | "Use U+964 instead of U+BE4" |
U+BE5 | ॥ | U+FFFD | "Use U+965 instead of U+BE5" |
U+C64 | । | U+FFFD | "Use U+964 instead of U+C64" |
U+C65 | ॥ | U+FFFD | "Use U+965 instead of U+C65" |
U+CE4 | । | U+FFFD | "Use U+964 instead of U+CE4" |
U+CE5 | ॥ | U+FFFD | "Use U+965 instead of U+CE5" |
U+D64 | । | U+FFFD | "Use U+964 instead of U+D64" |
U+D65 | ॥ | U+FFFD | "Use U+965 instead of U+D65" |
U+F77 | ◌ྲ◌ཱྀ | U+FB2 U+F71 U+F80 | "Use U+FB2 U+F71 U+F80 instead of U+F77" |
U+F79 | ◌ླ◌ཱྀ | U+FB3 U+F71 U+F80 | "Use U+FB3 U+F71 U+F80 instead of U+F79" |
U+17A3 | អ | U+17A2 | "Use U+17A2 instead of U+17A3" |
U+17A4 | អា | U+17A2 U+17B6 | "Use U+17A2 U+17B6 instead of U+17A4" |
U+17B4 | | U+FFFD | "Omit U+17B4" |
U+17B5 | | U+FFFD | "Omit U+17B5" |
U+17D8 | | U+FFFD | "Spell beyyal with normal letters" |
U+2028 | LS | U+20 | "Line separation is a rich-text function" |
U+2029 | PS | U+20 | "Paragraph separation is a rich-text function" |
U+202A | LRE | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+202B | RLE | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+202C | | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+202D | LRO | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+202E | RLO | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+2066 | LRI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+2067 | RLI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+2068 | FSI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
U+2069 | PDI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported" |
[U+206A–U+206F] | | U+FFFD | "Deprecated Format Characters are deprecated" |
U+2072 | ² | U+FFFD | "Use U+B2 instead of U+2072" |
U+2073 | ³ | U+FFFD | "Use U+B3 instead of U+2073" |
U+2126 | Ω | U+3A9 | "Use U+3A9 instead of U+2126" |
U+212A | K | U+4B | "Use U+4B instead of U+212A" |
U+212B | Å | U+C5 | "Use U+C5 instead of U+212B" |
U+2329 | ⟨ | U+FFFD | "Use U+27E8 instead of U+2329" |
U+232A | ⟩ | U+FFFD | "Use U+27E9 instead of U+232A" |
U+2DF5 | ⷭⷮ | U+2DED U+2DEE | "Use U+2DED U+2DEE instead of U+2DF5" |
[U+F900–U+FA0D,U+FA10,U+FA12,U+FA15–U+FA1E,U+FA20,U+FA22,U+FA25–U+FA26,U+FA2A–U+FA6D,U+FA70–U+FAD9] | | CJK compatibility ideograph Standardized Variant | "Use Standardized Variants instead of CJK Compatibility Ideographs" |
U+FB00 | ff | U+66 U+66 | "Use U+66 U+66 instead of U+FB00" |
U+FB01 | fi | U+66 U+69 | "Use U+66 U+69 instead of U+FB01" |
U+FB02 | fl | U+66 U+6C | "Use U+66 U+6C instead of U+FB02" |
U+FB03 | ffi | U+66 U+66 U+69 | "Use U+66 U+66 U+69 instead of U+FB03" |
U+FB04 | ffl | U+66 U+66 U+6C | "Use U+66 U+66 U+6C instead of U+FB04" |
U+FB05 | ſt | U+17F U+74 | "Use U+17F U+74 instead of U+FB05" |
U+FB06 | st | U+73 U+74 | "Use U+73 U+74 instead of U+FB06" |
[U+FDD0–U+FDEF] | | U+FFFD | "Noncharacters are intended for internal use only" |
U+FEFF | BOM | U+2060 | "U+FEFF is not necessary in Basic Text" |
[U+FFF9–U+FFFB] | | U+FFFD | "Interlinear Annotations depend on out-of-band information" |
U+FFFC | ORC | U+FFFD | "U+FFFC depends on out-of-band information" |
[U+FFFE,U+FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
U+1D455 | ℎ | U+FFFD | "Use U+210E instead of U+1D455" |
U+1D49D | ℬ | U+FFFD | "Use U+212C instead of U+1D49D" |
U+1D4A0 | ℰ | U+FFFD | "Use U+2130 instead of U+1D4A0" |
U+1D4A1 | ℱ | U+FFFD | "Use U+2131 instead of U+1D4A1" |
U+1D4A3 | ℋ | U+FFFD | "Use U+210B instead of U+1D4A3" |
U+1D4A4 | ℐ | U+FFFD | "Use U+2110 instead of U+1D4A4" |
U+1D4A7 | ℒ | U+FFFD | "Use U+2112 instead of U+1D4A7" |
U+1D4A8 | ℳ | U+FFFD | "Use U+2133 instead of U+1D4A8" |
U+1D4AD | ℛ | U+FFFD | "Use U+211B instead of U+1D4AD" |
U+1D4BA | ℯ | U+FFFD | "Use U+212F instead of U+1D4BA" |
U+1D4BC | ℊ | U+FFFD | "Use U+210A instead of U+1D4BC" |
U+1D4C4 | ℴ | U+FFFD | "Use U+2134 instead of U+1D4C4" |
U+1D506 | ℭ | U+FFFD | "Use U+212D instead of U+1D506" |
U+1D50B | ℌ | U+FFFD | "Use U+210C instead of U+1D50B" |
U+1D50C | ℑ | U+FFFD | "Use U+2111 instead of U+1D50C" |
U+1D515 | ℜ | U+FFFD | "Use U+211C instead of U+1D515" |
U+1D51D | ℨ | U+FFFD | "Use U+2128 instead of U+1D51D" |
U+1D53A | ℂ | U+FFFD | "Use U+2102 instead of U+1D53A" |
U+1D53F | ℍ | U+FFFD | "Use U+210D instead of U+1D53F" |
U+1D545 | ℕ | U+FFFD | "Use U+2115 instead of U+1D545" |
U+1D547 | ℙ | U+FFFD | "Use U+2119 instead of U+1D547" |
U+1D548 | ℚ | U+FFFD | "Use U+211A instead of U+1D548" |
U+1D549 | ℝ | U+FFFD | "Use U+211D instead of U+1D549" |
U+1D551 | ℤ | U+FFFD | "Use U+2124 instead of U+1D551" |
[U+1FFFE,U+1FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
U+111C4 | 𑆏𑆀 | U+1118F U+11180 | "Use U+1118F U+11180 instead of U+111C4" |
[U+2F800–U+2FA1D] | | CJK compatibility ideograph Standardized Variant | "Use Standardized Variants instead of CJK Compatibility Ideographs" |
[U+2FFFE,U+2FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+3FFFE,U+3FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+4FFFE,U+4FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+5FFFE,U+5FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+6FFFE,U+6FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+7FFFE,U+7FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+8FFFE,U+8FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+9FFFE,U+9FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+AFFFE,U+AFFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+BFFFE,U+BFFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+CFFFE,U+CFFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+DFFFE,U+DFFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
U+E0001 | | U+FFFD | "Language tagging is a deprecated mechanism" |
[U+EFFFE,U+EFFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+FFFFE,U+FFFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
[U+10FFFE,U+10FFFF] | | U+FFFD | "Noncharacters are intended for internal use only" |
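To illustrate how the Replacement column could be applied, here is a minimal sketch covering only a handful of rows. It assumes that rows are tried in table order at each position, which is why the longer line-ending sequences come first. The escape-sequence rows and most other rows are left out, and the names SAMPLE_RULES and apply_sample_replacements are hypothetical.

```python
import re

# A few Sequence Table rows as (regex, replacement), kept in table order.
# Escape-sequence rows are intentionally omitted, so U+1B is excluded
# from the C0 class below.
SAMPLE_RULES = [
    (r"\r\n", "\n"),                                   # CRLF
    (r"\r", "\n"),                                     # CR
    (r"\f+\r\n", "\n"),                                # FF run before CRLF
    (r"\f+\n", "\n"),                                  # FF run before LF
    (r"\f+\r", "\n"),                                  # FF run before CR
    (r"\f+", " "),                                     # FF
    (r"[\x00-\x08\x0b\x0e-\x1a\x1c-\x1f]", "\ufffd"),  # C0 (without U+1B)
    (r"\x7f", "\ufffd"),                               # DEL
    (r"\x85", " "),                                    # NEL
    (r"\u2126", "\u03a9"),                             # OHM SIGN
    (r"\ufb01", "fi"),                                 # U+FB01 ligature
]

# One alternation; at each position the first matching alternative wins.
COMBINED = re.compile("|".join(f"({pattern})" for pattern, _ in SAMPLE_RULES))


def apply_sample_replacements(text: str) -> str:
    """Apply the sample rows above in a single left-to-right pass."""
    def substitute(match):
        # lastindex is the 1-based number of the alternative that matched.
        return SAMPLE_RULES[match.lastindex - 1][1]

    return COMBINED.sub(substitute, text)
```

A single pass avoids re-scanning text that an earlier replacement produced, for example the U+A that replaces CRLF.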
Conversion
From Unicode string to Basic Text string
To convert a Unicode string into a Basic Text string in a manner that always succeeds, discarding information not usually considered meaningful or valid in plain text:
- If the string starts with a Basic Text non-starter, prepend U+34F.
- If the string ends with a Basic Text non-ender, append U+34F.
- When NEL Compatibility is enabled, replace any U+85 with U+A.
- When LSPS Compatibility is enabled, replace any U+2028 or U+2029 with U+A.
- Perform the Replacement actions from the Sequence Table.
- For any scalar value with General_Category of Unassigned that isn't already preceded by U+34F, insert U+34F before it.
- For any scalar value with General_Category of Unassigned that isn't already followed by U+34F, insert U+34F after it.
- Perform the Stream-Safe Text Process (UAX15-D4).
- Perform toNFC with the Normalization Process.
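The last steps of this list, isolating unassigned scalar values with U+34F and then normalizing, can be sketched as follows. This is only an approximation: the Stream-Safe Text Process is skipped, Python's plain NFC normalization stands in for the Normalization Process referenced above, and "unassigned" is judged by the Unicode version bundled with the interpreter. The function names are hypothetical.

```python
import unicodedata

CGJ = "\u034f"  # U+34F COMBINING GRAPHEME JOINER


def isolate_unassigned(text: str) -> str:
    """Ensure each 'Cn' scalar value has U+34F before and after it."""
    out = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            # Insert before, unless the output already ends with U+34F.
            if not out or out[-1] != CGJ:
                out.append(CGJ)
            out.append(ch)
            # Insert after, unless the input already supplies U+34F.
            if i + 1 >= len(text) or text[i + 1] != CGJ:
                out.append(CGJ)
        else:
            out.append(ch)
    return "".join(out)


def isolate_and_normalize(text: str) -> str:
    """Isolation followed by NFC (Stream-Safe step omitted in this sketch)."""
    return unicodedata.normalize("NFC", isolate_unassigned(text))
```

Checking the output buffer for a preceding U+34F means two adjacent unassigned scalar values share a single joiner between them, and running the function twice changes nothing.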
Options
The following options may be enabled:
Name | Type | Default |
---|---|---|
NEL Compatibility | Boolean | false |
LSPS Compatibility | Boolean | false |
From Unicode string to Basic Text string, strict
To convert a Unicode string into a Basic Text string in a manner that discards information not usually considered meaningful and otherwise fails if the content is not valid Basic Text:
- If the string starts with a Basic Text non-starter, error with "Basic Text string must not begin with Basic Text non-starter".
- If the string ends with a Basic Text non-ender, error with "Basic Text string must not end with Basic Text non-ender".
- Perform the Error actions from the Sequence Table.
- When CRLF Compatibility is enabled, replace any U+A with U+D U+A.
- For any scalar value with General_Category of Unassigned that isn't already preceded by U+34F, insert U+34F before it.
- For any scalar value with General_Category of Unassigned that isn't already followed by U+34F, insert U+34F after it.
- Perform the Stream-Safe Text Process (UAX15-D4).
- Perform toNFC with the Normalization Process.
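For the strict conversion, the Error column takes the place of the Replacement column. A minimal sketch of that idea, using only a few rows and reporting errors as Python exceptions (a choice made for this example, not something the text prescribes), might be:

```python
import re

# A few Sequence Table rows as (regex, error message). The messages are
# copied from the table; the selection of rows is arbitrary.
SAMPLE_ERRORS = [
    (r"\r\n", "Use U+A to terminate a line"),
    (r"\r", "Use U+A to terminate a line"),
    (r"\f+", "Control code not valid in text"),
    (r"\x7f", "Control code not valid in text"),
    (r"\ufb01", "Use U+66 U+69 instead of U+FB01"),
    (r"\ufeff", "U+FEFF is not necessary in Basic Text"),
]


def check_sample_errors(text: str) -> None:
    """Raise ValueError with the table's message if a sample row matches."""
    for pattern, message in SAMPLE_ERRORS:
        if re.search(pattern, text):
            raise ValueError(message)
```

This hypothetical check_sample_errors reports the first matching rule rather than the first offending position; a fuller implementation would likely want the latter.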
Options
The following options may be enabled:
Name | Type | Default |
---|---|---|
CRLF Compatibility | Boolean | false |
From Unicode stream to Basic Text stream
To convert a Unicode stream into a Basic Text stream in a manner that always succeeds, discarding information not usually considered meaningful or valid in plain text:
- If the stream starts with U+FEFF, remove it.
- If the stream is non-empty and doesn't end with U+A or U+D, append a U+A.
- Perform From Unicode string to Basic Text string.
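The three steps above might be sketched like this, treating the stream as an in-memory string for simplicity. The string conversion is passed in as a hypothetical convert_string callable so that the sketch doesn't depend on any particular implementation of it.

```python
from typing import Callable


def to_basic_text_stream(stream: str, convert_string: Callable[[str], str]) -> str:
    """Lossy stream conversion: strip a leading BOM, ensure a final
    line ending, then delegate to the string conversion."""
    # Step 1: remove a leading U+FEFF.
    if stream.startswith("\ufeff"):
        stream = stream[1:]
    # Step 2: a non-empty stream must end with U+A or U+D.
    if stream and not stream.endswith(("\n", "\r")):
        stream += "\n"
    # Step 3: the string conversion (which turns a trailing U+D into U+A).
    return convert_string(stream)
```

Removing the BOM first matters because the string conversion would otherwise replace it with U+2060.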
From Unicode stream to Basic Text stream, strict
To convert a Unicode stream into a Basic Text stream in a manner that discards information not usually considered meaningful and otherwise fails if the content is not valid Basic Text:
- When BOM Compatibility is enabled, insert a U+FEFF at the beginning of the stream.
- If the stream is non-empty and doesn't end with U+A or U+D, error with "Basic Text stream must be empty or end with newline".
- Perform From Unicode string to Basic Text string, strict.
Options
The following options may be enabled:
Name | Type | Default |
---|---|---|
BOM Compatibility | Boolean | false |
Terminal Support
This document describes extensions to Basic Text adding ANSI-style terminal features. It is experimental.
Terminal Output
Terminal Output uses Basic Text's "strict" conversions.
Output Feature Sets
Terminal output features are grouped into sets, which can be supported independently or in combination:
- Line-oriented output, e.g. for readline
- Full-screen output, e.g. for vim
- Color, e.g. for color ls
- Custom Title, e.g. for shell command prompts
Line-oriented output
This feature adds line-oriented editing features.
The following control codes are recognized:
Code | Meaning |
---|---|
U+7 | Alert |
U+8 | Move cursor back one column |
U+9 | Tab |
U+A | End of line |
U+C | FF Terminal Compatibility |
U+D | Carriage Return |
U+7F | No Effect |
The following escape sequences are recognized:
Sequence | Meaning |
---|---|
␛[K | Clear to end of line |
␛[0K | Clear to end of line |
␛[2K | Clear entire line |
Alert
Produce an acoustic indication or a visual indication if possible, without modifying the state of the terminal.
Move cursor back one column
Move the cursor back one column, but not past the first column.
Tab
Move the cursor forward to the next tab stop.
End of line
Move the cursor to the beginning of the next line, scrolling the output if needed.
FF Terminal Compatibility
Move the cursor to the next line without changing the column.
Carriage Return
Move the cursor to the first column of the current line.
No Effect
Leave the state of the terminal unmodified.
Clear to end of line
Clear to the end of the line, leaving the cursor where it is.
Clear entire line
Clear the entire line, and move the cursor to the first column.
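As an example of how an application might combine these codes, the sketch below redraws a single status line in place: Carriage Return moves to the first column, the new text overwrites the old, and "Clear to end of line" erases any leftover characters. The function name redraw_status is hypothetical.

```python
ESC = "\x1b"  # U+1B


def redraw_status(text: str) -> str:
    """Build the output that replaces the current line's content with text."""
    # U+D: cursor to first column; ESC [ K: clear to end of line.
    return "\r" + text + ESC + "[K"


if __name__ == "__main__":
    import sys
    import time

    for percent in (10, 50, 100):
        sys.stdout.write(redraw_status(f"progress: {percent}%"))
        sys.stdout.flush()
        time.sleep(0.2)
    sys.stdout.write("\n")
```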
Full-screen output
This feature set adds a "full screen" mode which may be enabled at runtime, which supports two-dimensional cursor positioning, scrolling, screen clearing, and related features.
In the default mode, the following escape sequences are recognized:
Sequence | Meaning | Notes |
---|---|---|
␛[?1049h | Enter full-screen mode, with a clear screen and default settings | |
The following escape sequences are recognized within full-screen mode:
Sequence | Meaning | Notes |
---|---|---|
␛7 | save_cursor | TODO: Do we need this? |
␛8 | restore_cursor | Ditto |
␛H | set_tab | Ditto |
␛M | scroll_reverse | TODO: Is this different on Windows? |
␛[«n»@ | parm_ich(«n») | «n» may be omitted and defaults to 1 |
␛[«n»A | parm_up_cursor(«n») | Ditto |
␛[«n»B | parm_down_cursor(«n») | Ditto |
␛[«n»C | parm_right_cursor(«n») | Ditto |
␛[«n»D | parm_left_cursor(«n») | Ditto |
␛[«n»G | column_address(«n») | Ditto |
␛[«line»;«column»H | cursor_address(«line», «column») | «line»;«column» may be omitted and default to 1;1 |
␛[«n»I | tab(«n») | «n» may be omitted and defaults to 1 |
␛[0J | clr_eos | The 0 is optional |
␛[1J | Clear the screen from the beginning to the current cursor position | |
␛[2J | Clear the screen | Unlike clear_screen, this doesn't change the cursor position |
␛[«n»L | insert_line(«n») | «n» may be omitted and defaults to 1 |
␛[«n»M | parm_delete_line(«n») | Ditto |
␛[«n»P | parm_dch(«n») | Ditto |
␛[«n»S | parm_index(«n») | Ditto |
␛[«n»T | parm_rindex(«n») | Ditto |
␛[«n»X | erase_chars(«n») | Ditto |
␛[«n»Z | cbt(«n») | Ditto |
␛[«n»d | row_address(«n») | Ditto |
␛[«line»;«column»f | Same as the similar sequence ending in H | |
␛[3g | clear_all_tabs | TODO: do we need this? |
␛[?25h | cursor_visible | |
␛[?1049h | Clear the screen and reset full-screen settings to defaults | |
␛[?2004h | Enable bracketed paste mode | |
␛[?25l | cursor_invisible | |
␛[?1049l | Exit full-screen mode and restore the terminal to its prior state | |
␛[?2004l | Disable bracketed paste mode | |
␛[!p | Reset the terminal to default settings, without clearing the screen | |
␛[«top»;«bottom»r | change_scroll_region(«top», «bottom») | «top»;«bottom» may be omitted and default to 1;«viewport-height» |
TODO: Describe the behavior on the rightmost column and bottom-most line, and other traditionally underspecified things.
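A few of these sequences can be wrapped in small helpers, as in the sketch below. The helper names are hypothetical, and since the parameter syntax is still marked TODO, the code simply formats decimal integers.

```python
ESC = "\x1b"  # U+1B

ENTER_FULL_SCREEN = ESC + "[?1049h"
EXIT_FULL_SCREEN = ESC + "[?1049l"
CLEAR_SCREEN = ESC + "[2J"


def cursor_address(line: int = 1, column: int = 1) -> str:
    """ESC [ line ; column H, with 1-based coordinates."""
    return f"{ESC}[{line};{column}H"


def draw_at(line: int, column: int, text: str) -> str:
    """Position the cursor, then emit text at that position."""
    return cursor_address(line, column) + text


def full_screen_session(body: str) -> str:
    """Wrap body so the terminal's prior state is restored afterwards."""
    return ENTER_FULL_SCREEN + body + EXIT_FULL_SCREEN
```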
TODO: Describe parameters in more detail, including the syntax for numeric and string parameters, and min/max valid values for numeric parameters.
Color
This feature set adds color and display attributes such as bold, underline, and italics.
This feature defines the following escape sequences:
Sequence | Meaning | Notes |
---|---|---|
␛[…m | set_attributes(…) | Set text attributes; see below for the meaning of … |
␛[38;2;«r»;«g»;«b»m | Set foreground color to RGB «r», «g», «b» | Values are from 0-255 |
␛[48;2;«r»;«g»;«b»m | Set background color to RGB «r», «g», «b» | Ditto |
In the … form above, the … may be replaced by up to 16 ;-separated sequences from the following:
Sequence | Meaning | Notes |
---|---|---|
0 | Normal (default) | |
1 | Bold | |
2 | Faint | Faint may not appear visually distinct on some platforms |
4 | Underlined | May be "simulated with color". Applications may wish to use U+332 instead. |
7 | Inverse | |
22 | Not bold or faint | |
23 | Not italicized | |
24 | Not underlined (any kind) | |
27 | Not inverse | |
29 | Not crossed-out | |
30 | Foreground Black | |
31 | Foreground Red | |
32 | Foreground Green | |
33 | Foreground Yellow | May appear brown on some platforms |
34 | Foreground Blue | |
35 | Foreground Magenta | |
36 | Foreground Cyan | |
37 | Foreground White | |
39 | Foreground default | |
40 | Background Black | |
41 | Background Red | |
42 | Background Green | |
43 | Background Yellow | |
44 | Background Blue | |
45 | Background Magenta | |
46 | Background Cyan | |
47 | Background White | |
49 | Background default | |
90 | Foreground bright Black | Bright colors may not appear visually distinct on some platforms |
91 | Foreground bright Red | |
92 | Foreground bright Green | |
93 | Foreground bright Yellow | |
94 | Foreground bright Blue | |
95 | Foreground bright Magenta | |
96 | Foreground bright Cyan | |
97 | Foreground bright White | |
100 | Background bright Black | |
101 | Background bright Red | |
102 | Background bright Green | |
103 | Background bright Yellow | |
104 | Background bright Blue | |
105 | Background bright Magenta | |
106 | Background bright Cyan | |
107 | Background bright White |
Not all terminals support all colors; when a requested color is unavailable, terminals may substitute the closest available color.
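Building these sequences is mostly string formatting. The sketch below enforces the limit of 16 attributes and the 0-255 range for RGB values. The function names are hypothetical, and requiring at least one attribute is a choice made for this example rather than something stated above.

```python
ESC = "\x1b"  # U+1B
MAX_ATTRIBUTES = 16


def set_attributes(*codes: int) -> str:
    """ESC [ code ; code ... m, with between 1 and 16 codes."""
    if not 1 <= len(codes) <= MAX_ATTRIBUTES:
        raise ValueError("expected between 1 and 16 attribute codes")
    return ESC + "[" + ";".join(str(code) for code in codes) + "m"


def foreground_rgb(r: int, g: int, b: int) -> str:
    """ESC [ 38 ; 2 ; r ; g ; b m, with each value in 0-255."""
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError("RGB values must be in the range 0-255")
    return f"{ESC}[38;2;{r};{g};{b}m"


if __name__ == "__main__":
    # Bold red text, then back to normal.
    print(set_attributes(1, 31) + "error" + set_attributes(0))
```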
Custom Title
This feature set adds the ability to set a custom window title.
This feature defines the following escape sequences:
Sequence | Meaning | Notes |
---|---|---|
␛]0;«string»␇ | Sets the terminal's title to «string» | Implementations may implicitly add a prefix and/or truncate the string |
␛]2;«string»␇ | Sets the terminal's title to «string» | Ditto |
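A title-setting helper could look like the sketch below. Rejecting control codes inside the title is an assumption made here so the string cannot end the sequence early or smuggle in further escape sequences. The name set_title is hypothetical.

```python
ESC = "\x1b"  # U+1B
BEL = "\x07"  # U+7


def set_title(title: str) -> str:
    """Build ESC ] 0 ; title BEL, refusing titles with control codes."""
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in title):
        raise ValueError("title must not contain control codes")
    return ESC + "]0;" + title + BEL
```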
Terminal input
Terminal Input uses Basic Text's normal (not "strict") conversions.
Most keys have obvious mappings to Unicode scalar value sequences. This section describes mapping for special keys read from a terminal.
Three modifiers are recognized: Ctrl, Alt, and Shift. In environments with Meta keys, Meta is mapped to Alt.
Terminal input control codes
The following control codes are recognized:
Code | Meaning | Notes |
---|---|---|
U+0 | Ctrl-Space | |
U+8 | Ctrl-H | Despite U+8 being historically called "backspace" in ASCII, it isn't the backspace key |
U+9 | Tab | |
U+A | Enter | U+A means "end of line" |
U+C | Ctrl-L | This is only transmitted in immediate mode, and requests applications refresh the screen |
U+11 | Ctrl-Q | When enabled in the terminal input mode |
U+13 | Ctrl-S | When enabled in the terminal input mode |
U+1B | Escape | When read in immediate input mode |
U+1C | Ctrl-\ | When enabled in the terminal input mode |
U+1D | Ctrl-] | |
U+1E | Ctrl-^ | |
U+1F | Ctrl-_ | |
U+7F | Backspace | This is the backspace key |
The following control codes are interpreted by the implementation and not passed on to applications:
Code | Commonly typed as | Behavior |
---|---|---|
U+3 | Ctrl-C | Terminate the program, when not enabled in the terminal input mode |
U+9 | Tab | No effect when modifiers include Alt |
U+D | Ctrl-M | Send U+A to the program, when read in a single input call in immediate input mode |
U+11 | Ctrl-Q | No effect when not enabled in the terminal input mode |
U+13 | Ctrl-S | No effect when not enabled in the terminal input mode |
U+1A | Ctrl-Z | Suspend the program |
U+1C | Ctrl-\ | Terminate the program, when not enabled in the terminal input mode |
U+60 | ` | No effect when modifiers include Alt |
Except as specified otherwise above, U+1 through U+1A are recognized as Ctrl-A through Ctrl-Z, respectively.
Codes with values U+0 through U+7F, except for U+5B ([) and U+5D (]), may be preceded by U+1B indicating the Alt modifier.
When a program is resumed from being suspended, any streams open in immediate input mode are passed a U+C (Ctrl-L). Applications are encouraged to interpret Ctrl-L as a command to redraw the screen.
Terminal input escape sequences
The following escape sequences are recognized when they are read as a single input call in immediate input mode:
Sequence | Meaning | Notes |
---|---|---|
␛[A | Up | |
␛[B | Down | |
␛[C | Right | |
␛[D | Left | |
␛[F | End | |
␛[H | Home | |
␛[1«m»A | Up | Same as above, but with modifiers |
␛[1«m»B | Down | Ditto |
␛[1«m»C | Right | Ditto |
␛[1«m»D | Left | Ditto |
␛[Z | Shift-Tab | |
␛[1«m»~ | Home | Same as above, but with modifiers |
␛[2«m?»~ | Insert | |
␛[3«m?»~ | Delete | |
␛[4«m»~ | End | Same as above, but with modifiers |
␛[5«m?»~ | Page Up | |
␛[6«m?»~ | Page Down | |
␛[11«m?»~ | F1 | These use the "old xterm"/CSI values, rather than vt102/vt220/SS3/Windows values |
␛[12«m?»~ | F2 | |
␛[13«m?»~ | F3 | |
␛[14«m?»~ | F4 | |
␛[15«m?»~ | F5 | |
␛[17«m?»~ | F6 | (yes, 16 really is skipped) |
␛[18«m?»~ | F7 | |
␛[19«m?»~ | F8 | |
␛[20«m?»~ | F9 | |
␛[21«m?»~ | F10 | |
␛[23«m?»~ | F11 | (yes, 22 really is skipped) |
␛[24«m?»~ | F12 | |
␛[200«m?»~ | Begin Paste | Only emitted when bracketed paste mode is activated |
␛[201«m?»~ | End Paste | Ditto |
«m» is a modifier sequence:
Sequence | Shift | Alt | Ctrl |
---|---|---|---|
;2 | ✓ | | |
;3 | | ✓ | |
;4 | ✓ | ✓ | |
;5 | | | ✓ |
;6 | ✓ | | ✓ |
;7 | | ✓ | ✓ |
;8 | ✓ | ✓ | ✓ |
and «m?» is an optional modifier sequence.
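Assuming these modifier sequences follow the usual xterm encoding, where the number is one plus a bit mask (Shift = 1, Alt = 2, Ctrl = 4), decoding can be sketched as below. The function name decode_modifier is hypothetical.

```python
def decode_modifier(sequence: str) -> set:
    """Decode ';2' through ';8' into a set of modifier names.

    An empty string (an omitted optional modifier) means no modifiers.
    """
    if sequence == "":
        return set()
    if not sequence.startswith(";") or not sequence[1:].isdigit():
        raise ValueError("modifier sequence must look like ';N'")
    value = int(sequence[1:])
    if not 2 <= value <= 8:
        raise ValueError("modifier value must be between 2 and 8")
    bits = value - 1  # assumed xterm-style encoding: 1 + bit mask
    names = set()
    if bits & 1:
        names.add("Shift")
    if bits & 2:
        names.add("Alt")
    if bits & 4:
        names.add("Ctrl")
    return names
```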
In environments with keys F13 through F24, they are mapped to F1 through F12 with the shift modifier.
As special cases, Delete, Insert, Home, End, Page Up and Down, and F1 and F12 with the Ctrl-Alt or Ctrl-Alt-Shift modifiers are reserved and not passed on to the application.
Input Modes
The following options are added to the options for Basic Text:
Name | Type | Applicability |
---|---|---|
Immediate mode | Boolean | Input |
Hidden mode | Boolean | Input |
Immediate mode
In Immediate input mode, each keypress is treated as if it were followed by U+A and immediately sent to the application without the extra U+A. And as a special case, U+C (FF) is not replaced in immediate mode.
Hidden mode
In Hidden input mode, terminal implementations should not echo input characters back to the terminal.
Restricted Text
The Restricted Text format is a subset of the Basic Text format. It incorporates several restrictions which reduce the expressiveness of the format in order to reduce visual ambiguity.
This format is entirely hypothetical at this time. It's formed from a loose collection of ideas from a variety of sources, and is not yet ready for any practical purpose.
This format does not define conversion from Basic Text or other less restrictive formats, as that may cause meaning to be silently lost. Instead, errors should be reported when content not meeting these restrictions is encountered in any context where restricted text is expected. See Basic Text for an unrestricted alternative.
Definitions
A string is in Restricted Text form iff:
- it is in Basic Text form, and
- it is in NFKC form, and
- it is Moderately Restricted text, and
- it does not contain any of the sequences listed in the Sequence Table.
A stream is in Restricted Text form iff:
- it is a stream in Basic Text form, and
- it consists entirely of a string in Restricted Text form.
Note that even though this excludes U+34F (COMBINING GRAPHEME JOINER), the Stream Safe Text Format is still required; content must simply avoid using excessively long sequences of non-starters.
Sequence Table
Sequence | Error |
---|---|
[U+FE00–U+FE0F] | "Variation selectors are not always visually distinct" |
[U+E0100–U+E01EF] | "Variation selectors are not always visually distinct" |
Default Ignorable Code Points | "Default Ignorable Code Points are not visually distinct" |
Old Hangul Jamo | "Conjoining Hangul Jamo are restricted in RFC5892" |
Tag Characters | "Tag Characters are not permitted" |
Private-Use Characters | "Private-use characters depend on private agreements" |
Conversion
From Basic Text string to Restricted Text string
To convert a Basic Text string into a Restricted Text string in a manner that never loses information but may fail:
- If performing toNFKC with the Normalization Process for Stabilized Strings would alter the contents, error with "Restricted Text must be in NFKC form".
- If Restriction Level Detection classifies the string as less than Moderately Restricted, error with "Restricted Text must be Moderately Restricted".
- Perform the Error actions from the Sequence Table.
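Part of this check can be approximated with Python's standard library, as sketched below. Restriction Level Detection is not available there and is omitted, plain NFKC stands in for the stabilized-strings variant named above, and only the two variation-selector rows of the Sequence Table are covered. The function name is hypothetical.

```python
import re
import unicodedata

# The two variation-selector ranges from the Sequence Table.
VARIATION_SELECTORS = re.compile("[\ufe00-\ufe0f\U000e0100-\U000e01ef]")


def check_restricted_sample(text: str) -> None:
    """Raise ValueError for the subset of conditions this sketch covers."""
    if not unicodedata.is_normalized("NFKC", text):
        raise ValueError("Restricted Text must be in NFKC form")
    if VARIATION_SELECTORS.search(text):
        raise ValueError("Variation selectors are not always visually distinct")
```

Because nothing is rewritten, the input either passes unchanged or fails, matching the "never loses information but may fail" intent.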
From Basic Text stream to Restricted Text stream
To convert a Basic Text stream into a Restricted Text stream in a manner that never loses information but may fail:
TODO
TODO: "Moderately Restricted" isn't stable over time.
TODO: Mixed-Number Detection
TODO: Unicode Security Mechanisms also specifies some Optional Detection rules.
TODO: U+2126 (OHM SIGN) normalizes to U+3A9 (GREEK CAPITAL LETTER OMEGA); does "Moderately Restricted" permit this Greek letter to be mixed with otherwise Latin script?
TODO: Several Braille scalars have visual similarities with other scalars, such as U+2800 and U+20, U+2802 and U+B7, and so on.
TODO: Several scalars such as U+1160, U+2062, U+FFA0, U+115F, U+16FE4, and possibly others, may display as whitespace despite not being categorized as whitespace. Can we constrain them with a mixed-script constraint, or some other mechanism?
TODO: Implicit Directional Marks have no display.
Background
This document explains the decisions behind the Basic Text format and provides links to related standards, documentation, and other resources.
Overall approach to Basic Text
Approach
Basic Text is a new and still evolving format. If the explanations in the rest of this document convey finality in the decisions, it's only because they're drafts of the kinds of things a design document may eventually want to say.
Feedback, corrections, bug reports, or even just example bodies of text that would be interesting to test on are all welcome; please file issues in the issue tracker!
No Stability (Yet)
At this time, there is no stability guarantee, for either forwards or backwards compatibility.
In the future, Basic Text is expected to have a stability policy where strings which are valid in one version remain valid in newer versions, even as Unicode and NFC evolve. Unassigned scalar values are isolated with CGJs to protect them from future normalizations. Older versions should still be able to read newer strings, though they may insert extra CGJs to isolate scalar values they don't recognize.
Rationale
The following sections provide information about the decisions that Basic Text makes, and links to related standards and documentation.
NFC, Normalization
Basic Text normalizes to NFC, using a special algorithm to minimize loss of intent. See this page for motivation and rationale.
Newlines
In Basic Text content, U+A, and nothing else, is a line terminator, sometimes also called a newline.
Why not use the CRLF convention? CRLF is what IETF RFCs use, as well as ASCII itself (after ASCII-1986 / ECMA-6:1985 at least). Basic Text uses U+A because:
- U+A is what IEEE POSIX, ISO C and C++, and many other programs use in program data.
- The newline convention is only one scalar, so it's simpler than CRLF and avoids corner-case concerns of what to do when CR and LF are split apart.
- The newline convention is also only one byte in UTF-8, so it can be recognized without full UTF-8 decoding.
- All practical text editors and viewers today support the U+A newline convention, even Windows Notepad.
Lossy text conversion implicitly translates plain CR and CRLF into newline, which is a common convention. Python calls this behavior Universal Newlines.
By default, lossy text conversion translates NEL, LS, and PS into U+20, which, for those rare formats which recognize these scalars at all, is compatible with how they're typically treated. As options, lossy text conversion can also translate NEL, or LS and PS, into newlines, for example to support the text conventions used in XML 1.1 and JavaScript source code, respectively.
By default, strict text conversion rejects CRLF and other line terminator sequences other than U+A. As an option, strict text conversion can translate U+A into CRLF, for example to support the text conventions used in IETF RFCs.
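The newline rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the prototype's API: the function names and option names are hypothetical, and treating NEL, LS, and PS as rejected in strict mode is our reading of "other line terminator sequences".

```python
def lossy_newlines(text, nel_as_newline=False, ls_ps_as_newline=False):
    """Translate legacy line terminators as the lossy conversion is described."""
    # CRLF must be handled before lone CR so it yields one newline, not two.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # NEL (U+85) becomes U+20 by default, or U+A when the option is set.
    text = text.replace("\u0085", "\n" if nel_as_newline else " ")
    # LS (U+2028) and PS (U+2029) share a single option.
    ls_ps = "\n" if ls_ps_as_newline else " "
    return text.replace("\u2028", ls_ps).replace("\u2029", ls_ps)


def strict_check_newlines(text):
    """Raise ValueError if a line terminator other than U+A appears (assumed set)."""
    for ch in ("\r", "\u0085", "\u2028", "\u2029"):
        if ch in text:
            raise ValueError(f"disallowed line terminator U+{ord(ch):X}")
```

For example, `lossy_newlines("a\r\nb\rc\n")` yields three U+A-terminated lines, while `strict_check_newlines("a\r\n")` raises.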
Why not follow the Unicode Newline Guidelines' Recommendations?
- We effectively target a virtual platform with U+A as the platform NLF.
- Plain text does not have an inherent concept of paragraphs, so recommendation R2 isn't meaningful. Paragraphs are only meaningful in higher-level protocols (for example, see HTML's <br> and <p>).
- Recommendation R4's inclusion of FF, LS, and PS seems to be universally ignored in line-reading functions of all mainstream programming languages we've surveyed.
- NEL, LS, and PS are rare in practice, and formats which even recognize them are rare in practice.
- Also, see the section on Form Feed below.
Plain text uses line terminators, rather than line separators. This means that plain text streams end with a line terminator (if they are non-empty). Lossy conversion implicitly adds a line terminator at the end if needed, and strict conversion requires a line terminator at the end if needed.
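As a minimal sketch of the terminator rule, assuming hypothetical helper names that are not part of the prototype: lossy conversion appends a missing final U+A, and strict conversion only checks for it. An empty stream is left alone in both cases.

```python
def lossy_terminate(text):
    """Append U+A to a non-empty stream that does not already end with one."""
    if text and not text.endswith("\n"):
        return text + "\n"
    return text


def strict_is_terminated(text):
    """A stream is acceptable if it is empty or ends with U+A."""
    return text == "" or text.endswith("\n")
```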
Form Feed
Pagination control is primarily a feature of higher-level protocols, and not part of most informal notions of "plain text". U+C does have some uses in practice, however they're fairly obscure, and it's often not recognized.
And even in places where U+C is recognized, there is ambiguity about what it means. Implementations differ on whether it's meant to position the cursor at the beginning of a line in the next page or at its previous column in the next page. And, they differ on whether it should be considered a line terminator.
And, on devices where U+C clears the current screen, that's a significant side effect which could interfere with the visibility of other unrelated data.
So Basic Text excludes U+C. Lossy conversion translates it to U+20 so that it continues to function as whitespace for parsing purposes, but all ambiguity about its meaning is resolved.
Tab
It might be tempting to disallow tab on the basis of it being a control code primarily concerned with how text is aligned on the screen, which is typically considered a feature of higher-level protocols. However, Tab's effects are much more mild than other control codes, and in practice it has several uses, some of which require it, so we allow it.
We can refer to it as just "Tab" though, rather than "Horizontal Tab", since Basic Text excludes Vertical Tab.
Backspace, Delete, Vertical Tab
These do appear in some other "plain text" concepts, however they're rare in practice. Here, plain text means text that doesn't include control codes for cursor positioning. Cursor positioning controls are widely used with terminals, but that's a different use case than what Basic Text is targeting.
Alert
Ringing the terminal bell is well outside the scope for plain text, and theoretically could even be used for side-channel communication.
Escape
Escape sequences can cause a wide variety of side effects. Plain text shouldn't be able to have side effects.
Basic Text includes some fairly conservative regular expressions for matching not just the U+1B, but also the sequences which commonly make up escape sequences, such as CSI and OSC, so that entire sequences are cleanly ignored, as is common with unrecognized escape sequences.
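To give a feel for what such matching looks like, here is a rough sketch. The pattern below is not the regular expression from the specification; it is our own approximation based on the common ECMA-48 shapes of CSI sequences (parameter bytes, intermediate bytes, one final byte) and OSC sequences (terminated by BEL or by ESC followed by a backslash). The function name is hypothetical, and removing the match is one interpretation of "ignored".

```python
import re

ESCAPE_SEQUENCE = re.compile(
    r"\x1b\[[0-?]*[ -/]*[@-~]"              # CSI: params, intermediates, final byte
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"   # OSC: body, then BEL or ESC \
    r"|\x1b"                                # any remaining lone U+1B
)


def strip_escape_sequences(text):
    """Drop whole escape sequences rather than only the U+1B that starts them."""
    return ESCAPE_SEQUENCE.sub("", text)
```

Matching the whole sequence matters: stripping only U+1B from `"\x1b[31mred"` would leave the stray text `[31mred` behind.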
Deprecated scalar values
U+149, U+673, U+F77, U+F79, U+17A3, and U+17A4 are officially deprecated, "their use is strongly discouraged", and they have recommended replacements.
U+2329 and U+232A have canonical equivalents with different appearances so their use is deprecated and it's not recommended to automatically replace them with their canonical equivalents. There is a suggested replacement, so the table suggests that, but does not perform the substitution automatically.
Other Deprecated Characters in Markup has additional information.
Unassigned Mathematical Alphanumeric Symbols
In the Mathematical Alphanumeric Symbols block, the scalar value U+1D455 would be the place for ℎ, however Unicode already had an ℎ at U+210E, so U+1D455 was left unassigned.
Several other characters are treated similarly: U+9E4, U+9E5, U+A64, U+A65, U+AE4, U+AE5, U+B64, U+B65, U+BE4, U+BE5, U+C64, U+C65, U+CE4, U+CE5, U+D64, U+D65, U+2072, U+2073, U+1D455, U+1D49D, U+1D4A0, U+1D4A1, U+1D4A3, U+1D4A4, U+1D4A7, U+1D4A8, U+1D4AD, U+1D4BA, U+1D4BC, U+1D4C4, U+1D506, U+1D50B, U+1D50C, U+1D515, U+1D51D, U+1D53A, U+1D53F, U+1D545, U+1D547, U+1D548, U+1D549, and U+1D551.
Unicode considers these scalar values unassigned, so they could potentially be assigned new meanings in the future. Consequently, in Basic Text they convert to U+FFFD rather than their designated replacements.
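The following sketch shows the replacement described above, using the list of scalar values from this section. The helper names are hypothetical. The `is_unassigned` check relies on Python's `unicodedata` module, whose answers depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The scalar values listed in this section.
LISTED_UNASSIGNED = {
    0x9E4, 0x9E5, 0xA64, 0xA65, 0xAE4, 0xAE5, 0xB64, 0xB65, 0xBE4, 0xBE5,
    0xC64, 0xC65, 0xCE4, 0xCE5, 0xD64, 0xD65, 0x2072, 0x2073,
    0x1D455, 0x1D49D, 0x1D4A0, 0x1D4A1, 0x1D4A3, 0x1D4A4, 0x1D4A7, 0x1D4A8,
    0x1D4AD, 0x1D4BA, 0x1D4BC, 0x1D4C4, 0x1D506, 0x1D50B, 0x1D50C, 0x1D515,
    0x1D51D, 0x1D53A, 0x1D53F, 0x1D545, 0x1D547, 0x1D548, 0x1D549, 0x1D551,
}


def is_unassigned(cp):
    """General category Cn covers unassigned code points (and noncharacters)."""
    return unicodedata.category(chr(cp)) == "Cn"


def replace_listed_unassigned(text):
    """Convert each listed scalar value to U+FFFD, leaving everything else alone."""
    return "".join("\ufffd" if ord(ch) in LISTED_UNASSIGNED else ch for ch in text)
```

Note that the existing ℎ at U+210E is untouched; only the hole at U+1D455 is replaced.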
Not-recommended scalar values with singleton canonical decompositions
Unicode recommends the "regular letter" forms be used in preference to the dedicated unit characters for U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. They already canonically decompose to the regular letter forms, so they're already excluded from NFC. Rejecting them in strict conversion means that any assumptions about them being handled differently from the regular letter forms will be promptly corrected.
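The singleton decompositions can be observed directly with Python's standard `unicodedata` module. The `strict_check_unit_signs` helper is a hypothetical illustration of strict rejection, not the prototype's interface.

```python
import unicodedata

UNIT_SIGNS = {
    "\u2126": "OHM SIGN",
    "\u212a": "KELVIN SIGN",
    "\u212b": "ANGSTROM SIGN",
}


def strict_check_unit_signs(text):
    """Reject the dedicated unit characters, naming the preferred letter form."""
    for index, ch in enumerate(text):
        if ch in UNIT_SIGNS:
            preferred = unicodedata.normalize("NFC", ch)
            raise ValueError(
                f"{UNIT_SIGNS[ch]} at index {index}; use U+{ord(preferred):04X} instead"
            )
```

Plain NFC silently maps U+2126 to U+3A9, U+212A to U+4B, and U+212B to U+C5; the strict check surfaces the difference instead of hiding it.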
Characters Whose Use Is Discouraged
Khmer scalar values U+17B4 and U+17B5 "should be considered errors in the encoding". Also, "the use of U+17D8 Khmer sign beyyal is discouraged".
For the Cyrillic value U+2DF5, Unicode prefers the sequence U+2DED U+2DEE.
"Forbidden Characters"
There were a few errors in the Unicode normalization algorithm in versions before Unicode 4.1. The affected scalar values and sequences are identified as Forbidden Characters. However, they are described as being rare in practice, and they've been corrected since Unicode 4.1, published in 2005 (and earlier in some cases), so they're not restricted here.
"Ghost Characters"
Ghost characters are characters which don't correspond to any existing written characters, and seem to have been created by accident. It's tempting to restrict them, however, Unicode itself has not deprecated them, and it's possible that they'll acquire meanings, so we don't restrict them here.
Hangul Compatibility Jamo
The Hangul Compatibility Jamo block in Unicode is one of the blocks added to Unicode for compatibility with other standards, however it also turns out to be practical, for example for displaying isolated Jamo, so we don't restrict them here.
Noncharacters
Noncharacters are like Private-Use Characters, except they are not intended for interchange. These characters are not widely used, and when they are used, there is often confusion about what they mean or whether they are valid. Since they aren't text, we exclude them here to avoid the confusion.
Also, some implementations are unable to handle U+FFFE because in UTF-16 it can interfere with endianness detection.
See also Noncharacters in Markup.
As with U+0, U+FFFC, and U+FFF9–U+FFFB, applications wishing to use noncharacters for private use should use the plain Unicode format rather than the Basic Text format.
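Noncharacters are a fixed set defined by Unicode: U+FDD0 through U+FDEF, plus the last two code points of every plane. A minimal check, with a hypothetical function name, might look like this:

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text):
    """Return the indices of noncharacters in a string."""
    return [i for i, ch in enumerate(text) if is_noncharacter(ord(ch))]
```

The mask test works because clearing the lowest bit maps both xxFFFE and xxFFFF to xxFFFE, in every plane.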
Variation sequences
Basic Text does not restrict the Variation sequences, because Unicode may add new variation sequences over time. Restricted Text excludes the variation sequences entirely.
Characters requiring out-of-band information
Some characters require additional data not described by Unicode to properly display.
U+FFFC (OBJECT REPLACEMENT CHARACTER) has no way to indicate which object it references. See also Object Replacement Character in Markup.
U+FFF9–U+FFFB, the Interlinear Annotation Characters, refer to external information, and ignoring them may change the meaning of a text. See also Interlinear Annotation Characters in Markup.
C1 controls
See Newlines for more information about U+85.
The rest of the C1 controls are non-printing control codes rather than text.
Latin Ligatures
The Latin Ligatures were added to Unicode for round-trip compatibility with other character sets. As explained in the Unicode FAQ on Ligatures, ligatures should be handled by a display system, rather than being encoded in the text itself.
Lira Sign
Unicode says that U+20A4 is interchangeable with U+A3, however it doesn't deprecate or discourage its use. U+20A4 also displays differently from U+A3, having two horizontal lines in the middle rather than one. Consequently, Basic Text permits the Lira Sign.
Musical Controls
The Musical Controls U+1D173–U+1D17A are non-printing control characters, however they can usually be safely ignored by producers that don't support them.
The musical symbols and controls in Unicode are not sufficient to express most forms of music, however higher-level formats built on top of Unicode do use these symbols, in combination with their own specialized markup features, so Basic Text includes them.
See also Musical Controls in Markup.
Tag Characters
U+E0001 is deprecated in Unicode. See Language Tagging in Markup for more information.
The rest of the characters in U+E0000–U+E007F were once deprecated, but no longer are. One new use for them is regional indicator modifiers for national flags. So Basic Text includes them.
Directional Formatting Characters
Unicode specifies several Directional Formatting Characters with special behavior. The "implicit" characters U+200E (LRM), U+200F (RLM), and U+61C (ALM), are supported, however the "explicit" characters U+202A (LRE), U+202B (RLE), U+202D (LRO), U+202E (RLO), U+202C (PDF), U+2066 (LRI), U+2067 (RLI), U+2068 (FSI), and U+2069 (PDI) are not, since they can have non-local display effects.
Also, the explicit characters can be used in a way that depends on U+2029, the Paragraph Separator, for correct interpretation, and Basic Text excludes that scalar value. They can also depend on higher-level protocols, such as table cell boundaries in markup formats, which are outside the scope of Basic Text.
Explicit Directional Overrides also have some security concerns.
In terms of Bidirectional Conformance, Basic Text does not specify how content is displayed or rendered, and does not specify any interactions with higher-level protocols. In terms of Explicit Formatting Characters, Basic Text itself supports "No bidirectional formatting" or "Implicit bidirectionality", but not "Non-isolate bidirectionality" or "Full bidirectionality" by itself.
Users needing explicit bidirectional display control may use higher-level markup languages layered on top, providing Markup And Formatting syntax.
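The split between implicit and explicit characters can be expressed as two small sets. This is only a sketch with hypothetical names; it reports where explicit characters occur and does not say what a conversion should do with them.

```python
# Implicit marks, which Basic Text supports.
IMPLICIT_BIDI = {"\u200e", "\u200f", "\u061c"}  # LRM, RLM, ALM

# Explicit embeddings, overrides, and isolates, which Basic Text excludes.
EXPLICIT_BIDI = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def find_explicit_bidi(text):
    """Return the indices of explicit directional formatting characters."""
    return [i for i, ch in enumerate(text) if ch in EXPLICIT_BIDI]
```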
Relationships to other standards and conventions
Relationship to IETF RFC 8264 "PRECIS"
PRECIS is mostly focused on identifiers and has several restrictions that are inappropriate for streams, such as disallowing whitespace, but it also has a Freeform Class (4.3) which is similar in spirit to, and one of the inspirations of, the formats defined here.
PRECIS doesn't permit tabs; we include them for the reasons mentioned above.
And, PRECIS doesn't restrict obsolete or discouraged scalar values, so in general it's more permissive than Basic Text.
Relationship to POSIX Text Files and Printable Files
A text file in POSIX:
- consists of zero or more lines in POSIX, which all end in newlines
- excludes NUL
- lines are at most LINE_MAX bytes long including the newline.
Basic Text excludes NUL (it's a C0 control), and requires content to consist of lines which all end in newlines.
Basic Text has no LINE_MAX-like restriction.
A printable file in POSIX is a text file which contains no control codes other than whitespace in POSIX (space, tab, newline, carriage-return, form-feed, and vertical-tab) and backspace in POSIX (typically U+8).
Basic Text excludes most of the same control codes. It doesn't include carriage-return, form-feed, vertical-tab, or backspace, as line printer commands aren't part of plain text content.
Relationship to Wikipedia's "plain text"
The plain text format here is intended to align with the use cases described in the Wikipedia article on plain text. The character encoding is known, and all characters are either printable or have behavior relevant to simple text display.
Relationship to Unicode's "plain text"
The plain text format here is a more specific version of the Unicode definition of "plain text". Unicode says
Plain text must contain enough information to permit the text to be rendered legibly, and nothing more.
however it includes scalars which ring the terminal bell and have other side effects, it often includes redundant ways to encode the same logical content, it includes numerous compatibility mechanisms, and it contains flexibility for parties with private agreements.
The Basic Text format here is more focused on being just a plain text format with just enough information to permit the text to be rendered legibly.
Relationship to "What makes a Unicode code point safe?"
The blog post "What makes a Unicode code point safe?" has a list of safety criteria with much in common with the plain text format here. Both exclude unassigned scalar values, noncharacters, private-use characters, surrogate code points, and most control codes. And both require text be stable under normalization.
The Basic Text format here permits format characters, whitespace characters, punctuation, and combining characters, as they are commonly used in plain text documents.
The Restricted Text format requires NFKC, which excludes many, though not all, whitespace and formatting characters.
Relationship to "Canonical Equivalence in Applications"
Unicode Technical Note #5 describes various considerations related to normalization, including two alternate normalization forms, called FCD and FCC. We aren't using these here, mainly because we're using NFC (and NFKC) and FCD and FCC aren't fully compatible with NFC.
Relationship to FTFY
FTFY implements several related features.
Not addressed in Basic Text or Restricted Text:
- Mojibake
- HTML entities
- Curly quotation marks
Addressed in Restricted Text and not Basic Text:
- Half-width and full-width characters (via NFKC)
Addressed in Basic Text and Restricted Text:
- Escape sequences
- Control codes
- Ligatures
- Normalization
- Line breaks
- Lone Surrogates
- Byte-Order Marks
Basic Text discards information useful for recovering the original intent from Mojibake. In the future, we may want to add an option to the conversion from Unicode to Basic Text which uses C1 control codes and lone surrogates to guess the intended meaning of an improperly transcoded text.
Relationship to Markup
Unicode in XML and other Markup Languages describes the relationship between Unicode and markup languages. It includes recommendations about Characters not Suitable for use With Markup. Many of these recommendations are incorporated into Basic Text, however some are specific to the needs of markup languages, and Basic Text intends to be useful for plain text as well.
For example, Basic Text includes the Bidi control characters even though they are duplicated by markup features.
Is including NFC the right thing to do?
It's a good question. The following are some notes.
Is normalization inherent?
No, the Stream-Safe and NFC rules in Basic Text conversions are carefully designed to be performed last, so they could conceptually be split out. The question is whether they should be.
Which normalization form?
NFC seems to be by far the most widely used for text interchange, and mostly preserves the meaning of all practical Unicode text (see the following sections for more discussion), so it seems the best choice for the Basic Text format.
Requiring that everything be compatibility-normalized can eliminate several cases of visual ambiguity, and every string in NFKC is also in NFC, so it seems the best choice for the Restricted Text format.
What are the advantages of normalizing?
- Portability - Text that isn't normalized is sometimes interpreted and displayed in different ways, depending on the environment. Normalization ensures that, in aspects related to normalization, content is independent of the environment.
A common argument for non-normalized text is that some fonts render them differently from their normalized counterparts, and users may specifically wish to use the non-normalized versions. However, content that does this may not display properly in other environments using different fonts, so we specifically want to avoid such situations.
- Avoiding common application bugs - Normalization eliminates some situations where two strings that look the same contain different scalar values, making content easier to work with.
What are the potential disadvantages?
The following are some notes about various situations where NFC has been considered to be semantically lossy.
CJK Compatibility Ideographs
Unicode includes 1002 CJK Compatibility Ideograph scalar values which were originally intended only for use in preserving round-trip compatibility with other character set standards. However, many of them are associated with slightly different appearances, and this has led to a lot of confusion and some dispute.
For example, the scalar value U+2F8A6 canonically decomposes to U+6148. This means that Unicode considers these two scalar values to be canonically equivalent, such that they are required to have the same visual appearance and behavior. Some systems do treat them this way, however many popular systems today display them slightly differently.
Users understandably expect that the difference in appearance is significant and will use non-canonical forms specifically for their unique appearance:
- https://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0216.html
- https://www.w3.org/wiki/I18N/CanonicalNormalizationIssues#Problems_with_canonical_singletons
At one point, the Unicode committee even considered defining "variant normal forms" which would be identical to NFC and NFD except for excluding these CJK Compatibility Ideographs, however did not end up pursuing the idea.
As of Unicode 6.3, all 1002 of these scalar values have standardized variations which allow them to be normalized into a form which records the scalar value they were normalized from. Conversion into Basic Text includes a rule which uses these variation sequences instead of the standard canonical decompositions, which produces valid NFC output, but unlike plain toNFC preserves the information about which specific CJK Compatibility Ideographs were used.
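The idea can be sketched as a substitution pass that runs before ordinary NFC. The one-entry table below is purely illustrative: the pairing of U+2F8A6 with U+6148 followed by U+FE00 is our assumption and should be checked against Unicode's StandardizedVariants.txt, which is where a real implementation would load the full table from. The function name is hypothetical.

```python
import unicodedata

# Illustrative entry only (assumed); load the real table from StandardizedVariants.txt.
EXAMPLE_VARIATIONS = {"\U0002F8A6": "\u6148\ufe00"}


def to_nfc_preserving_cjk(text, variations=EXAMPLE_VARIATIONS):
    """Swap compatibility ideographs for variation sequences, then apply NFC."""
    substituted = "".join(variations.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", substituted)
```

Plain NFC maps U+2F8A6 to a bare U+6148 and the distinction is gone. With the substitution, the base ideograph is followed by a variation selector, and that sequence is already in NFC, so the information survives normalization.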
At this time, it appears most implementations don't currently implement these variation sequences, so the characters in this form still unfortunately will often not display as intended. But at least this way, all the information is preserved, so if implementations wish to implement them, they can be displayed as intended.
Biblical Hebrew
According to Unicode, this was once a problem, but there's a fix now.
Bugs in implementations and fonts
Many apparent issues with NFC turn out to be issues with specific implementations or specific fonts, which tend to fade away over time as software is updated. Such issues are not considered here.
An example of this is here.
Greek Polytonic Support
Early versions of Unicode appear to have used a confusing appearance for the TONOS mark, and several fonts developed at the time did as well. See:
Unicode was updated to use a different appearance, and newer fonts seem to use it, and this seems to be a satisfactory solution.
Greek Ano Teleia (U+387)
Unicode canonicalizes the Greek Ano Teleia (U+387) into Middle Dot (U+B7), which doesn't preserve its appearance and creates problems with parsing because the actual Greek ano teleia is considered punctuation, however Middle Dot is considered an identifier character (reflecting its usage in Catalan, for example). See the following links for details:
- http://archives.miloush.net/michkap/archive/2011/05/20/10166588.html
- https://op111.net/2008/03/17/linux-greek-punctuation-ano-teleia/
The Unicode Standard's explanation, in section 7.2 Greek, paragraph Compatibility Punctuation, is:
ISO/IEC 8859-7 and most vendor code pages for Greek simply make use of [...] middle dot for the punctuation in question. Therefore, use of [...] U+387 is not necessary for interoperating with legacy Greek data, and [its] use is not generally encouraged for representation of Greek punctuation.
According to (English) Wikipedia, U+387 is infrequently encountered. And Greek Wikipedia seems to use U+387 and U+B7 interchangeably.
W3C Guidance
The W3C says specs should not specify normalization for storage/interchange:
and suggests an approach where specs normalize only when needed, and ideally only internally to other algorithms that need it.
The rationale can be summed up as:
Normalization can remove distinctions that the users applied intentionally.
As discussed in the above sections, almost all of the places where information about such distinctions seem to be lost either have adequate solutions, or are caused by bugs or missing features in fonts or Unicode implementations.
There is also a difference in priorities; Basic Text is all about building the foundations of a platform for the future, while the W3C's recommendation is about helping users use the Web today. Consequently, Basic Text is more inclined to accept problems if they are believed to merely be limitations of today's environments that can be fixed.
What about the performance impact of NFC normalization?
The performance impact has not yet been evaluated.
One observation is that for text which is already intended to be in normalized form, it should be relatively cheap to simply verify that, and this should ideally be a very common case.
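As an illustration of the verify-first approach, Python 3.8 and later expose `unicodedata.is_normalized`, which can confirm that a string is already normalized without building a new one. The sketch below uses plain NFC only; it does not model the special normalization rules Basic Text adds on top, and the function name is hypothetical.

```python
import unicodedata


def ensure_nfc(text):
    """Return text unchanged when it is already NFC; normalize only otherwise."""
    if unicodedata.is_normalized("NFC", text):
        return text  # common case: no new string is allocated
    return unicodedata.normalize("NFC", text)
```

No timing claims are made here; measuring the two paths on representative text would be part of the evaluation that hasn't been done yet.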