Basic Text

The Basic Text format is a subset of the Unicode format and is meant to fulfill common notions of "plain text". It is not yet standardized anywhere and may evolve, but it is usable for many purposes.

Basic Text permits homoglyphs and other visual ambiguities; see Restricted Text for an alternative which might provide some mitigations.

For rationale and background information, see Background. For a prototype implementation, see the GitHub repo.

Definitions

A string is in Basic Text form iff:

A stream is in Basic Text form iff:

  • it consists entirely of a string in Basic Text form, and
  • it is empty or ends with U+A.
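The stream rule layers one extra condition on top of the string rule. A minimal sketch of that layering follows; the name `is_basic_text_stream` and the `is_basic_text_string` callback are hypothetical, and the string-level check is passed in rather than implemented here.

```python
from typing import Callable


def is_basic_text_stream(
    content: str,
    is_basic_text_string: Callable[[str], bool],
) -> bool:
    """Check the two stream conditions listed above.

    `is_basic_text_string` is a caller-supplied (hypothetical) predicate for
    the string-level definition; this sketch only adds the stream-level rule.
    """
    # Condition 1: the whole content is a string in Basic Text form.
    if not is_basic_text_string(content):
        return False
    # Condition 2: the stream is empty or ends with U+A.
    return content == "" or content.endswith("\n")
```

For example, with a placeholder predicate that accepts everything, `"hello\n"` passes and `"hello"` does not.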

Supplementary definitions

Basic Text non-starter

A Unicode scalar value is a Basic Text non-starter iff:

Basic Text non-ender

A Unicode scalar value is a Basic Text non-ender iff:

  • its Grapheme_Cluster_Break is ZWJ or Prepend.

Sequence Table

Sequence | aka | Replacement | Error
U+D U+A | CRLF | U+A | "Use U+A to terminate a line"
U+D | CR | U+A | "Use U+A to terminate a line"
[U+C]+ U+D U+A |  | U+A | "Control code not valid in text"
[U+C]+ U+A |  | U+A | "Control code not valid in text"
[U+C]+ U+D |  | U+A | "Control code not valid in text"
[U+C]+ | FF | U+20 | "Control code not valid in text"
U+1B U+5B [U+20–U+3F]* U+6D | SGR |  | "Color escape sequences are not enabled"
[U+1B]+ U+5B U+5B [U+–U+7F]? |  |  | "Unrecognized escape sequence"
[U+1B]+ U+5B [U+20–U+3F]* [U+40–U+7E]? | CSI |  | "Unrecognized escape sequence"
[U+1B]+ U+5D [^U+7,U+18,U+1B]* [U+7,U+18]? | OSC |  | "Unrecognized escape sequence"
[U+1B]+ [U+40–U+7E] | ESC |  | "Unrecognized escape sequence"
[U+1B]+ | ESC |  | "Escape code not valid in text"
[U+0–U+8,U+B,U+E–U+1F] | C0 | U+FFFD | "Control code not valid in text"
U+7F | DEL | U+FFFD | "Control code not valid in text"
U+85 | NEL | U+20 | "Control code not valid in text"
[U+80–U+84,U+86–U+9F] | C1 | U+FFFD | "Control code not valid in text"
U+149 | ʼn | U+2BC U+6E | "Use U+2BC U+6E instead of U+149"
U+673 | ا ٟ | U+627 U+65F | "Use U+627 U+65F instead of U+673"
U+9E4 |  | U+FFFD | "Use U+964 instead of U+9E4"
U+9E5 |  | U+FFFD | "Use U+965 instead of U+9E5"
U+A64 |  | U+FFFD | "Use U+964 instead of U+A64"
U+A65 |  | U+FFFD | "Use U+965 instead of U+A65"
U+AE4 |  | U+FFFD | "Use U+964 instead of U+AE4"
U+AE5 |  | U+FFFD | "Use U+965 instead of U+AE5"
U+B64 |  | U+FFFD | "Use U+964 instead of U+B64"
U+B65 |  | U+FFFD | "Use U+965 instead of U+B65"
U+BE4 |  | U+FFFD | "Use U+964 instead of U+BE4"
U+BE5 |  | U+FFFD | "Use U+965 instead of U+BE5"
U+C64 |  | U+FFFD | "Use U+964 instead of U+C64"
U+C65 |  | U+FFFD | "Use U+965 instead of U+C65"
U+CE4 |  | U+FFFD | "Use U+964 instead of U+CE4"
U+CE5 |  | U+FFFD | "Use U+965 instead of U+CE5"
U+D64 |  | U+FFFD | "Use U+964 instead of U+D64"
U+D65 |  | U+FFFD | "Use U+965 instead of U+D65"
U+F77 | ◌ྲ◌ཱྀ | U+FB2 U+F71 U+F80 | "Use U+FB2 U+F71 U+F80 instead of U+F77"
U+F79 | ◌ླ◌ཱྀ | U+FB3 U+F71 U+F80 | "Use U+FB3 U+F71 U+F80 instead of U+F79"
U+17A3 |  | U+17A2 | "Use U+17A2 instead of U+17A3"
U+17A4 | អា | U+17A2 U+17B6 | "Use U+17A2 U+17B6 instead of U+17A4"
U+17B4 |  | U+FFFD | "Omit U+17B4"
U+17B5 |  | U+FFFD | "Omit U+17B5"
U+17D8 |  | U+FFFD | "Spell beyyal with normal letters"
U+2028 | LS | U+20 | "Line separation is a rich-text function"
U+2029 | PS | U+20 | "Paragraph separation is a rich-text function"
U+202A | LRE | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+202B | RLE | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+202C | PDF | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+202D | LRO | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+202E | RLO | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+2066 | LRI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+2067 | RLI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+2068 | FSI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
U+2069 | PDI | U+FFFD | "Explicit Bidirectional Formatting Characters are unsupported"
[U+206A–U+206F] |  | U+FFFD | "Deprecated Format Characters are deprecated"
U+2072 | ² | U+FFFD | "Use U+B2 instead of U+2072"
U+2073 | ³ | U+FFFD | "Use U+B3 instead of U+2073"
U+2126 | Ω | U+3A9 | "Use U+3A9 instead of U+2126"
U+212A | K | U+4B | "Use U+4B instead of U+212A"
U+212B | Å | U+C5 | "Use U+C5 instead of U+212B"
U+2329 |  | U+FFFD | "Use U+27E8 instead of U+2329"
U+232A |  | U+FFFD | "Use U+27E9 instead of U+232A"
U+2DF5 |  ⷭⷮ | U+2DED U+2DEE | "Use U+2DED U+2DEE instead of U+2DF5"
[U+F900–U+FA0D,U+FA10,U+FA12,U+FA15–U+FA1E,U+FA20,U+FA22,U+FA25–U+FA26,U+FA2A–U+FA6D,U+FA70–U+FAD9] | CJK compatibility ideograph | Standardized Variant | "Use Standardized Variants instead of CJK Compatibility Ideographs"
U+FB00 | ff | U+66 U+66 | "Use U+66 U+66 instead of U+FB00"
U+FB01 | fi | U+66 U+69 | "Use U+66 U+69 instead of U+FB01"
U+FB02 | fl | U+66 U+6C | "Use U+66 U+6C instead of U+FB02"
U+FB03 | ffi | U+66 U+66 U+69 | "Use U+66 U+66 U+69 instead of U+FB03"
U+FB04 | ffl | U+66 U+66 U+6C | "Use U+66 U+66 U+6C instead of U+FB04"
U+FB05 | ſt | U+17F U+74 | "Use U+17F U+74 instead of U+FB05"
U+FB06 | st | U+73 U+74 | "Use U+73 U+74 instead of U+FB06"
[U+FDD0–U+FDEF] |  | U+FFFD | "Noncharacters are intended for internal use only"
U+FEFF | BOM | U+2060 | "U+FEFF is not necessary in Basic Text"
[U+FFF9–U+FFFB] |  | U+FFFD | "Interlinear Annotations depend on out-of-band information"
U+FFFC | ORC | U+FFFD | "U+FFFC depends on out-of-band information"
[U+FFFE,U+FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
U+1D455 |  | U+FFFD | "Use U+210E instead of U+1D455"
U+1D49D |  | U+FFFD | "Use U+212C instead of U+1D49D"
U+1D4A0 |  | U+FFFD | "Use U+2130 instead of U+1D4A0"
U+1D4A1 |  | U+FFFD | "Use U+2131 instead of U+1D4A1"
U+1D4A3 |  | U+FFFD | "Use U+210B instead of U+1D4A3"
U+1D4A4 |  | U+FFFD | "Use U+2110 instead of U+1D4A4"
U+1D4A7 |  | U+FFFD | "Use U+2112 instead of U+1D4A7"
U+1D4A8 |  | U+FFFD | "Use U+2133 instead of U+1D4A8"
U+1D4AD |  | U+FFFD | "Use U+211B instead of U+1D4AD"
U+1D4BA |  | U+FFFD | "Use U+212F instead of U+1D4BA"
U+1D4BC |  | U+FFFD | "Use U+210A instead of U+1D4BC"
U+1D4C4 |  | U+FFFD | "Use U+2134 instead of U+1D4C4"
U+1D506 |  | U+FFFD | "Use U+212D instead of U+1D506"
U+1D50B |  | U+FFFD | "Use U+210C instead of U+1D50B"
U+1D50C |  | U+FFFD | "Use U+2111 instead of U+1D50C"
U+1D515 |  | U+FFFD | "Use U+211C instead of U+1D515"
U+1D51D |  | U+FFFD | "Use U+2128 instead of U+1D51D"
U+1D53A |  | U+FFFD | "Use U+2102 instead of U+1D53A"
U+1D53F |  | U+FFFD | "Use U+210D instead of U+1D53F"
U+1D545 |  | U+FFFD | "Use U+2115 instead of U+1D545"
U+1D547 |  | U+FFFD | "Use U+2119 instead of U+1D547"
U+1D548 |  | U+FFFD | "Use U+211A instead of U+1D548"
U+1D549 |  | U+FFFD | "Use U+211D instead of U+1D549"
U+1D551 |  | U+FFFD | "Use U+2124 instead of U+1D551"
[U+1FFFE,U+1FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
U+111C4 | 𑆏𑆀 | U+1118F U+11180 | "Use U+1118F U+11180 instead of U+111C4"
[U+2F800–U+2FA1D] | CJK compatibility ideograph | Standardized Variant | "Use Standardized Variants instead of CJK Compatibility Ideographs"
[U+2FFFE,U+2FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+3FFFE,U+3FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+4FFFE,U+4FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+5FFFE,U+5FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+6FFFE,U+6FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+7FFFE,U+7FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+8FFFE,U+8FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+9FFFE,U+9FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+AFFFE,U+AFFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+BFFFE,U+BFFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+CFFFE,U+CFFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+DFFFE,U+DFFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
U+E0001 |  | U+FFFD | "Language tagging is a deprecated mechanism"
[U+EFFFE,U+EFFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+FFFFE,U+FFFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
[U+10FFFE,U+10FFFF] |  | U+FFFD | "Noncharacters are intended for internal use only"
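
As an illustration of how the Replacement entries in the table might be applied, here is a minimal Python sketch covering only a handful of rows. The names `SEQUENCE_RULES`, `SCALAR_RULES`, and `apply_replacements` are hypothetical, and applying the multi-scalar rows one after another in table order is an assumption of this sketch, chosen so that U+D U+A is handled before a lone U+D.

```python
import re

# A few multi-scalar rows, applied in table order (an assumption of this
# sketch) so that CRLF is rewritten before a lone CR can match.
SEQUENCE_RULES = [
    (re.compile("\r\n"), "\n"),     # U+D U+A (CRLF) -> U+A
    (re.compile("\r"), "\n"),       # U+D (CR)       -> U+A
    (re.compile("\x0c+\n"), "\n"),  # [U+C]+ U+A     -> U+A
    (re.compile("\x0c+"), " "),     # [U+C]+ (FF)    -> U+20
]

# A few single-scalar rows, keyed by code point for str.translate.
SCALAR_RULES = {
    0x7F: "\uFFFD",    # DEL -> U+FFFD
    0x85: " ",         # NEL -> U+20
    0x2028: " ",       # LS  -> U+20
    0x2126: "\u03A9",  # U+2126 -> U+3A9
    0x212A: "K",       # U+212A -> U+4B
    0x212B: "\u00C5",  # U+212B -> U+C5
    0xFB01: "fi",      # U+FB01 -> U+66 U+69
    0xFB02: "fl",      # U+FB02 -> U+66 U+6C
    0xFEFF: "\u2060",  # BOM -> U+2060
}


def apply_replacements(text: str) -> str:
    """Apply the subset of table rows listed above; all other rows are ignored."""
    for pattern, replacement in SEQUENCE_RULES:
        text = pattern.sub(replacement, text)
    return text.translate(SCALAR_RULES)
```

Because the CR rows run first, the rows that pair U+C with U+D never need separate patterns here: by the time the form-feed rules run, every U+D has already become U+A.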

Conversion

From Unicode string to Basic Text string

To convert a Unicode string into a Basic Text string in a manner that always succeeds, discarding information not usually considered meaningful or valid in plain text:

Options

The following options may be enabled:

Name | Type | Default
NEL Compatibility | Boolean | false
LSPS Compatibility | Boolean | false

From Unicode string to Basic Text string, strict

To convert a Unicode string into a Basic Text string in a manner that discards information not usually considered meaningful and otherwise fails if the content is not valid Basic Text:

  • If the string starts with a Basic Text non-starter, error with "Basic Text string must not begin with Basic Text non-starter".
  • If the string ends with a Basic Text non-ender, error with "Basic Text string must not end with Basic Text non-ender".
  • Perform the Error actions from the Sequence Table.
  • When CRLF Compatibility is enabled, replace any U+A with U+D U+A.
  • For any scalar value with General_Category of Unassigned that isn't already preceded by U+34F, insert U+34F before it.
  • For any scalar value with General_Category of Unassigned that isn't already followed by U+34F, insert U+34F after it.
  • Perform the Stream-Safe Text Process (UAX15-D4).
  • Perform toNFC with the Normalization Process.
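
The U+34F and NFC steps above can be sketched with Python's standard `unicodedata` module. This is a partial sketch under stated assumptions: the function names are hypothetical, the Stream-Safe Text Process is omitted because the standard library does not provide it, and which scalar values count as Unassigned depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

CGJ = "\u034F"  # U+34F COMBINING GRAPHEME JOINER


def _is_unassigned(ch: str) -> bool:
    # General_Category "Cn" is Unassigned in the bundled Unicode data.
    # Noncharacters also report "Cn"; in the strict conversion they would
    # already have been rejected by the Sequence Table step.
    return unicodedata.category(ch) == "Cn"


def wrap_unassigned(text: str) -> str:
    """Insert U+34F before and after unassigned scalar values where missing."""
    # Pass 1: insert U+34F before any unassigned value not preceded by it.
    before = []
    prev = ""
    for ch in text:
        if _is_unassigned(ch) and prev != CGJ:
            before.append(CGJ)
        before.append(ch)
        prev = ch
    # Pass 2: insert U+34F after any unassigned value not followed by it.
    after = []
    for i, ch in enumerate(before):
        after.append(ch)
        at_end = i + 1 == len(before)
        if _is_unassigned(ch) and (at_end or before[i + 1] != CGJ):
            after.append(CGJ)
    return "".join(after)


def finish_strict_conversion(text: str) -> str:
    """Run the U+34F insertion and then NFC (Stream-Safe step not included)."""
    return unicodedata.normalize("NFC", wrap_unassigned(text))
```

Running the two passes separately mirrors the two list items, and it means two adjacent unassigned values end up sharing a single U+34F between them.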

Options

The following options may be enabled:

Name | Type | Default
CRLF Compatibility | Boolean | false

From Unicode stream to Basic Text stream

To convert a Unicode stream into a Basic Text stream in a manner that always succeeds, discarding information not usually considered meaningful or valid in plain text:

From Unicode stream to Basic Text stream, strict

To convert a Unicode stream into a Basic Text stream in a manner that discards information not usually considered meaningful and otherwise fails if the content is not valid Basic Text:

Options

The following options may be enabled:

Name | Type | Default
BOM Compatibility | Boolean | false