Introduction

Basic Text is a subset of Unicode intending to simplify plain text use cases, so that consumers don't need to worry about control codes, deprecated scalar values, newline conventions, or other complexities.

This book contains the current specification draft and supporting documentation. There is also a Github repo containing a prototype implementation.

Background

Plain text is an intuitive concept that plays an important role in computing.

Intuitively, plain text content shouldn't have side effects. It's just text. However, Unicode contains multiple sets of control codes which effectively form a bytecode language, with a variety of loosely-defined and often non-standardized side effects. Historically, text content and in-band control codes were often mixed together in the same encoding standards, and now, Unicode itself must maintain compatibility with those standards. And, many of them continue to be recognized, particularly in virtual terminals, such that one must be careful about even displaying text from untrusted sources, for example in CVE-2017-10906, CVE-2019-8325, and others.

Other than a few control codes for line endings and a few other things, most of these control codes are obsolete and almost never used for plain text use cases. For cases where we want to work with plain text, and be sure it matches our intuitive sense of what text is, it would be useful to have a checkable subset of Unicode, which excludes problematic control characters, and other things that are not necessary for modern practical plain text use cases.

And if we're defining a Unicode subset for plain text, we also have an opportunity to rationalize line endings so that users don't have to think about the old CRLF vs LF problem. And we can restrict scalar values that Unicode itself deprecates or discourages, but which Unicode itself can't drop because of its need for round-trip compatibility with other character sets, so that consumers that only need to work with Unicode have fewer things to think about.

There are existing standard subsets of Unicode doing similar things, such as PRECIS FreeformClass, printable files in POSIX, and others, however they either restrict scalar values that are frequently used in "plain text" content, or don't restrict deprecated scalar values.

Basic Text is a subset of Unicode aiming to make it as simple as possible (but no simpler) to work with plain text. It is not yet standardized anywhere, and may evolve, but it is usable for many purposes.

Development

Development of these pages, as well as a prototype implementation, is hosted in the Github repo.

Restricted Text

Basic Text can still be visually ambiguous. There are numerous ways that two different scalar value sequences can have identical or similar appearances, such as scalar values representing Cyrillic and Latin letters.

There are several techniques for mitigating visual ambiguities, but some of them are too restrictive for general-purpose plain text, and thus too restrictive for Basic Text.

So in addition to Basic Text, this book also describes a hypothetical format called Restricted Text, which collects such mitigation techniques, and could in theory be developed into an actual format in the future.