Lexical structure: move the description of CRLF normalization

We now say that CRLF normalization happens as a separate pass before tokenization.
2024-01-27 23:49:41 +00:00 · fa56fdba0e
parent a0b119535e
commit fa56fdba0e
3 changed files with 54 additions and 26 deletions
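
A minimal sketch of the pass the commit message describes, to make the ordering concrete: normalization runs over the decoded source text first, and only its output is handed to the tokenizer. The function name `normalize_crlf` is hypothetical and chosen for illustration; this is not the compiler's implementation, only an instance of the rule stated in the new "CRLF normalization" section.

```rust
/// Hypothetical helper (not part of any real API): replaces each
/// `U+000D` (CR) immediately followed by `U+000A` (LF) with a single LF.
/// Any other CR is left in place, as the new section specifies.
fn normalize_crlf(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            // Drop the CR; the LF is pushed on the next iteration.
            continue;
        }
        out.push(c);
    }
    out
}
```

Under this sketch, `normalize_crlf("a\r\nb")` yields `"a\nb"`, while `normalize_crlf("a\rb")` is returned unchanged, so a lone CR survives to tokenization, where the rules for comments and literals in the diff below reject it.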
--- a/src/comments.md
+++ b/src/comments.md
@@ -30,7 +30,7 @@
 > &nbsp;&nbsp; | INNER_BLOCK_DOC
 >
 > _IsolatedCR_ :\
-> &nbsp;&nbsp; _A `\r` not followed by a `\n`_
+> &nbsp;&nbsp; \\r

 ## Non-doc comments

@@ -53,8 +53,9 @@ that follows.  That is, they are equivalent to writing `#![doc="..."]` around
 the body of the comment. `//!` comments are usually used to document
 modules that occupy a source file.

-Isolated CRs (`\r`), i.e. not followed by LF (`\n`), are not allowed in doc
-comments.
+The character `U+000D` (CR) is not allowed in doc comments.
+
+> **Note**:  The sequence `U+000D` (CR) immediately followed by `U+000A` (LF) would have been previously transformed into a single `U+000A` (LF).

 ## Examples

--- a/src/input-format.md
+++ b/src/input-format.md
@@ -1,3 +1,22 @@
 # Input format

-Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
+This chapter describes how a source file is interpreted as a sequence of tokens.
+
+See [Crates and source files] for a description of how programs are organised into files.
+
+## Source encoding
+
+Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
+It is an error if the file is not valid UTF-8.
+
+## CRLF normalization
+
+Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+
+Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
+
+## Tokenization
+
+The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
+
+[Crates and source files]: crates-and-source-files.md
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].

 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.

+> **Note**:  Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
+
 #### ASCII escapes

 |   | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
 `U+0022` (double-quote) characters, with the exception of `U+0022` itself,
 which must be _escaped_ by a preceding `U+005C` character (`\`).

-Line-breaks are allowed in string literals.
-A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
-Both byte sequences are translated to `U+000A`.
-
+Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-
+The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.

 #### Character escapes

@@ -198,10 +197,10 @@ following forms:

 Raw string literals do not process any escapes. They start with the character
 `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
-`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
-of Unicode characters and is terminated only by another `U+0022` (double-quote)
-character, followed by the same number of `U+0023` (`#`) characters that preceded
-the opening `U+0022` (double-quote) character.
+`U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

 All Unicode characters contained in the raw string body represent themselves,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
 Alternatively, a byte string literal can be a _raw byte string literal_, defined
 below.

+Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in either byte or non-raw byte string
 literals. An escape starts with a `U+005C` (`\`) and continues with one of the
 following forms:
@@ -281,19 +285,19 @@ following forms:
 > &nbsp;&nbsp; `br` RAW_BYTE_STRING_CONTENT SUFFIX<sup>?</sup>
 >
 > RAW_BYTE_STRING_CONTENT :\
-> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII<sup>* (non-greedy)</sup> `"`\
+> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII_FOR_RAW<sup>* (non-greedy)</sup> `"`\
 > &nbsp;&nbsp; | `#` RAW_BYTE_STRING_CONTENT `#`
 >
-> ASCII :\
-> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F)_
+> ASCII_FOR_RAW :\
+> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_

 Raw byte string literals do not process any escapes. They start with the
 character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw string body_ can contain any sequence of ASCII characters and is terminated
-only by another `U+0022` (double-quote) character, followed by the same number of
-`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
-character. A raw byte string literal can not contain any non-ASCII byte.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
+A raw byte string literal can not contain any non-ASCII byte.

 All characters contained in the raw string body represent their ASCII encoding,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -340,6 +344,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
 literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
 permitted within a C string.

+Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in non-raw C string literals. An escape
 starts with a `U+005C` (`\`) and continues with one of the following forms:

@@ -382,11 +391,10 @@ c"\xC3\xA6";

 Raw C string literals do not process any escapes. They start with the
 character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw C string body_ can contain any sequence of Unicode characters (other than
-`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
-followed by the same number of `U+0023` (`#`) characters that preceded the
-opening `U+0022` (double-quote) character.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

 All characters contained in the raw C string body represent themselves in UTF-8
 encoding. The characters `U+0022` (double-quote) (except when followed by at