From fa56fdba0e9dba35eb29d11c95c7a009ed67cb35 Mon Sep 17 00:00:00 2001
From: Matthew Woodcraft <matthew@woodcraft.me.uk>
Date: Sat, 27 Jan 2024 23:49:41 +0000
Subject: [PATCH 1/4] Lexical structure: move the description of CRLF
 normalization

We now say that CRLF normalization happens as a separate pass before
tokenization.
---
 src/comments.md     |  7 +++---
 src/input-format.md | 21 +++++++++++++++++-
 src/tokens.md       | 52 ++++++++++++++++++++++++++-------------------
 3 files changed, 54 insertions(+), 26 deletions(-)
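
A minimal sketch of the CRLF normalization pass described in this patch, for illustration only. The function name `normalize_crlf` is hypothetical and is not taken from any compiler source; the sketch only assumes the rule stated in `src/input-format.md`: each CR immediately followed by LF becomes a single LF, and any other CR is left in place.

```rust
/// Replaces each `U+000D` (CR) immediately followed by `U+000A` (LF)
/// with a single `U+000A` (LF). Any other CR is left untouched.
/// (Hypothetical helper, written only to illustrate the rule.)
fn normalize_crlf(src: &str) -> String {
    // `str::replace` scans once, left to right, over non-overlapping
    // matches, so a CR that precedes a CRLF pair survives as-is.
    src.replace("\r\n", "\n")
}
```

Under this sketch, `normalize_crlf("a\r\nb")` yields `"a\nb"`, while the isolated CR in `"a\rb"` is preserved and is left for the tokenizer to accept or reject depending on context.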

diff --git a/src/comments.md b/src/comments.md
index bf1e7ca..795bf63 100644
--- a/src/comments.md
+++ b/src/comments.md
@@ -30,7 +30,7 @@
 > &nbsp;&nbsp; | INNER_BLOCK_DOC
 >
 > _IsolatedCR_ :\
-> &nbsp;&nbsp; _A `\r` not followed by a `\n`_
+> &nbsp;&nbsp; \\r
 
 ## Non-doc comments
 
@@ -53,8 +53,9 @@ that follows.  That is, they are equivalent to writing `#![doc="..."]` around
 the body of the comment. `//!` comments are usually used to document
 modules that occupy a source file.
 
-Isolated CRs (`\r`), i.e. not followed by LF (`\n`), are not allowed in doc
-comments.
+The character `U+000D` (CR) is not allowed in doc comments.
+
+> **Note**: The sequence `U+000D` (CR) immediately followed by `U+000A` (LF) would have been previously transformed into a single `U+000A` (LF).
 
 ## Examples
 
diff --git a/src/input-format.md b/src/input-format.md
index 678902c..4833165 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -1,3 +1,22 @@
 # Input format
 
-Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
+This chapter describes how a source file is interpreted as a sequence of tokens.
+
+See [Crates and source files] for a description of how programs are organised into files.
+
+## Source encoding
+
+Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
+It is an error if the file is not valid UTF-8.
+
+## CRLF normalization
+
+Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+
+Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
+
+## Tokenization
+
+The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
+
+[Crates and source files]: crates-and-source-files.md
diff --git a/src/tokens.md b/src/tokens.md
index 0911296..9507ef7 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].
 
 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.
 
+> **Note**: Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
+
 #### ASCII escapes
 
 |   | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
 `U+0022` (double-quote) characters, with the exception of `U+0022` itself,
 which must be _escaped_ by a preceding `U+005C` character (`\`).
 
-Line-breaks are allowed in string literals.
-A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
-Both byte sequences are translated to `U+000A`.
-
+Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-
+The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
 
 #### Character escapes
 
@@ -198,10 +197,10 @@ following forms:
 
 Raw string literals do not process any escapes. They start with the character
 `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
-`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
-of Unicode characters and is terminated only by another `U+0022` (double-quote)
-character, followed by the same number of `U+0023` (`#`) characters that preceded
-the opening `U+0022` (double-quote) character.
+`U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
 
 All Unicode characters contained in the raw string body represent themselves,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
 Alternatively, a byte string literal can be a _raw byte string literal_, defined
 below.
 
+Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in either byte or non-raw byte string
 literals. An escape starts with a `U+005C` (`\`) and continues with one of the
 following forms:
@@ -281,19 +285,19 @@ following forms:
 > &nbsp;&nbsp; `br` RAW_BYTE_STRING_CONTENT SUFFIX<sup>?</sup>
 >
 > RAW_BYTE_STRING_CONTENT :\
-> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII<sup>* (non-greedy)</sup> `"`\
+> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII_FOR_RAW<sup>* (non-greedy)</sup> `"`\
 > &nbsp;&nbsp; | `#` RAW_BYTE_STRING_CONTENT `#`
 >
-> ASCII :\
-> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F)_
+> ASCII_FOR_RAW :\
+> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_
 
 Raw byte string literals do not process any escapes. They start with the
 character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw string body_ can contain any sequence of ASCII characters and is terminated
-only by another `U+0022` (double-quote) character, followed by the same number of
-`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
-character. A raw byte string literal can not contain any non-ASCII byte.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
+A raw byte string literal cannot contain any non-ASCII byte.
 
 All characters contained in the raw string body represent their ASCII encoding,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -340,6 +344,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
 literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
 permitted within a C string.
 
+Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in non-raw C string literals. An escape
 starts with a `U+005C` (`\`) and continues with one of the following forms:
 
@@ -382,11 +391,10 @@ c"\xC3\xA6";
 
 Raw C string literals do not process any escapes. They start with the
 character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw C string body_ can contain any sequence of Unicode characters (other than
-`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
-followed by the same number of `U+0023` (`#`) characters that preceded the
-opening `U+0022` (double-quote) character.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
 
 All characters contained in the raw C string body represent themselves in UTF-8
 encoding. The characters `U+0022` (double-quote) (except when followed by at

From 5f512692d327fed3aeb31e16412d42c42a21101a Mon Sep 17 00:00:00 2001
From: Matthew Woodcraft <matthew@woodcraft.me.uk>
Date: Sun, 28 Jan 2024 17:22:53 +0000
Subject: [PATCH 2/4] lexical structure: move the description of BOM-removal

This takes place at the same time as CRLF normalization.

It's better not to list it in a Lexer block, as it isn't a token that can be
fed to a macro.
---
 src/crates-and-source-files.md | 13 ++-----------
 src/input-format.md            |  5 +++++
 2 files changed, 7 insertions(+), 11 deletions(-)
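
A minimal sketch of the byte order mark removal step described in this patch, for illustration only. The name `strip_bom` is hypothetical; the sketch assumes only the rule added to `src/input-format.md`: a `U+FEFF` is removed when it is the first character of the sequence, and nowhere else.

```rust
/// Removes a leading `U+FEFF` (BYTE ORDER MARK), if present.
/// A `U+FEFF` anywhere other than the first position is kept.
/// (Hypothetical helper, written only to illustrate the rule.)
fn strip_bom(src: &str) -> &str {
    src.strip_prefix('\u{FEFF}').unwrap_or(src)
}
```

Because only the first character is examined, a source that starts with two byte order marks would still begin with one after this step.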

diff --git a/src/crates-and-source-files.md b/src/crates-and-source-files.md
index 8d54c3f..5b87519 100644
--- a/src/crates-and-source-files.md
+++ b/src/crates-and-source-files.md
@@ -2,13 +2,11 @@
 
 > **<sup>Syntax</sup>**\
 > _Crate_ :\
-> &nbsp;&nbsp; UTF8BOM<sup>?</sup>\
 > &nbsp;&nbsp; SHEBANG<sup>?</sup>\
 > &nbsp;&nbsp; [_InnerAttribute_]<sup>\*</sup>\
 > &nbsp;&nbsp; [_Item_]<sup>\*</sup>
 
 > **<sup>Lexer</sup>**\
-> UTF8BOM : `\uFEFF`\
 > SHEBANG : `#!` \~`\n`<sup>\+</sup>[†](#shebang)
 
 
@@ -65,19 +63,13 @@ apply to the crate as a whole.
 #![warn(non_camel_case_types)]
 ```
 
-## Byte order mark
-
-The optional [_UTF8 byte order mark_] (UTF8BOM production) indicates that the
-file is encoded in UTF8. It can only occur at the beginning of the file and
-is ignored by the compiler.
-
 ## Shebang
 
 A source file can have a [_shebang_] (SHEBANG production), which indicates
 to the operating system what program to use to execute this file. It serves
 essentially to treat the source file as an executable script. The shebang
-can only occur at the beginning of the file (but after the optional
-_UTF8BOM_). It is ignored by the compiler. For example:
+can only occur at the beginning of the file.
+It is ignored by the compiler. For example:
 
 <!-- ignore: tests don't like shebang -->
 ```rust,ignore
@@ -162,7 +154,6 @@ or `_` (U+005F) characters.
 [_Item_]: items.md
 [_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
 [_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
-[_utf8 byte order mark_]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [`ExitCode`]: ../std/process/struct.ExitCode.html
 [`Infallible`]: ../std/convert/enum.Infallible.html
 [`Termination`]: ../std/process/trait.Termination.html
diff --git a/src/input-format.md b/src/input-format.md
index 4833165..df41557 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -9,6 +9,10 @@ See [Crates and source files] for a description of how programs are organised in
 Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
 It is an error if the file is not valid UTF-8.
 
+## Byte order mark removal
+
+If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
+
 ## CRLF normalization
 
 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
@@ -19,4 +23,5 @@ Other occurrences of the character `U+000D` (CR) are left in place (they are tre
 
 The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
 
+[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [Crates and source files]: crates-and-source-files.md

From e364b6c6f91a7166ab4ff6a7814bf9a4922c2358 Mon Sep 17 00:00:00 2001
From: Matthew Woodcraft <matthew@woodcraft.me.uk>
Date: Sun, 28 Jan 2024 18:10:27 +0000
Subject: [PATCH 3/4] lexical structure: move the description of
 shebang-removal

This takes place after CRLF normalization.

It's better not to list the shebang in a Lexer block, as it isn't a token that
can be fed to a macro.
---
 src/crates-and-source-files.md | 33 +++------------------------------
 src/input-format.md            | 22 ++++++++++++++++++++++
 2 files changed, 25 insertions(+), 30 deletions(-)
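
A simplified sketch of the shebang removal step described in this patch, for illustration only. The name `strip_shebang` is hypothetical. Two assumptions go beyond the text: the attribute check skips only whitespace (using `str::trim_start`, which approximates the Reference's whitespace set) and does not skip comments, and a shebang line with no terminating LF is removed in its entirety.

```rust
/// Removes a leading shebang line, unless the `#!` begins an inner attribute.
/// (Hypothetical helper; simplified as noted in the comments below.)
fn strip_shebang(src: &str) -> &str {
    let rest = match src.strip_prefix("#!") {
        Some(rest) => rest,
        None => return src, // does not begin with `#!`
    };
    // Exception: `#!` followed by `[` is the start of an inner attribute.
    // Simplification: only whitespace is skipped here, not comments.
    if rest.trim_start().starts_with('[') {
        return src;
    }
    match src.find('\n') {
        // Drop everything up to and including the first LF.
        Some(idx) => &src[idx + 1..],
        // Assumption: with no LF at all, the whole input is the shebang line.
        None => "",
    }
}
```

Under this sketch, a file beginning with `#!/usr/bin/env rustx` loses its first line, while a file beginning with `#![allow(unused)]` is returned unchanged.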

diff --git a/src/crates-and-source-files.md b/src/crates-and-source-files.md
index 5b87519..2373a79 100644
--- a/src/crates-and-source-files.md
+++ b/src/crates-and-source-files.md
@@ -2,14 +2,9 @@
 
 > **<sup>Syntax</sup>**\
 > _Crate_ :\
-> &nbsp;&nbsp; SHEBANG<sup>?</sup>\
 > &nbsp;&nbsp; [_InnerAttribute_]<sup>\*</sup>\
 > &nbsp;&nbsp; [_Item_]<sup>\*</sup>
 
-> **<sup>Lexer</sup>**\
-> SHEBANG : `#!` \~`\n`<sup>\+</sup>[†](#shebang)
-
-
 > Note: Although Rust, like any other language, can be implemented by an
 > interpreter as well as a compiler, the only existing implementation is a
 > compiler, and the language has always been designed to be compiled. For these
@@ -51,6 +46,8 @@ that apply to the containing module, most of which influence the behavior of
 the compiler. The anonymous crate module can have additional attributes that
 apply to the crate as a whole.
 
+> **Note**: The file's contents may be preceded by a [shebang].
+
 ```rust
 // Specify the crate name.
 #![crate_name = "projx"]
@@ -63,28 +60,6 @@ apply to the crate as a whole.
 #![warn(non_camel_case_types)]
 ```
 
-## Shebang
-
-A source file can have a [_shebang_] (SHEBANG production), which indicates
-to the operating system what program to use to execute this file. It serves
-essentially to treat the source file as an executable script. The shebang
-can only occur at the beginning of the file.
-It is ignored by the compiler. For example:
-
-<!-- ignore: tests don't like shebang -->
-```rust,ignore
-#!/usr/bin/env rustx
-
-fn main() {
-    println!("Hello!");
-}
-```
-
-A restriction is imposed on the shebang syntax to avoid confusion with an
-[attribute]. The `#!` characters must not be followed by a `[` token, ignoring
-intervening [comments] or [whitespace]. If this restriction fails, then it is
-not treated as a shebang, but instead as the start of an attribute.
-
 ## Preludes and `no_std`
 
 This section has been moved to the [Preludes chapter](names/preludes.md).
@@ -153,19 +128,17 @@ or `_` (U+005F) characters.
 [_InnerAttribute_]: attributes.md
 [_Item_]: items.md
 [_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
-[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
 [`ExitCode`]: ../std/process/struct.ExitCode.html
 [`Infallible`]: ../std/convert/enum.Infallible.html
 [`Termination`]: ../std/process/trait.Termination.html
 [attribute]: attributes.md
 [attributes]: attributes.md
-[comments]: comments.md
 [function]: items/functions.md
 [module]: items/modules.md
 [module path]: paths.md
+[shebang]: input-format.md#shebang-removal
 [trait or lifetime bounds]: trait-bounds.md
 [where clauses]: items/generics.md#where-clauses
-[whitespace]: whitespace.md
 
 <script>
 (function() {
diff --git a/src/input-format.md b/src/input-format.md
index df41557..a9f2c90 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -19,9 +19,31 @@ Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is r
 
 Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
 
+## Shebang removal
+
+If the remaining sequence begins with the characters `#!`, the characters up to and including the first `U+000A` (LF) are removed from the sequence.
+
+For example, the first line of the following file would be ignored:
+
+<!-- ignore: tests don't like shebang -->
+```rust,ignore
+#!/usr/bin/env rustx
+
+fn main() {
+    println!("Hello!");
+}
+```
+
+As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed.
+This prevents an [inner attribute] at the start of a source file being removed.
+
 ## Tokenization
 
 The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
 
+[inner attribute]: attributes.md
 [BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
+[comments]: comments.md
 [Crates and source files]: crates-and-source-files.md
+[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
+[whitespace]: whitespace.md

From 8ba3c4911446cb390eb0602862caf53fac6da086 Mon Sep 17 00:00:00 2001
From: Matthew Woodcraft <matthew@woodcraft.me.uk>
Date: Sun, 28 Jan 2024 18:30:26 +0000
Subject: [PATCH 4/4] Input format: note about include! macros

---
 src/input-format.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/input-format.md b/src/input-format.md
index a9f2c90..946e678 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -37,10 +37,16 @@ fn main() {
 As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed.
 This prevents an [inner attribute] at the start of a source file being removed.
 
+> **Note**: The standard library [`include!`] macro applies byte order mark removal, CRLF normalization, and shebang removal to the file it reads. The [`include_str!`] and [`include_bytes!`] macros do not.
+
 ## Tokenization
 
 The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
 
+
+[`include!`]: ../std/macro.include.html
+[`include_bytes!`]: ../std/macro.include_bytes.html
+[`include_str!`]: ../std/macro.include_str.html
 [inner attribute]: attributes.md
 [BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [comments]: comments.md