lexical structure: move the description of BOM-removal

BOM removal takes place at the same stage as CRLF normalisation, before the input is converted into tokens. It's better not to list the BOM in a Lexer block, as it isn't a token that can be fed to a macro.
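
As a rough illustration of the ordering described here, the two steps can be sketched as below. This is only a sketch, not the compiler's implementation: the function names `remove_bom`, `normalize_crlf`, and `preprocess` are hypothetical, and it assumes the input has already been validated as UTF-8.

```rust
/// Hypothetical helper: removes a single leading U+FEFF (BYTE ORDER MARK),
/// if one is present. A BOM anywhere else in the input is left untouched.
fn remove_bom(src: &str) -> &str {
    src.strip_prefix('\u{FEFF}').unwrap_or(src)
}

/// Hypothetical helper: replaces each CR immediately followed by LF with a
/// single LF. Other occurrences of CR are left in place.
fn normalize_crlf(src: &str) -> String {
    src.replace("\r\n", "\n")
}

/// Hypothetical driver: both steps run on the character sequence before
/// any tokens are produced, so neither the BOM nor a CRLF pair is ever
/// seen by the lexer.
fn preprocess(src: &str) -> String {
    normalize_crlf(remove_bom(src))
}
```

Because the BOM is dropped before tokenisation in this sketch, a shebang that follows it still sits at the very start of the character sequence, which is why the Crate grammar no longer needs to mention a BOM.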
2024-01-28 17:22:53 +00:00 · 5f512692d3
parent fa56fdba0e
commit 5f512692d3
2 changed files with 7 additions and 11 deletions
--- a/src/crates-and-source-files.md
+++ b/src/crates-and-source-files.md
@@ -2,13 +2,11 @@

 > **<sup>Syntax</sup>**\
 > _Crate_ :\
-> &nbsp;&nbsp; UTF8BOM<sup>?</sup>\
 > &nbsp;&nbsp; SHEBANG<sup>?</sup>\
 > &nbsp;&nbsp; [_InnerAttribute_]<sup>\*</sup>\
 > &nbsp;&nbsp; [_Item_]<sup>\*</sup>

 > **<sup>Lexer</sup>**\
-> UTF8BOM : `\uFEFF`\
 > SHEBANG : `#!` \~`\n`<sup>\+</sup>[†](#shebang)


@@ -65,19 +63,13 @@ apply to the crate as a whole.
 #![warn(non_camel_case_types)]
 ```

-## Byte order mark
-
-The optional [_UTF8 byte order mark_] (UTF8BOM production) indicates that the
-file is encoded in UTF8. It can only occur at the beginning of the file and
-is ignored by the compiler.
-
 ## Shebang

 A source file can have a [_shebang_] (SHEBANG production), which indicates
 to the operating system what program to use to execute this file. It serves
 essentially to treat the source file as an executable script. The shebang
-can only occur at the beginning of the file (but after the optional
-_UTF8BOM_). It is ignored by the compiler. For example:
+can only occur at the beginning of the file.
+It is ignored by the compiler. For example:

 <!-- ignore: tests don't like shebang -->
 ```rust,ignore
@@ -162,7 +154,6 @@ or `_` (U+005F) characters.
 [_Item_]: items.md
 [_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
 [_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
-[_utf8 byte order mark_]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [`ExitCode`]: ../std/process/struct.ExitCode.html
 [`Infallible`]: ../std/convert/enum.Infallible.html
 [`Termination`]: ../std/process/trait.Termination.html
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -9,6 +9,10 @@ See [Crates and source files] for a description of how programs are organised in
 Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
 It is an error if the file is not valid UTF-8.

+## Byte order mark removal
+
+If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
+
 ## CRLF normalization

 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
@@ -19,4 +23,5 @@ Other occurrences of the character `U+000D` (CR) are left in place (they are tre

 The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.

+[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [Crates and source files]: crates-and-source-files.md