From fa56fdba0e9dba35eb29d11c95c7a009ed67cb35 Mon Sep 17 00:00:00 2001
From: Matthew Woodcraft
Date: Sat, 27 Jan 2024 23:49:41 +0000
Subject: [PATCH 1/4] Lexical structure: move the description of CRLF normalization

We now say that CRLF normalization happens as a separate pass before tokenization.
---
 src/comments.md     |  7 +++---
 src/input-format.md | 21 +++++++++++++++++-
 src/tokens.md       | 52 ++++++++++++++++++++++++++-------------------
 3 files changed, 54 insertions(+), 26 deletions(-)

diff --git a/src/comments.md b/src/comments.md
index bf1e7ca..795bf63 100644
--- a/src/comments.md
+++ b/src/comments.md
@@ -30,7 +30,7 @@
 >    | INNER_BLOCK_DOC
 >
 > _IsolatedCR_ :\
->    _A `\r` not followed by a `\n`_
+>    \\r

 ## Non-doc comments

@@ -53,8 +53,9 @@ that follows. That is, they are equivalent to writing `#![doc="..."]` around
 the body of the comment. `//!` comments are usually used to document
 modules that occupy a source file.

-Isolated CRs (`\r`), i.e. not followed by LF (`\n`), are not allowed in doc
-comments.
+The character `U+000D` (CR) is not allowed in doc comments.
+
+> **Note**: The sequence `U+000D` (CR) immediately followed by `U+000A` (LF) would have been previously transformed into a single `U+000A` (LF).

 ## Examples

diff --git a/src/input-format.md b/src/input-format.md
index 678902c..4833165 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -1,3 +1,22 @@
 # Input format

-Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
+This chapter describes how a source file is interpreted as a sequence of tokens.
+
+See [Crates and source files] for a description of how programs are organised into files.
+
+## Source encoding
+
+Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
+It is an error if the file is not valid UTF-8.
+
+## CRLF normalization
+
+Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+
+Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
+
+## Tokenization
+
+The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
+
+[Crates and source files]: crates-and-source-files.md
diff --git a/src/tokens.md b/src/tokens.md
index 0911296..9507ef7 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].

 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.

+> **Note**: Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
+
 #### ASCII escapes

 | | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within
 two `U+0022` (double-quote) characters, with the exception of `U+0022` itself,
 which must be _escaped_ by a preceding `U+005C` character (`\`).

-Line-breaks are allowed in string literals.
-A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
-Both byte sequences are translated to `U+000A`.
-
+Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-
+The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.

 #### Character escapes

@@ -198,10 +197,10 @@ following forms:

 Raw string literals do not process any escapes. They start with the character
 `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
-`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
-of Unicode characters and is terminated only by another `U+0022` (double-quote)
-character, followed by the same number of `U+0023` (`#`) characters that preceded
-the opening `U+0022` (double-quote) character.
+`U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

 All Unicode characters contained in the raw string body represent themselves,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
 Alternatively, a byte string literal can be a _raw byte string literal_, defined
 below.

+Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in either byte or non-raw byte string
 literals. An escape starts with a `U+005C` (`\`) and continues with one of the
 following forms:
@@ -281,19 +285,19 @@ following forms:
 >    `br` RAW_BYTE_STRING_CONTENT SUFFIX?
 >
 > RAW_BYTE_STRING_CONTENT :\
->       `"` ASCII* (non-greedy) `"`\
+>       `"` ASCII_FOR_RAW* (non-greedy) `"`\
 >    | `#` RAW_BYTE_STRING_CONTENT `#`
 >
-> ASCII :\
->    _any ASCII (i.e. 0x00 to 0x7F)_
+> ASCII_FOR_RAW :\
+>    _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_

 Raw byte string literals do not process any escapes. They start with the
 character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw string body_ can contain any sequence of ASCII characters and is terminated
-only by another `U+0022` (double-quote) character, followed by the same number of
-`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
-character. A raw byte string literal can not contain any non-ASCII byte.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
+A raw byte string literal can not contain any non-ASCII byte.

 All characters contained in the raw string body represent their ASCII encoding,
 the characters `U+0022` (double-quote) (except when followed by at least as
@@ -340,6 +344,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
 literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
 permitted within a C string.

+Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
+See [String continuation escapes] for details.
+The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
+
 Some additional _escapes_ are available in non-raw C string literals. An escape
 starts with a `U+005C` (`\`) and continues with one of the following forms:

@@ -382,11 +391,10 @@ c"\xC3\xA6";

 Raw C string literals do not process any escapes. They start with the
 character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
-of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
-_raw C string body_ can contain any sequence of Unicode characters (other than
-`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
-followed by the same number of `U+0023` (`#`) characters that preceded the
-opening `U+0022` (double-quote) character.
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.
+
+The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
+It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

 All characters contained in the raw C string body represent themselves in UTF-8
 encoding. The characters `U+0022` (double-quote) (except when followed by at

From 5f512692d327fed3aeb31e16412d42c42a21101a Mon Sep 17 00:00:00 2001
From: Matthew Woodcraft
Date: Sun, 28 Jan 2024 17:22:53 +0000
Subject: [PATCH 2/4] lexical structure: move the description of BOM-removal

This takes place at the same time as CRLF normalisation. It's better not to list it in a Lexer block, as it isn't a token that can be fed to a macro.
---
 src/crates-and-source-files.md | 13 ++-----------
 src/input-format.md            |  5 +++++
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/src/crates-and-source-files.md b/src/crates-and-source-files.md
index 8d54c3f..5b87519 100644
--- a/src/crates-and-source-files.md
+++ b/src/crates-and-source-files.md
@@ -2,13 +2,11 @@

 > **Syntax**\
 > _Crate_ :\
->    UTF8BOM?\
 >    SHEBANG?\
 >    [_InnerAttribute_]\*\
 >    [_Item_]\*

 > **Lexer**\
-> UTF8BOM : `\uFEFF`\
 > SHEBANG : `#!` \~`\n`\+[†](#shebang)


@@ -65,19 +63,13 @@ apply to the crate as a whole.
 #![warn(non_camel_case_types)]
 ```

-## Byte order mark
-
-The optional [_UTF8 byte order mark_] (UTF8BOM production) indicates that the
-file is encoded in UTF8. It can only occur at the beginning of the file and
-is ignored by the compiler.
-
 ## Shebang

 A source file can have a [_shebang_] (SHEBANG production), which indicates
 to the operating system what program to use to execute this file. It serves
 essentially to treat the source file as an executable script. The shebang
-can only occur at the beginning of the file (but after the optional
-_UTF8BOM_). It is ignored by the compiler. For example:
+can only occur at the beginning of the file.
+It is ignored by the compiler. For example:


 ```rust,ignore
@@ -162,7 +154,6 @@ or `_` (U+005F) characters.
 [_Item_]: items.md
 [_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
 [_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
-[_utf8 byte order mark_]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [`ExitCode`]: ../std/process/struct.ExitCode.html
 [`Infallible`]: ../std/convert/enum.Infallible.html
 [`Termination`]: ../std/process/trait.Termination.html
diff --git a/src/input-format.md b/src/input-format.md
index 4833165..df41557 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -9,6 +9,10 @@ See [Crates and source files] for a description of how programs are organised in
 Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
 It is an error if the file is not valid UTF-8.

+## Byte order mark removal
+
+If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
+
 ## CRLF normalization

 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
@@ -19,4 +23,5 @@ Other occurrences of the character `U+000D` (CR) are left in place (they are tre

 The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.

+[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [Crates and source files]: crates-and-source-files.md

From e364b6c6f91a7166ab4ff6a7814bf9a4922c2358 Mon Sep 17 00:00:00 2001
From: Matthew Woodcraft
Date: Sun, 28 Jan 2024 18:10:27 +0000
Subject: [PATCH 3/4] lexical structure: move the description of shebang-removal

This takes place after CRLF normalization. It's better not to list the shebang in a Lexer block, as it isn't a token that can be fed to a macro.
---
 src/crates-and-source-files.md | 33 +++------------------------------
 src/input-format.md            | 22 ++++++++++++++++++++++
 2 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/src/crates-and-source-files.md b/src/crates-and-source-files.md
index 5b87519..2373a79 100644
--- a/src/crates-and-source-files.md
+++ b/src/crates-and-source-files.md
@@ -2,14 +2,9 @@

 > **Syntax**\
 > _Crate_ :\
->    SHEBANG?\
 >    [_InnerAttribute_]\*\
 >    [_Item_]\*

-> **Lexer**\
-> SHEBANG : `#!` \~`\n`\+[†](#shebang)
-
-
 > Note: Although Rust, like any other language, can be implemented by an
 > interpreter as well as a compiler, the only existing implementation is a
 > compiler, and the language has always been designed to be compiled. For these
@@ -51,6 +46,8 @@ that apply to the containing module, most of which influence the behavior of
 the compiler. The anonymous crate module can have additional attributes that
 apply to the crate as a whole.

+> **Note**: The file's contents may be preceded by a [shebang].
+
 ```rust
 // Specify the crate name.
 #![crate_name = "projx"]
@@ -63,28 +60,6 @@ apply to the crate as a whole.
 #![warn(non_camel_case_types)]
 ```

-## Shebang
-
-A source file can have a [_shebang_] (SHEBANG production), which indicates
-to the operating system what program to use to execute this file. It serves
-essentially to treat the source file as an executable script. The shebang
-can only occur at the beginning of the file.
-It is ignored by the compiler. For example:
-
-
-```rust,ignore
-#!/usr/bin/env rustx
-
-fn main() {
-    println!("Hello!");
-}
-```
-
-A restriction is imposed on the shebang syntax to avoid confusion with an
-[attribute]. The `#!` characters must not be followed by a `[` token, ignoring
-intervening [comments] or [whitespace]. If this restriction fails, then it is
-not treated as a shebang, but instead as the start of an attribute.
-
 ## Preludes and `no_std`

 This section has been moved to the [Preludes chapter](names/preludes.md).
@@ -153,19 +128,17 @@ or `_` (U+005F) characters.
 [_InnerAttribute_]: attributes.md
 [_Item_]: items.md
 [_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
-[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
 [`ExitCode`]: ../std/process/struct.ExitCode.html
 [`Infallible`]: ../std/convert/enum.Infallible.html
 [`Termination`]: ../std/process/trait.Termination.html
 [attribute]: attributes.md
 [attributes]: attributes.md
-[comments]: comments.md
 [function]: items/functions.md
 [module]: items/modules.md
 [module path]: paths.md
+[shebang]: input-format.md#shebang-removal
 [trait or lifetime bounds]: trait-bounds.md
 [where clauses]: items/generics.md#where-clauses
-[whitespace]: whitespace.md