mirror of https://github.com/rust-lang/reference
Merge pull request #1459 from mattheww/2024-01_input_format
Input format
commit 5afb503a4c

@@ -30,7 +30,7 @@
> | INNER_BLOCK_DOC
>
> _IsolatedCR_ :\
> _A `\r` not followed by a `\n`_
> \\r

## Non-doc comments

@@ -53,8 +53,9 @@ that follows. That is, they are equivalent to writing `#![doc="..."]` around
the body of the comment. `//!` comments are usually used to document
modules that occupy a source file.

Isolated CRs (`\r`), i.e. not followed by LF (`\n`), are not allowed in doc
comments.
The character `U+000D` (CR) is not allowed in doc comments.

> **Note**: The sequence `U+000D` (CR) immediately followed by `U+000A` (LF) would have been previously transformed into a single `U+000A` (LF).

## Examples

@@ -2,16 +2,9 @@

> **<sup>Syntax</sup>**\
> _Crate_ :\
> UTF8BOM<sup>?</sup>\
> SHEBANG<sup>?</sup>\
> [_InnerAttribute_]<sup>\*</sup>\
> [_Item_]<sup>\*</sup>

> **<sup>Lexer</sup>**\
> UTF8BOM : `\uFEFF`\
> SHEBANG : `#!` \~`\n`<sup>\+</sup>[†](#shebang)

> Note: Although Rust, like any other language, can be implemented by an
> interpreter as well as a compiler, the only existing implementation is a
> compiler, and the language has always been designed to be compiled. For these
@@ -53,6 +46,8 @@ that apply to the containing module, most of which influence the behavior of
the compiler. The anonymous crate module can have additional attributes that
apply to the crate as a whole.

> **Note**: The file's contents may be preceded by a [shebang].

```rust
// Specify the crate name.
#![crate_name = "projx"]
@@ -65,34 +60,6 @@ apply to the crate as a whole.
#![warn(non_camel_case_types)]
```

## Byte order mark

The optional [_UTF8 byte order mark_] (UTF8BOM production) indicates that the
file is encoded in UTF8. It can only occur at the beginning of the file and
is ignored by the compiler.

## Shebang

A source file can have a [_shebang_] (SHEBANG production), which indicates
to the operating system what program to use to execute this file. It serves
essentially to treat the source file as an executable script. The shebang
can only occur at the beginning of the file (but after the optional
_UTF8BOM_). It is ignored by the compiler. For example:

<!-- ignore: tests don't like shebang -->
```rust,ignore
#!/usr/bin/env rustx

fn main() {
    println!("Hello!");
}
```

A restriction is imposed on the shebang syntax to avoid confusion with an
[attribute]. The `#!` characters must not be followed by a `[` token, ignoring
intervening [comments] or [whitespace]. If this restriction fails, then it is
not treated as a shebang, but instead as the start of an attribute.

## Preludes and `no_std`

This section has been moved to the [Preludes chapter](names/preludes.md).
@@ -161,20 +128,17 @@ or `_` (U+005F) characters.
[_InnerAttribute_]: attributes.md
[_Item_]: items.md
[_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
[_utf8 byte order mark_]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[`ExitCode`]: ../std/process/struct.ExitCode.html
[`Infallible`]: ../std/convert/enum.Infallible.html
[`Termination`]: ../std/process/trait.Termination.html
[attribute]: attributes.md
[attributes]: attributes.md
[comments]: comments.md
[function]: items/functions.md
[module]: items/modules.md
[module path]: paths.md
[shebang]: input-format.md#shebang-removal
[trait or lifetime bounds]: trait-bounds.md
[where clauses]: items/generics.md#where-clauses
[whitespace]: whitespace.md

<script>
(function() {

@@ -1,3 +1,55 @@
# Input format

Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
This chapter describes how a source file is interpreted as a sequence of tokens.

See [Crates and source files] for a description of how programs are organised into files.

## Source encoding

Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
It is an error if the file is not valid UTF-8.

## Byte order mark removal

If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
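
The source encoding and byte order mark rules can be sketched roughly as follows. This is an illustration only: `decode_source` is a hypothetical helper name, not a rustc or standard library API, and a real compiler reports a diagnostic rather than returning a `Utf8Error`.

```rust
/// Hypothetical helper (not a rustc or std API): decodes a source file as
/// UTF-8 and removes a single leading U+FEFF, if present.
fn decode_source(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // It is an error if the file is not valid UTF-8.
    let text = std::str::from_utf8(bytes)?;
    // Only a BOM in the first position is removed; any later U+FEFF is kept.
    Ok(text.strip_prefix('\u{FEFF}').unwrap_or(text))
}
```

Returning a borrowed `&str` reflects that this step only drops a prefix; nothing else in the input is changed.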

## CRLF normalization

Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).

Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
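
As a rough illustration of the rule above, the following sketch normalizes CRLF pairs and leaves every other CR untouched. The name `normalize_crlf` is hypothetical and is not how rustc exposes this step.

```rust
/// Hypothetical helper: replaces each CR LF pair with a single LF.
/// A CR that is not immediately followed by LF is left in place.
fn normalize_crlf(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            // Drop the CR; the LF is pushed on the next iteration.
            continue;
        }
        out.push(c);
    }
    out
}
```

Note that in the input `CR CR LF` only the second CR is part of a pair, so the result is `CR LF`.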

## Shebang removal

If the remaining sequence begins with the characters `#!`, the characters up to and including the first `U+000A` (LF) are removed from the sequence.

For example, the first line of the following file would be ignored:

<!-- ignore: tests don't like shebang -->
```rust,ignore
#!/usr/bin/env rustx

fn main() {
    println!("Hello!");
}
```

As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed.
This prevents an [inner attribute] at the start of a source file being removed.
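
The shebang rule and its exception can be sketched as below. This is a simplified illustration under stated assumptions: `strip_shebang` is a hypothetical helper name, it skips only whitespace (approximated with `char::is_whitespace`) and does not skip comments before looking for `[`, and it assumes that a shebang line with no LF removes the whole input.

```rust
/// Hypothetical helper: returns the input with a leading shebang line removed.
/// Simplification: only whitespace is skipped when looking for `[`;
/// intervening comments are NOT handled in this sketch.
fn strip_shebang(src: &str) -> &str {
    let Some(tail) = src.strip_prefix("#!") else {
        return src; // the input does not start with `#!`
    };
    // `#!` followed by `[` starts an inner attribute: nothing is removed.
    if tail.trim_start().starts_with('[') {
        return src;
    }
    // Remove everything up to and including the first LF.
    // Assumption: with no LF at all, the whole input is removed.
    match src.find('\n') {
        Some(i) => &src[i + 1..],
        None => "",
    }
}
```

Checking for `[` before removing anything is what keeps a file that begins with `#![allow(unused)]` intact.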

> **Note**: The standard library [`include!`] macro applies byte order mark removal, CRLF normalization, and shebang removal to the file it reads. The [`include_str!`] and [`include_bytes!`] macros do not.

## Tokenization

The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.

[`include!`]: ../std/macro.include.md
[`include_bytes!`]: ../std/macro.include_bytes.md
[`include_str!`]: ../std/macro.include_str.md
[inner attribute]: attributes.md
[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[comments]: comments.md
[Crates and source files]: crates-and-source-files.md
[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
[whitespace]: whitespace.md

@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].

[^nsets]: The number of `#`s on each side of the same literal must be equivalent.

> **Note**: Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).

#### ASCII escapes

| | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
`U+0022` (double-quote) characters, with the exception of `U+0022` itself,
which must be _escaped_ by a preceding `U+005C` character (`\`).

Line-breaks are allowed in string literals.
A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
Both byte sequences are translated to `U+000A`.

Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.

The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
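
A small example of the two line-break behaviours described above. The function name `string_line_breaks` is only for illustration.

```rust
/// Illustrative only: returns a literal containing a line break and a
/// literal that uses a string continuation escape.
fn string_line_breaks() -> (&'static str, &'static str) {
    // A line break inside the quotes appears in the value as U+000A.
    let with_break = "first
second";
    // Backslash immediately before the line break: the break and the
    // whitespace that follows it do not appear in the value.
    let continued = "first \
                     second";
    (with_break, continued)
}
```

Here `with_break` is `"first\nsecond"` and `continued` is `"first second"`.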

#### Character escapes

@@ -198,10 +197,10 @@ following forms:

Raw string literals do not process any escapes. They start with the character
`U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
of Unicode characters and is terminated only by another `U+0022` (double-quote)
character, followed by the same number of `U+0023` (`#`) characters that preceded
the opening `U+0022` (double-quote) character.
`U+0022` (double-quote) character.

The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
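
The termination rule can be illustrated with a few raw string literals. The function name `raw_string_examples` is hypothetical and exists only to group the examples.

```rust
/// Illustrative only: three raw string literals with 0, 1 and 2 `#`s.
fn raw_string_examples() -> [&'static str; 3] {
    [
        // No escapes are processed: the backslashes are ordinary characters.
        r"C:\temp\new",
        // With one `#`, a bare `"` in the body does not end the literal.
        r#"say "hi""#,
        // With two `#`s, even `"#` in the body does not end the literal.
        r##"contains "# inside"##,
    ]
}
```

Each body ends only at a double-quote followed by the same number of `#` characters that opened the literal.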

All Unicode characters contained in the raw string body represent themselves,
the characters `U+0022` (double-quote) (except when followed by at least as
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
Alternatively, a byte string literal can be a _raw byte string literal_, defined
below.

Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.
The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.

Some additional _escapes_ are available in either byte or non-raw byte string
literals. An escape starts with a `U+005C` (`\`) and continues with one of the
following forms:
@@ -281,19 +285,19 @@ following forms:
> `br` RAW_BYTE_STRING_CONTENT SUFFIX<sup>?</sup>
>
> RAW_BYTE_STRING_CONTENT :\
> `"` ASCII<sup>* (non-greedy)</sup> `"`\
> `"` ASCII_FOR_RAW<sup>* (non-greedy)</sup> `"`\
> | `#` RAW_BYTE_STRING_CONTENT `#`
>
> ASCII :\
> _any ASCII (i.e. 0x00 to 0x7F)_
> ASCII_FOR_RAW :\
> _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_

Raw byte string literals do not process any escapes. They start with the
character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
_raw string body_ can contain any sequence of ASCII characters and is terminated
only by another `U+0022` (double-quote) character, followed by the same number of
`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
character. A raw byte string literal can not contain any non-ASCII byte.
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.

The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
A raw byte string literal can not contain any non-ASCII byte.
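
To contrast escape processing in byte strings with raw byte strings, here is a brief sketch. The function name `byte_string_examples` is made up for this example.

```rust
/// Illustrative only: the same source characters as a byte string literal
/// and as a raw byte string literal.
fn byte_string_examples() -> (Vec<u8>, Vec<u8>) {
    // In `b"..."`, `\x41` and `\n` are escapes: two bytes in total.
    let escaped = b"\x41\n".to_vec();
    // In `br"..."`, nothing is an escape: six ASCII bytes, `\x41\n` verbatim.
    let raw = br"\x41\n".to_vec();
    (escaped, raw)
}
```

The raw form keeps the backslashes as bytes, so it is the longer of the two values.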

All characters contained in the raw string body represent their ASCII encoding,
the characters `U+0022` (double-quote) (except when followed by at least as
@@ -339,6 +343,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
permitted within a C string.

Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.
The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.

Some additional _escapes_ are available in non-raw C string literals. An escape
starts with a `U+005C` (`\`) and continues with one of the following forms:

@@ -381,11 +390,10 @@ c"\xC3\xA6";

Raw C string literals do not process any escapes. They start with the
character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
_raw C string body_ can contain any sequence of Unicode characters (other than
`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
followed by the same number of `U+0023` (`#`) characters that preceded the
opening `U+0022` (double-quote) character.
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.

The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

All characters contained in the raw C string body represent themselves in UTF-8
encoding. The characters `U+0022` (double-quote) (except when followed by at