Add a quick post about parsers, hooray

R Tyler Croy 2020-12-21 20:47:38 -08:00
parent d2a2433a25
commit ed589d68c4
1 changed files with 104 additions and 0 deletions

@ -0,0 +1,104 @@
---
layout: post
title: Parsing in Rust
tags:
- rust
- pest
- antlr
---
In a world where everything is increasingly YAML, you might find yourself
wondering: "why bother to write a parser?" For starters, I recommend reading
the [YAML specification](https://yaml.org/spec/1.2/spec.html) if you haven't
already, but more importantly: there are so many domains which can be better
modeled with domain-specific semantics and syntax. When I was younger, parsing
was typically done with lex/yacc/bison/whatever and was complete drudgery, but
there are a few great modern tools in the Rust ecosystem that make writing
parsers _fun_.

I first dabbled in writing parsers with [ANTLRv4](https://github.com/antlr)
which is an absolutely **fantastic** toolset for writing parsers. The primary
author [Terence Parr](https://github.com/parrt) has written a number of good
books such as "The Definitive ANTLR 4 Reference" and "Language Implementation
Patterns", both of which I recommend even if you're not setting out to write
that next great programming language.

In [Rust](https://rust-lang.org) our options are also pretty decent. When I
first ventured into writing Rust I discovered
[antlr4rust](https://github.com/rrevenantt/antlr4rust) which I promptly
bookmarked and then set aside until I had a parsing project. Once I finally had
one, I revisited it and found that I didn't like how the ANTLR-like semantics
felt in Rust. It didn't quite feel idiomatic enough for me to feel
comfortable.

More recently I have discovered **[Pest](https://pest.rs/)** which I have now
used within [Otto](https://github.com/rtyler/otto) and my most recent
experiment [Jenkins Declarative Parser](https://github.com/rtyler/jdp).

The grammar is similar enough to ANTLR that I was able to get started and
express my ideas quite quickly. Still, I haven't become clever enough to use
parser-level stack manipulations, so I think that means I remain a
parser-simpleton.

Below is an example of the grammar necessary to parse the `script { }` step in
Declarative Jenkins Pipelines, which themselves allow arbitrary Groovy code
within them (I didn't want to parse the Groovy too).

```peg
scriptStep = { "script" ~ opening_brace ~ groovy ~ closing_brace }
groovy = {
    (
        // Handle nested structures
        (opening_brace ~ groovy ~ closing_brace)
        | (!closing_brace ~ ANY)
    )*
}
stagesDecl = { "stages" ~
    opening_brace ~
    stage+ ~
    closing_brace
}
```
The qualifiers and details on the grammar can be found in the [pest_derive
crate's documentation](https://docs.rs/pest_derive/).
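As a rough illustration of what the recursive `groovy` rule accomplishes, here is a minimal sketch in plain Rust with no Pest involved. The function name `script_body` is hypothetical, and the sketch assumes `opening_brace` and `closing_brace` are the literal `{` and `}` characters, which the grammar excerpt does not show. It tracks nesting depth so that it stops at the brace that closes the step rather than at the first `}` it sees.

```rust
/// Returns the text between the first `{` in `input` and its matching `}`,
/// or `None` if there is no `{` or the braces never balance.
/// Hypothetical helper for illustration; not part of Pest or jdp.
fn script_body(input: &str) -> Option<&str> {
    let open = input.find('{')?;
    let mut depth = 0usize;
    for (offset, ch) in input[open..].char_indices() {
        match ch {
            // Entering a nested block (the first `{` brings depth to 1).
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    // Skip the opening brace and stop before the closing one.
                    return Some(&input[open + 1..open + offset]);
                }
            }
            // Anything else is opaque Groovy that we do not inspect.
            _ => {}
        }
    }
    None
}

fn main() {
    let step = "script { if (ready) { sh 'make' } }";
    assert_eq!(script_body(step), Some(" if (ready) { sh 'make' } "));
    assert_eq!(script_body("script { unbalanced"), None);
}
```

Braces inside Groovy string literals or comments would confuse this sketch, since it never looks at what the characters mean.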

Once compiled into the Rust program, using the generated parser is a _little_
goofy but still very workable. A snippet:

```rust
let mut parser = PipelineParser::parse(Rule::pipeline, buffer)?;
while let Some(parsed) = parser.next() {
    match parsed.as_rule() {
        Rule::agentDecl => {
            // parse the agent {} declaration
        }
        Rule::stagesDecl => {
            parse_stages(&mut parsed.into_inner())?;
        }
        _ => {}
    }
}
```
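The loop above dispatches on each pair's rule and hands the children to a helper. The same shape can be sketched without Pest, assuming a hypothetical `Node` enum standing in for Pest's pairs and hypothetical `Pipeline` and `Stage` structs as the internal model. None of these names come from jdp or Otto; they only show the idea of matching on node kind to build a data structure.

```rust
// Hypothetical stand-in for a parse tree node; Pest's real type is `Pair`.
#[derive(Debug)]
enum Node {
    Agent(String),
    Stages(Vec<Node>),
    Stage { name: String, steps: Vec<String> },
}

// Hypothetical internal model the rest of a program would consume.
#[derive(Debug, Default, PartialEq)]
struct Pipeline {
    agent: Option<String>,
    stages: Vec<Stage>,
}

#[derive(Debug, PartialEq)]
struct Stage {
    name: String,
    steps: Vec<String>,
}

/// Walks the top-level nodes and fills in a `Pipeline`.
fn build_pipeline(nodes: Vec<Node>) -> Pipeline {
    let mut pipeline = Pipeline::default();
    for node in nodes {
        match node {
            Node::Agent(label) => pipeline.agent = Some(label),
            Node::Stages(children) => {
                // Only `Stage` children are meaningful inside `stages { }`.
                for child in children {
                    if let Node::Stage { name, steps } = child {
                        pipeline.stages.push(Stage { name, steps });
                    }
                }
            }
            // A stage outside `stages { }` is ignored, like the `_ => {}` arm.
            Node::Stage { .. } => {}
        }
    }
    pipeline
}

fn main() {
    let nodes = vec![
        Node::Agent("linux".to_string()),
        Node::Stages(vec![Node::Stage {
            name: "Build".to_string(),
            steps: vec!["make".to_string()],
        }]),
    ];
    let pipeline = build_pipeline(nodes);
    assert_eq!(pipeline.agent.as_deref(), Some("linux"));
    assert_eq!(pipeline.stages.len(), 1);
}
```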

The parsers I am writing tend to be relatively simplistic, taking user-friendly
models and turning them into internal data structures for further use. While
basic, it reminds me of the domain-specific language (DSL) "fad" among Rubyists.
I once joked "for loving Ruby so much, Rubyists sure do spend a lot of time
building tools to avoid writing Ruby." Once you have a simple and easy approach
to create syntax and tooling that better models the domain you're working in,
it's hard to avoid!

YAML, XML, and JSON have their place as data serialization formats, but far too
frequently they're used for configuration or other descriptive purposes. Many
developers will cite "everybody knows YAML" to justify its use, thereby
overlooking that "syntax" and "semantics" are two very distinct pieces of the
puzzle. Yes, most everybody grasps the basics of YAML syntax; however, whatever
keys a program treats as semantically significant for its configuration (see:
Kubernetes) are a _very_ different story.
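To make the syntax-versus-semantics split concrete, here is a small sketch. The flat `key: value` format, the `parse_flat` and `replicas` functions, and the `replicas` key itself are all hypothetical stand-ins, not real YAML and not any real tool's schema. The first function only checks shape; the second is where a program decides what a key means.

```rust
use std::collections::HashMap;

/// Syntax: split `key: value` lines into a map. Knows nothing about meaning.
/// Returns `None` if any non-empty line lacks a colon.
fn parse_flat(input: &str) -> Option<HashMap<&str, &str>> {
    input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.split_once(':').map(|(k, v)| (k.trim(), v.trim())))
        .collect()
}

/// Semantics: which key matters and what its value must look like.
fn replicas(config: &HashMap<&str, &str>) -> Result<u32, String> {
    let raw = config.get("replicas").ok_or("missing key: replicas")?;
    raw.parse::<u32>()
        .map_err(|_| format!("replicas must be a number, got {:?}", raw))
}

fn main() {
    // Syntactically fine, so the first stage accepts it.
    let config = parse_flat("name: web\nreplicas: three").expect("valid syntax");
    // The program's own rules are what reject it.
    assert!(replicas(&config).is_err());
}
```

Knowing the syntax gets a user past `parse_flat`; only knowing the program gets them past `replicas`.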

The next time you find yourself needing to describe or model complex concepts
for your program, consider creating a language to describe them! Writing the
parser will be easier than you might think!