Silly Formatting 2
So, as a follow-up to the original silly-formatting
post, I spent some time trying to figure out just how
bad the LALRPOP-based sillyfmt
implementation was. It turns out that there’s
really nice utilities for using the afl
fuzzer with Rust, so I hooked that up
with some pretty basic heuristics and let it run for a few seconds. Then, since
it turned out to be really bad, I got nerd-sniped into setting up a different
parser called tree-sitter
that uses a more generic parsing algorithm.
Unfortunately, that led to another rabbit-hole: tree-sitter
is
a C library,and while Rust/C interop is generally good, this is much less true
when you involve the WebAssembly toolchain(s).
At the end, I got something working with the stdweb
toolchain and some
glorious hacks. But it’s live and runs pretty well, I think!
Fuzzing⌗
In order to make the fuzzer work, I wrote a variant of sillyfmt
that would
crash if it lost alphanumeric characters (in theory I should care about symbols
of various sorts too, but there wasn’t a convenient filter for them and
I thought alphanumeric would be a good enough proxy). Thankfully, this is
actually really easy in Rust.
fuzz!(|data: &[u8]| {
if let Ok(s) = String::from_utf8(data.to_vec()) {
let mut output_buffer = Vec::with_capacity(1024 * 1024);
do_format(&mut output_buffer, s.clone()).unwrap();
let required: HashMap<char, usize> =
s.chars()
.filter(|c| c.is_alphanumeric())
.fold(HashMap::new(), |mut m, c| {
*m.entry(c).or_insert(0) += 1;
m
});
let observed: HashMap<char, usize> = String::from_utf8_lossy(&output_buffer)
.chars()
.fold(HashMap::new(), |mut m, c| {
*m.entry(c).or_insert(0) += 1;
m
});
for (c, count) in required {
let observed_count = observed.get(&c).cloned().unwrap_or(0);
assert!(observed_count >= count);
}
}
});
With initial seeds of JSON, this pretty quickly found issues where mismatched
containers would cause the contents to get dropped, and I couldn’t find a good
way to express my error-recovery intentions in LALRPOP-land. This situation
doesn’t actually happen much in my personal usage of sillyfmt
, but it seemed
like a big enough hole that I ought to fix it.
Using tree-sitter
⌗
A while later, I was browsing the internet and found references to a pretty
cool parser-generator called tree-sitter
, which was primarily intended for
use in the editing and syntax highlighting context. Since people often make
small syntax errors, this seemed like a good fit for sillyfmt
– in order to
be useful, tree-sitter
would have to continue to make somewhat-valid parses
with irregular and unexpected syntax.
As it turns out, tree-sitter
uses a more powerful parsing algorithm than
LALRPOP, so I could use essentially the same grammar. I added in some
additional handling for “binary ops”, e.g. expressions of the form a:b
or
a==b
, and replaced the handling for colon-sequences with a regular expression
for valid times.
module.exports = grammar({
name: 'sillyfmt',
rules: {
source_file: $ => repeat($._expression),
_expression: $ => prec(-10, choice($._nonseq_expr, $.comma_delimited_sequence)),
_nonseq_expr: $ => choice($.container, $.time, $.binary_op, $.symbol, $.text),
binary_op: $ => prec.left(10, seq($._nonseq_expr, $.symbol, $._nonseq_expr)),
symbol: $ => prec(-100, choice('===', '<=>', '=>', '->', '<=', '>=', '==', '=', '<', '>', ':', '-', '+')),
container: $ => choice(
seq(field('open', '('), repeat($._expression), field('close', ')')),
seq(field('open', '['), repeat($._expression), field('close', ']')),
seq(field('open', '{'), repeat($._expression), field('close', '}')),
seq(field('open', '<'), repeat($._expression), field('close', '>')),
),
comma_delimited_sequence: $ => prec.right(20, seq(
repeat1($._nonseq_expr), repeat1(seq(',', prec.right(repeat1($._nonseq_expr)))),
)),
text: $ => prec.left(-50, /[^()\[\]{},:=<>\s]+/),
time: $ => /([0-1]?[0-9]|[2][0-3]):([0-5][0-9])(:[0-5][0-9])?/,
},
conflicts: $ => [
[$.binary_op],
]
});
One pretty awesome thing about tree-sitter
is that there’s a whole CLI
intended to make writing and testing grammars easy. Since the primary user of
the parser is syntax highlighters embedded in Atom and Visual Studio Code, the
entire codebase is also intended to be WASM-compilable using Emscripten.
There’s also a handy Rust crate that wraps the generated C code and exposes the
resulting parse tree, so I hooked all of that up and got a sillyfmt
executable that passed at least a few seconds of fuzz-testing!
Trying to get it all to compile to WASM at once⌗
Unfortunately, it turns out that while Rust and C both have reasonably good support for WASM compilation, they don’t like it much if you link them together.
The history goes something like this: the early efforts to compile native code to the web were part of the Emscripten project, which initially targeted a JIT-able Javascript subset called ASM.js. Later, the Emscripten compiler was updated to emit WebAssembly (WASM) as browser support for that got better. A notable early demo of Emscripten technology was the ability to take a video game (complete with rendering engine) and run it directly in the browser.
This was really cool, but nobody wanted to write a brand new game engine for the demo. So, Emscripten has generally focused on taking existing C and C++ code, and providing whatever shims are necessary to get it to run on the web. This includes a virtual filesystem, mechanisms to compile OpenGL to WebGL, and a bunch of other stuff. It’s actually very, very impressive.
Since Emscripten is based on LLVM, it was only a matter of time until some Rust
developers tried to compile Rust code with it (this happened around 2017,
I think?). I’m not sure anyone really used this for much, but the Rust compiler
has had support for wasm32-unknown-emscripten
for quite some time.
More recently, there’s been a fair amount of interest in using WASM as a portable library format, both in the browser and in other environments. The intent of this work is pretty different than Emscripten – it’s more about writing novel code in Rust, and then calling that code from other languages (usually, Javascript).
The initial sillyfmt-wasm
implementation relied on this toolchain, using the
wasm32-unknown-unknown
target with direct LLVM WASM codegen, and
wasm-bindgen
to create working JS bindings. But, that only worked because the
prior implementations of sillyfmt
were pure Rust! With the new tree-sitter
version, there was now C code. And, to make things even more complicated, the
C code relied on the C standard library, which the wasm32-unknown-unknown
target doesn’t provide!
Fine, I figured. My Rust-expert friend Geoffry suggested that I try using the
Emscripten target, which seemed to compile for him (it didn’t compile for me).
Turns out that on Rust 1.41 the wasm32-unknown-emscripten
toolchain just
segfaults sometimes.
In general it seems like the Rust community has abandoned the
wasm32-unknown-emscripten
toolchain. The most recent compilers (nightlies)
and recent Emscripten versions would either generate code that panicked on
load, or which put some kind of random junk into static variables that the
tree-sitter
parser was using.
In the process of investigating this, I discovered the stdweb
project, which
makes it possible to write client-side web code in Rust. Aha! And it even
claims to support the wasm32-unknown-emscripten
target!
Alas, it looks like the last time someone had the Emscripten toolchain working
for stdweb
was more than a year ago, and the entire compiler backend for
Emscripten has changed since then. I wasted more time than I’m willing to admit
trying different rustc
and emcc
versions to see if there was a pair that
would get things working.
As it turns out, none of the ones I tried do. Fortunately, sillyfmt
isn’t
a very performance sensitive piece of code, so I ended up trying a really jank
hack to make things work. You see, Rust is this super generic language –
generic enough that I could hide the actual parser and parse-tree
implementation behind a trait… and then use Javascript to bridge between the
Rust WASM binary (generated by rustc
with wasm32-unknown-unknown
) and the
tree-sitter
WASM binary (generated by emcc
) to get the parser to work.
So I defined some traits:
pub trait ParseTree {
fn root_node(&self) -> Box<dyn ParseNode<'_> + '_>;
}
pub trait ParseCursor<'a> {
fn goto_first_child(&mut self) -> bool;
fn goto_next_sibling(&mut self) -> bool;
fn node(&self) -> Box<dyn ParseNode<'a> + 'a>;
fn field_name(&self) -> Option<String>;
}
pub trait ParseNode<'a> {
fn walk(&self) -> Box<dyn ParseCursor<'a> + 'a>;
fn kind(&self) -> String;
fn utf8_text(&self, data: &'_ [u8]) -> String;
fn is_named(&self) -> bool;
}
and rewrote the formatting code to use the trait rather than the underlying parse-tree API.
In the native Rust binary, I link directly against the C API:
struct WrappedTree(Tree);
impl ParseTree for WrappedTree {
fn root_node(&self) -> Box<dyn ParseNode<'_> + '_> {
Box::new(WrappedNode(self.0.root_node()))
}
}
struct WrappedNode<'a>(Node<'a>);
impl<'a> ParseNode<'a> for WrappedNode<'a> {
fn walk(&self) -> Box<dyn ParseCursor<'a> + 'a> {
Box::new(WrappedCursor(self.0.walk()))
}
fn kind(&self) -> String {
self.0.kind().to_string()
}
fn utf8_text(&self, data: &'_ [u8]) -> String {
self.0.utf8_text(data).unwrap().to_string()
}
fn is_named(&self) -> bool {
self.0.is_named()
}
}
struct WrappedCursor<'a>(TreeCursor<'a>);
impl<'a> ParseCursor<'a> for WrappedCursor<'a> {
fn goto_first_child(&mut self) -> bool {
self.0.goto_first_child()
}
fn goto_next_sibling(&mut self) -> bool {
self.0.goto_next_sibling()
}
fn node(&self) -> Box<dyn ParseNode<'a> + 'a> {
Box::new(WrappedNode(self.0.node()))
}
fn field_name(&self) -> Option<String> {
self.0.field_name().map(|x| x.to_string())
}
}
and in the Web version, I link against the tree-sitter
Javascript library!
struct WrappedTree(stdweb::Value);
impl ParseTree for WrappedTree {
fn root_node(&self) -> Box<dyn ParseNode<'_> + '_> {
Box::new(WrappedNode(js!( return @{&self.0}.rootNode; )))
}
}
struct WrappedNode(stdweb::Value);
impl<'a> ParseNode<'a> for WrappedNode {
fn walk(&self) -> Box<dyn ParseCursor<'a> + 'a> {
Box::new(WrappedCursor(js!( return @{&self.0}.walk(); )))
}
fn kind(&self) -> String {
js!( return @{&self.0}.type; ).try_into().unwrap()
}
fn utf8_text(&self, _: &'_ [u8]) -> String {
js!( return @{&self.0}.text; ).try_into().unwrap()
}
fn is_named(&self) -> bool {
js!( return @{&self.0}.isNamed(); ).try_into().unwrap()
}
}
struct WrappedCursor(stdweb::Value);
impl<'a> ParseCursor<'a> for WrappedCursor {
fn goto_first_child(&mut self) -> bool {
js!( return @{&self.0}.gotoFirstChild(); ).try_into().unwrap()
}
fn goto_next_sibling(&mut self) -> bool {
js!( return @{&self.0}.gotoNextSibling(); ).try_into().unwrap()
}
fn node(&self) -> Box<dyn ParseNode<'a> + 'a> {
Box::new(WrappedNode(js!( return @{&self.0}.currentNode(); )))
}
fn field_name(&self) -> Option<String> {
js!( return @{&self.0}.currentFieldName(); ).try_into().unwrap()
}
}
This generates some pretty horrendous Javascript, but it seems to work OK.
I’ll admit also that it’s pretty cool to write Rust code and have it do things that I associate pretty strongly with Javascript, like manipulating the DOM:
fn main() {
stdweb::initialize();
let input: TextAreaElement = document()
.query_selector("#text-input")
.unwrap()
.unwrap()
.try_into()
.unwrap();
let output = document().query_selector("#text-output").unwrap().unwrap();
input.add_event_listener(enclose!( (input, output) move |_: InputEvent| {
let s = input.value();
let mut out = Vec::new();
let _ = silly_format(Cursor::new(s), Cursor::new(&mut out), false, |x| {
Box::new(WrappedTree(js!(
return parser.parse(@{x});
)))
});
let formatted = String::from_utf8_lossy(&out);
output.set_text_content(&*formatted);
}));
stdweb::event_loop();
}
The long chains of unwrap
are a little annoying, but it ends up actually
looking pretty similar to the JS implementation.