Silly Formatting 2 :: Robert Ying

So, as a follow-up to the original silly-formatting post, I spent some time trying to figure out just how bad the LALRPOP-based sillyfmt implementation was. It turns out that there’s really nice utilities for using the afl fuzzer with Rust, so I hooked that up with some pretty basic heuristics and let it run for a few seconds. Then, since it turned out to be really bad, I got nerd-sniped into setting up a different parser called tree-sitter that uses a more generic parsing algorithm. Unfortunately, that led to another rabbit-hole: tree-sitter is a C library,and while Rust/C interop is generally good, this is much less true when you involve the WebAssembly toolchain(s).

At the end, I got something working with the stdweb toolchain and some glorious hacks. But it’s live and runs pretty well, I think!

Fuzzing⌗

In order to make the fuzzer work, I wrote a variant of sillyfmt that would crash if it lost alphanumeric characters (in theory I should care about symbols of various sorts too, but there wasn’t a convenient filter for them and I thought alphanumeric would be a good enough proxy). Thankfully, this is actually really easy in Rust.

fuzz!(|data: &[u8]| {
    if let Ok(s) = String::from_utf8(data.to_vec()) {
        let mut output_buffer = Vec::with_capacity(1024 * 1024);
        do_format(&mut output_buffer, s.clone()).unwrap();

        let required: HashMap<char, usize> =
            s.chars()
                .filter(|c| c.is_alphanumeric())
                .fold(HashMap::new(), |mut m, c| {
                    *m.entry(c).or_insert(0) += 1;
                    m
                });
        let observed: HashMap<char, usize> = String::from_utf8_lossy(&output_buffer)
            .chars()
            .fold(HashMap::new(), |mut m, c| {
                *m.entry(c).or_insert(0) += 1;
                m
            });

        for (c, count) in required {
            let observed_count = observed.get(&c).cloned().unwrap_or(0);
            assert!(observed_count >= count);
        }
    }
});

With initial seeds of JSON, this pretty quickly found issues where mismatched containers would cause the contents to get dropped, and I couldn’t find a good way to express my error-recovery intentions in LALRPOP-land. This situation doesn’t actually happen much in my personal usage of sillyfmt, but it seemed like a big enough hole that I ought to fix it.

Using `tree-sitter`⌗

A while later, I was browsing the internet and found references to a pretty cool parser-generator called tree-sitter, which was primarily intended for use in the editing and syntax highlighting context. Since people often make small syntax errors, this seemed like a good fit for sillyfmt – in order to be useful, tree-sitter would have to continue to make somewhat-valid parses with irregular and unexpected syntax.

As it turns out, tree-sitter uses a more powerful parsing algorithm than LALRPOP, so I could use essentially the same grammar. I added in some additional handling for “binary ops”, e.g. expressions of the form a:b or a==b, and replaced the handling for colon-sequences with a regular expression for valid times.

module.exports = grammar({
  name: 'sillyfmt',
  rules: {
    source_file: $ => repeat($._expression),

    _expression: $ => prec(-10, choice($._nonseq_expr, $.comma_delimited_sequence)),

    _nonseq_expr: $ => choice($.container, $.time, $.binary_op, $.symbol, $.text),

    binary_op: $ => prec.left(10, seq($._nonseq_expr, $.symbol, $._nonseq_expr)),

    symbol: $ => prec(-100, choice('===', '<=>', '=>', '->', '<=', '>=', '==', '=', '<', '>', ':', '-', '+')),

    container: $ => choice(
      seq(field('open', '('), repeat($._expression), field('close', ')')),
      seq(field('open', '['), repeat($._expression), field('close', ']')),
      seq(field('open', '{'), repeat($._expression), field('close', '}')),
      seq(field('open', '<'), repeat($._expression), field('close', '>')),
    ),

    comma_delimited_sequence: $ => prec.right(20, seq(
      repeat1($._nonseq_expr), repeat1(seq(',', prec.right(repeat1($._nonseq_expr)))),
    )),

    text: $ => prec.left(-50, /[^()\[\]{},:=<>\s]+/),
    time: $ => /([0-1]?[0-9]|[2][0-3]):([0-5][0-9])(:[0-5][0-9])?/,
  },
  conflicts: $ => [
    [$.binary_op],
  ]
});

One pretty awesome thing about tree-sitter is that there’s a whole CLI intended to make writing and testing grammars easy. Since the primary user of the parser is syntax highlighters embedded in Atom and Visual Studio Code, the entire codebase is also intended to be WASM-compilable using Emscripten.

There’s also a handy Rust crate that wraps the generated C code and exposes the resulting parse tree, so I hooked all of that up and got a sillyfmt executable that passed at least a few seconds of fuzz-testing!

Trying to get it all to compile to WASM at once⌗

Unfortunately, it turns out that while Rust and C both have reasonably good support for WASM compilation, they don’t like it much if you link them together.

The history goes something like this: the early efforts to compile native code to the web were part of the Emscripten project, which initially targeted a JIT-able Javascript subset called ASM.js. Later, the Emscripten compiler was updated to emit WebAssembly (WASM) as browser support for that got better. A notable early demo of Emscripten technology was the ability to take a video game (complete with rendering engine) and run it directly in the browser.

This was really cool, but nobody wanted to write a brand new game engine for the demo. So, Emscripten has generally focused on taking existing C and C++ code, and providing whatever shims are necessary to get it to run on the web. This includes a virtual filesystem, mechanisms to compile OpenGL to WebGL, and a bunch of other stuff. It’s actually very, very impressive.

Since Emscripten is based on LLVM, it was only a matter of time until some Rust developers tried to compile Rust code with it (this happened around 2017, I think?). I’m not sure anyone really used this for much, but the Rust compiler has had support for wasm32-unknown-emscripten for quite some time.

More recently, there’s been a fair amount of interest in using WASM as a portable library format, both in the browser and in other environments. The intent of this work is pretty different than Emscripten – it’s more about writing novel code in Rust, and then calling that code from other languages (usually, Javascript).

The initial sillyfmt-wasm implementation relied on this toolchain, using the wasm32-unknown-unknown target with direct LLVM WASM codegen, and wasm-bindgen to create working JS bindings. But, that only worked because the prior implementations of sillyfmt were pure Rust! With the new tree-sitter version, there was now C code. And, to make things even more complicated, the C code relied on the C standard library, which the wasm32-unknown-unknown target doesn’t provide!

Fine, I figured. My Rust-expert friend Geoffry suggested that I try using the Emscripten target, which seemed to compile for him (it didn’t compile for me). Turns out that on Rust 1.41 the wasm32-unknown-emscripten toolchain just segfaults sometimes.

In general it seems like the Rust community has abandoned the wasm32-unknown-emscripten toolchain. The most recent compilers (nightlies) and recent Emscripten versions would either generate code that panicked on load, or which put some kind of random junk into static variables that the tree-sitter parser was using.

In the process of investigating this, I discovered the stdweb project, which makes it possible to write client-side web code in Rust. Aha! And it even claims to support the wasm32-unknown-emscripten target!

Alas, it looks like the last time someone had the Emscripten toolchain working for stdweb was more than a year ago, and the entire compiler backend for Emscripten has changed since then. I wasted more time than I’m willing to admit trying different rustc and emcc versions to see if there was a pair that would get things working.

As it turns out, none of the ones I tried do. Fortunately, sillyfmt isn’t a very performance sensitive piece of code, so I ended up trying a really jank hack to make things work. You see, Rust is this super generic language – generic enough that I could hide the actual parser and parse-tree implementation behind a trait… and then use Javascript to bridge between the Rust WASM binary (generated by rustc with wasm32-unknown-unknown) and the tree-sitter WASM binary (generated by emcc) to get the parser to work.

So I defined some traits:

pub trait ParseTree {
    fn root_node(&self) -> Box<dyn ParseNode<'_> + '_>;
}
pub trait ParseCursor<'a> {
    fn goto_first_child(&mut self) -> bool;
    fn goto_next_sibling(&mut self) -> bool;
    fn node(&self) -> Box<dyn ParseNode<'a> + 'a>;
    fn field_name(&self) -> Option<String>;
}
pub trait ParseNode<'a> {
    fn walk(&self) -> Box<dyn ParseCursor<'a> + 'a>;
    fn kind(&self) -> String;
    fn utf8_text(&self, data: &'_ [u8]) -> String;
    fn is_named(&self) -> bool;
}

and rewrote the formatting code to use the trait rather than the underlying parse-tree API.

In the native Rust binary, I link directly against the C API:

struct WrappedTree(Tree);
impl ParseTree for WrappedTree {
    fn root_node(&self) -> Box<dyn ParseNode<'_> + '_> {
        Box::new(WrappedNode(self.0.root_node()))
    }
}
struct WrappedNode<'a>(Node<'a>);
impl<'a> ParseNode<'a> for WrappedNode<'a> {
    fn walk(&self) -> Box<dyn ParseCursor<'a> + 'a> {
        Box::new(WrappedCursor(self.0.walk()))
    }
    fn kind(&self) -> String {
        self.0.kind().to_string()
    }
    fn utf8_text(&self, data: &'_ [u8]) -> String {
        self.0.utf8_text(data).unwrap().to_string()
    }
    fn is_named(&self) -> bool {
        self.0.is_named()
    }
}
struct WrappedCursor<'a>(TreeCursor<'a>);
impl<'a> ParseCursor<'a> for WrappedCursor<'a> {
    fn goto_first_child(&mut self) -> bool {
        self.0.goto_first_child()
    }
    fn goto_next_sibling(&mut self) -> bool {
        self.0.goto_next_sibling()
    }
    fn node(&self) -> Box<dyn ParseNode<'a> + 'a> {
        Box::new(WrappedNode(self.0.node()))
    }
    fn field_name(&self) -> Option<String> {
        self.0.field_name().map(|x| x.to_string())
    }
}

and in the Web version, I link against the tree-sitter Javascript library!

struct WrappedTree(stdweb::Value);
impl ParseTree for WrappedTree {
    fn root_node(&self) -> Box<dyn ParseNode<'_> + '_> {
        Box::new(WrappedNode(js!( return @{&self.0}.rootNode; )))
    }
}
struct WrappedNode(stdweb::Value);
impl<'a> ParseNode<'a> for WrappedNode {
    fn walk(&self) -> Box<dyn ParseCursor<'a> + 'a> {
        Box::new(WrappedCursor(js!( return @{&self.0}.walk(); )))
    }
    fn kind(&self) -> String {
        js!( return @{&self.0}.type; ).try_into().unwrap()
    }
    fn utf8_text(&self, _: &'_ [u8]) -> String {
        js!( return @{&self.0}.text; ).try_into().unwrap()
    }
    fn is_named(&self) -> bool {
        js!( return @{&self.0}.isNamed(); ).try_into().unwrap()
    }
}
struct WrappedCursor(stdweb::Value);
impl<'a> ParseCursor<'a> for WrappedCursor {
    fn goto_first_child(&mut self) -> bool {
        js!( return @{&self.0}.gotoFirstChild(); ).try_into().unwrap()
    }
    fn goto_next_sibling(&mut self) -> bool {
        js!( return @{&self.0}.gotoNextSibling(); ).try_into().unwrap()
    }
    fn node(&self) -> Box<dyn ParseNode<'a> + 'a> {
        Box::new(WrappedNode(js!( return @{&self.0}.currentNode(); )))
    }
    fn field_name(&self) -> Option<String> {
        js!( return @{&self.0}.currentFieldName(); ).try_into().unwrap()
    }
}

This generates some pretty horrendous Javascript, but it seems to work OK.

I’ll admit also that it’s pretty cool to write Rust code and have it do things that I associate pretty strongly with Javascript, like manipulating the DOM:

fn main() {
    stdweb::initialize();

    let input: TextAreaElement = document()
        .query_selector("#text-input")
        .unwrap()
        .unwrap()
        .try_into()
        .unwrap();
    let output = document().query_selector("#text-output").unwrap().unwrap();

    input.add_event_listener(enclose!( (input, output) move |_: InputEvent| {
        let s = input.value();
        let mut out = Vec::new();
        let _ = silly_format(Cursor::new(s), Cursor::new(&mut out), false, |x| {
            Box::new(WrappedTree(js!(
                return parser.parse(@{x});
            )))
        });
        let formatted = String::from_utf8_lossy(&out);
        output.set_text_content(&*formatted);
    }));

    stdweb::event_loop();
}

The long chains of unwrap are a little annoying, but it ends up actually looking pretty similar to the JS implementation.

Silly Formatting 2

Fuzzing⌗

Using tree-sitter⌗

Trying to get it all to compile to WASM at once⌗

Using `tree-sitter`⌗