Catching up on my Rust learning diaries… Today I'm going to tell you about relieving my Lexer of its Pythonic legacy and what tangible results it produced, besides just being The Right Thing™.

Basic idea

The original lexer yielded three kinds of lexemes:

- strings enclosed in double quotes
- single-character punctuation: {, }, [, ], comma and colon
- everything else: numbers, booleans and the null

Type-wise all of them were strings, and it was the job of the parser to check what kind of lexeme each one was: a known literal, something starting with a quote, or something parsable as a number… or an error, failing all else.

This made total sense in Python where, for example, I just used a single generalized regexp to parse all non-string lexemes. It allowed for very simple and readable code, and it's in fact the only right way to parse byte buffers in an untyped GC-powered language, where dealing with individual bytes introduces too much performance overhead.

In Rust, though, it simply felt foreign, because the lexer already has an intimate understanding of what it is that it parses — something starting with ", or +|-|0..9, or {, … — it has to explicitly check them all anyway. Hence it seemed silly to just drop this intrinsic type information on the floor and clump everything back into strings.
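
To illustrate the point (this is my own sketch, not the actual lexer code — the names are hypothetical), the kind of a JSON lexeme really is fully determined by its first byte, so the lexer knows the type before it has consumed a single character:

```rust
// Hypothetical sketch: classifying a JSON lexeme by its first byte.
fn classify(first: u8) -> &'static str {
    match first {
        b'"' => "string",
        b'{' | b'}' | b'[' | b']' | b',' | b':' => "punctuation",
        b'+' | b'-' | b'0'..=b'9' => "number",
        b't' | b'f' => "boolean", // true / false
        b'n' => "null",
        _ => "error",
    }
}
```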

I also suspected that it affected performance quite significantly, as I had to allocate memory for, and copy, all those small string pieces. Lots of allocations and copying are never good!

Process

I started by introducing a dedicated Lexeme type distinguishing between strings, single-character lexemes and everything else under the umbrella term "scalar" (don't grumble about the name, it was destined to go away in any case):

#[derive(Debug, PartialEq)]
pub enum Lexeme {
    String(String),
    Scalar(String),
    OBrace,
    CBrace,
    OBracket,
    CBracket,
    Comma,
    Colon,
}

If anything, it made the code uglier at first, as there were now two paradigms sitting side by side: typed values and "scalars" that I had to handle the old way:

match lexeme {
    Lexeme::OBracket => Event::StartArray,
    Lexeme::OBrace => Event::StartMap,
    Lexeme::CBracket => Event::EndArray,
    Lexeme::CBrace => Event::EndMap,
    Lexeme::String(s) => Event::String(try!(unescape(s))),

// The Ugliness boundary :-)

    Lexeme::Scalar(ref s) if s == "null" => Event::Null,
    Lexeme::Scalar(ref s) if s == "true" => Event::Boolean(true),
    Lexeme::Scalar(ref s) if s == "false" => Event::Boolean(false),
    Lexeme::Scalar(s) => {
        Event::Number(try!(s.parse().map_err(|_| Error::Unknown(s))))
    },
    _ => unreachable!(),
}

Next, the string un-escaping business was moved entirely into the lexer. Even though it was pretty much a verbatim move of a bunch of code from one module to another, it made it obvious that I was actually processing escaped characters twice: first simply to correctly find the terminating ", and then to decode all those escapes into raw characters. This proved to be a good optimization opportunity later. It never ceases to amaze me how such simple refactorings sometimes give you a much better insight! Do not ignore them.
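
The single-pass idea can be sketched like this (a toy version, not the actual lexer code: it assumes ASCII input and handles only the simple escapes, omitting \uXXXX and friends). While scanning for the terminating quote we have to look at every backslash anyway, so we can decode the escapes right there instead of re-walking the string:

```rust
// Sketch: find the terminating quote and decode escapes in one pass.
// `input` starts right after the opening quote; returns the decoded
// string and how many bytes were consumed (including the closing quote).
fn scan_string(input: &[u8]) -> Option<(String, usize)> {
    let mut out = String::new();
    let mut i = 0;
    while i < input.len() {
        match input[i] {
            b'"' => return Some((out, i + 1)),
            b'\\' => {
                // decode the escape immediately instead of deferring
                // it to a second pass over the lexeme
                i += 1;
                match *input.get(i)? {
                    b'n' => out.push('\n'),
                    b't' => out.push('\t'),
                    b'"' => out.push('"'),
                    b'\\' => out.push('\\'),
                    _ => return None, // \uXXXX etc. omitted in this sketch
                }
            }
            c => out.push(c as char), // ASCII only in this sketch
        }
        i += 1;
    }
    None // unterminated string
}
```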

Finally, I split Lexeme::Scalar into honest numbers, booleans and the null. The code got more readable and more idiomatic all over, and there was much rejoicing!
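
For the record, the final shape of the type was along these lines (my reconstruction; the actual variant names in the code may differ):

```rust
// A reconstruction of the fully typed Lexeme after splitting Scalar
// into numbers, booleans and the null; variant names are my guess.
#[derive(Debug, PartialEq)]
pub enum Lexeme {
    String(String),
    Number(f64),
    Boolean(bool),
    Null,
    OBrace,
    CBrace,
    OBracket,
    CBracket,
    Comma,
    Colon,
}
```

With this, the Ugliness boundary in the parser's match disappears: every arm becomes a plain structural match, with no string comparisons or late parsing.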

Bumps along the road

During all those refactorings I had to constantly fiddle with error definitions (of which I wasn't a fan to begin with): changing wrapped error types and types of error parameters — all that really fun stuff, you know… This is the price of using a strongly-typed language: having the knowledge codified in two forms, type declarations and the code itself.

Performance

I didn't do this entire exercise just to feel better about more idiomatic code (though that would be reason enough). It was actually the first step in the ongoing performance optimization endeavour that I promised last time.

Cutting right to the chase: this change gave me a 1.5× performance gain on my 18 MB test JSON. Still a far cry, though, from my reference point, yajl:

Before refactoring    0.290 secs
After refactoring     0.193 secs
yajl                  0.051 secs

As far as I understand, this gain can be attributed entirely to removing the allocations of temporary strings for single-character lexemes, though I didn't investigate it properly at that point.

See you next time for more optimization fun!

Comments: 2

  1. hackflow

    You are still making copies of vectors "true", "false" and such, as well as real strings. You can try representing strings by offset + length and matching named constants inplace.

  2. Ivan Sagalaev

    > You are still making copies of vectors "true", "false" and such, as well as real strings. You can try representing strings by offset + length and matching named constants inplace.

    It's not that simple unfortunately as it won't work when a lexeme crosses the buffer boundary. Still, I did a copy-less checking of "true", "false" and "null" in 7e341b3. In short, I walk through the buffer in the usual way and check each character against the known str.
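
The byte-by-byte matching idea can be sketched like this (my sketch of the approach described above, not the actual code in 7e341b3). The matcher only remembers how far into the keyword it got, so its state trivially survives a buffer boundary without copying the lexeme:

```rust
// Sketch: match a known keyword byte by byte across buffer boundaries.
struct KeywordMatcher {
    keyword: &'static [u8],
    matched: usize, // how many bytes of `keyword` we've seen so far
}

impl KeywordMatcher {
    fn new(keyword: &'static [u8]) -> Self {
        KeywordMatcher { keyword, matched: 0 }
    }

    // Feed one buffer; returns true once the whole keyword has been seen.
    fn feed(&mut self, buf: &[u8]) -> bool {
        for &b in buf {
            if self.matched < self.keyword.len() && b == self.keyword[self.matched] {
                self.matched += 1;
            } else {
                return false; // mismatch — this is not our keyword
            }
        }
        self.matched == self.keyword.len()
    }
}
```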
