It took me a while but I've finally implemented a working parser on top of the lexer in my little Rust learning project. I learned a lot and feel much more comfortable with the language by now. In the meantime I even managed to get out to a Rust Seattle meetup, meet new folks and share an idea about doing some coding together in the future. Let's see how it'll work out.

Power of yield

First, a digression. It's not about Rust at all, but I'll get to my point eventually, I promise!

When I was coding the same problem in Python I didn't fully appreciate the expressive power of generators, I simply used them because it seemed to be the most natural way. Have a look:

def parse_value(lexer, symbol=None):
    // ...

    elif symbol == '[':
        yield from parse_array(lexer)
    elif symbol == '{':
        yield from parse_object(lexer)

    // ...

def parse_array(lexer):
    yield ('start_array', None)
    symbol = next(lexer)
    if symbol != ']':
        while True:
            yield from parse_value(lexer, symbol)
            symbol = next(lexer)
            if symbol == ']':
                break
            if symbol != ',':
                raise UnexpectedSymbol(symbol)
            symbol = next(lexer)
    yield ('end_array', None)

Parsing JSON (or almost anything, for that matter) requires keeping a state describing where in the structure you are now: are you expecting a scalar value, or an object key, or a comma, etc. Also, since arrays and objects can be nested you have to keep track of them opening and closing in the correct order in a some sort of stack.

Magic of the yield keyword lets you leave a function and then return to the same place, implicitly giving you both the state and the stack for free:

With both of those facilities out of the way the code simply represents the grammar in the natural order. All of that thanks to the semantics of yield.

Going the hard, explicit, low-level way

Rust doesn't have yield. Which means iteration is implemented by repeatedly calling the next() function, and you have to explicitly keep both the state and the stack in the iterator object between the calls.

This is where I spent some number of days trying different approaches, figuring out which states I need, whether I need a loop processing non-significant lexemes like : and , or a recursive call to next() would do the job, stuff like that. It was hard, partly because of my unfamiliarity with the language and partly because I'm definitely not the best algorithmist out there. But ultimately there's always a price you pay in productivity when working in a typed language, especially the one with a strict policy on using pointers.

In Python, I tend to work top-down, starting with roughly sketching the whole algorithm making it just barely runnable to see if it works at all as early as possible. I rely on the language letting me be sloppy with types and error handling and leaving whole parts of the code essentially non-working if I'm not going to run them just yet.

In Rust, the compiler doesn't want to hear your pleads and promises that you're going to cleanup the mess later: everything must be tidied up and compiled, period. Want to play with adding a flag to one of the states? Sure, just go ahead and update the definition a couple of screens up and the initialization a couple of screens down. Want to see if you can call that code recursively? Well, this is next(&mut self) — a function taking a mutable reference and you can't have more than one, ever. So no, you can't call that one recursively, you'll probably have to extract that part of logic in another function and go through some amount of yak shaving making sure it's pure and doesn't want a mutable self. At which point it doesn't look like a quick checking out anymore.

Constant context switching between thinking about overall architecture and implementation details, such as reference herding, is the hardest part of Rust for me right now. I think it's unavoidable, even though I'm getting better at it :-)

It's all not in vain, of course. Types do help in reasoning about code. If you see a function taking an immutable reference you know it won't change it without looking through its code. You also know it won't suddenly become non-pure later on.

Enums

Enough whining, though. If there's one thing that I'm completely in love right now it's enums! What's so exciting about enums, you ask? Well, first of all, they're misleadingly named. They are really tagged unions that represent a value coming in several variants. Here's an example.

As my parser goes through a JSON it yields events, like a start of an object, a key, a number, a boolean, an end of an object, etc. You want to know two things about an event: what it is and, for somey, the actual value associated with it. Here's how the type looks in Rust:

enum Event {
    Null,
    Boolean(bool),
    String(String),
    Key(String),
    Number(f64),
    StartArray,
    EndArray,
    StartMap,
    EndMap,
}

The wonderful part is that when processing this you get safe typed values without doing any casting:

match event {
    Boolean(value) => // `value` is bool
    Number(value) =>  // `value` is f64
    StartArray =>     // you don't need no values here, so you're not getting any
    // ...
}

Neat, right!? And pattern matching can be way more elaborate, by the way.

What you can't do though is to simply check the value of an enum with an if:

if self.state == State::Closed { ... }

// error: binary operation `==` cannot be applied to type `State` [E0369]

This tripped me up pretty severely at one point when I organized my whole logic around checking for specific states here and there. For example, an object key in JSON is not awfully different from any scalar value: you're parsing a string and then just call it "key" instead of "string". Nope, didn't let me do that, no sir. Had to reshape all that state handling into one big match with very similar looking parts.

But Rust was right, though, after all. Because treating an object key is in fact completely different from treating scalar values if you take into account the dynamics of states and the stack. They won't even share the string parsing code because keys don't need handling of backslash escapes.

if let

Rust extensively uses two enum types throughout the standard library: Option and Result. I was under the impression that inability to use if means that whenever you work with a function returning any of those you have to handle it with match and lose any dreams of composability.

However a few days ago I stumbled upon an excellent article "Error Handling in Rust" which I wholeheartedly recommend to any Rust beginner. From it I learned about if let: it follows the same rules for matching as match, so you can write this:

if let Event::Number(value) = event {
    // handle only the case when `event` is a Number
}

This is actually an assignment, so you have = instead of == and an rvalue on the right. It can look even weirder with parameter-less enum variants:

if let State::Closed = state {
    // - Did you just assign a variable to a value, Bob?
    // - Shut up and handle your closed state.
}

However what I miss is something like if not let, basically an else-clause of that if. Say, I want to pop a value from the stack and make sure it's the one I expect (and the stack is not empty, of course). Currently I do this:

match self.stack.pop() {
    Some(b'[') => (),
    _ => panic!("Unmatched ]"),
}

All this business with the do-nothing thingie () and the default case _ => is a little bit not pretty. What I really want is this:

if not let Some(b'[') = self.stack.pop() {
  panic!("Unmatched ]") // or something more sensible than panic! when I get to it
}

But it's nitpicking, of course :-)

Peekable lexer

It occurred to me at one point that my Lexer, being an iterator, lacks the ability to tell me which lexeme would come next without actually consuming it. In other words, it wasn't "peekable". Without this, for example, I had to call out to the entire non-pure process of handling another parser state when, say, I got a closing brace } while expecting an object key.

So I went ahead and replaced an Iterator trait on the Lexer with a passable custom implementation sporting methods lookup() and consume() (learning about Option.take() along the way).

Well, turns out they already have this thing, right there in the standard library, working on top of any basic Iterator. Now my parser holds a Peekable<Lexer> initialized simply with lexer(file).peekable(), and I can self.lexer.peek() as well as self.lexer.next(). Great! Love deleting custom code!

By the way, you can have peekable in Python too, just not from the standard library.

TODO

The library is shaping up nicely, here's what's next:

Call for help

At this point I'd really love some code review and expert advise from fellow Rustians. If you have time and want to help, please feel free to file pull requests or just leave comments here.

Thank you!

Comments: 2 (noteworthy: 1)

  1. Valerii Hiora

    Noteworthy comment

    Added a simple PR with example how pattern matching can make the code much shorter and cleaner.

    Another thing to mention is that I'd bet that Rust users would expect a streaming JSON parser which works over &[u8] or Read+Seek and doesn't create a new Vec for every lexeme.

    The same goes to the project/file structure - it might be irrelevant for a pet project, but it allows a better way to communicate for others. For example, this time I was caught by the fact that test is actually should be run through cargo run, while I mechanically used a hotkey for cargo test and missed an logic error :-)

    Back to the post...

    Well, this is next(&mut self) — a function taking a mutable reference and you can't have more than one, ever. So no, you can't call that one recursively, you'll probably have to extract that part of logic in another function and go through some amount of yak shaving making sure it's pure and doesn't want a mutable self.

    To clarify, you still can use functions with mutable borrowing if compiler can prove that borrow has an appropriate lifetime. See example of this in assert_top_eq in PR.

    What you can't do though is to simply check the value of an enum with an if

    You can, if you either derive or implement Eq for an enum. It's quite useful for simple C-like enums like state.

    The primary purpose of if let has more to do with destructuring rather than simple checking for a variant.

    For example, you should use it if you need to filter based on "internal data" and/or interested to use only a couple of fields. Here is an example, I apologize if it doesn't look natural, but I hope it clearly demonstrates that if let and if solve pretty different problems.

  2. Ivan Sagalaev

    Another thing to mention is that I'd bet that Rust users would expect a streaming JSON parser which works over &[u8] or Read+Seek and doesn't create a new Vec for every lexeme.

    That actually turned out to be more complicated than I thought before.

    First of all, Iterator simply doesn't support streaming over a buffer. Since fn next(&mut self) doesn't specify a lifetime parameter you can't return anything thus parametrized from it while still satisfying the trait interface. So I'll have to give up on the idea of Lexer impl'ing an Iterator. Which is no big deal actually, it's a library internal API.

    However there's another problem: when a lexeme falls onto a buffer boundary I have to keep its first part in memory separately anyway before I recycle the buffer to look for the rest of the lexeme. It'd probably still worth it performance-wise but I decided to leave this optimization out for now.

    The same goes to the project/file structure - it might be irrelevant for a pet project, but it allows a better way to communicate for others.

    Sure thing. It'll work out itself eventually and I'll probably start my next project properly next time when I'm more familiar with all the conventions.

    To clarify, you still can use functions with mutable borrowing if compiler can prove that borrow has an appropriate lifetime.

    Yes, thanks for pointing this out. Right now I can't even reproduce the bug that I was having. Looks like I simply assumed you can't do it at all without really understanding what was wrong.

    You can, if you either derive or implement Eq for an enum.

    That was useful too, thanks!

Add comment