<p>Маниакальный веблог » ijson in Rust. Ivan Sagalaev on programming and web development.</p>
<h1>ijson in Rust: typed lexer</h1>
<p><a href="https://softwaremaniacs.org/blog/2015/11/11/ijson-in-rust-typed-lexer/">Permalink</a></p>
<p>Catching up on my <a href="http://softwaremaniacs.org/blog/category/ijson-in-rust/en/">Rust learning diaries</a>… Today I'm going to tell you about relieving my Lexer from its Pythonic legacy and what tangible results it produced, besides just being The Right Thing™.</p>
<p><a name=more></a></p>
<h2>Basic idea</h2>
<p>The original lexer yielded three kinds of lexemes:</p>
<ul>
<li>strings enclosed in quotes: <code>"..."</code></li>
<li>multi-character literals and numbers: <code>[a-z0-9eE\.\+-]+</code></li>
<li>single-character lexemes: brackets, braces, commas, colons, etc.</li>
</ul>
<p>Type-wise all of them were <em>strings</em>, and it was the job of the parser to check what kind of lexemes they were: a known literal, something starting with a quote or something parsable as a number… or an error, failing all else.</p>
<p>This made total sense in Python where I, for example, just used a single generalized regexp to parse all non-string lexemes. It allowed for very simple and readable code and it's in fact the only right way to parse byte buffers in an untyped GC-powered language where dealing with individual bytes introduces too much performance overhead.</p>
<p>In Rust though it simply felt foreign because the lexer already has an intimate understanding of what it is that it parses — something starting with <code>"</code>, or <code>+|-|0..9</code>, or <code>{</code>, … — it has to explicitly check them all anyway. Hence it seemed silly to just drop this intrinsic type information on the floor and clump everything back into strings.</p>
<p>Also I had a suspicion that it should affect performance quite significantly, as I had to allocate memory for and copy all those small string pieces. Lots of allocations and copying is never good!</p>
<h2>Process</h2>
<p>I <a href="https://github.com/isagalaev/ijson-rust/commit/02951fef7525c623b3f">started</a> by introducing a dedicated <code>Lexeme</code> type distinguishing between strings, single-character lexemes and everything else under the umbrella term "scalar" (don't grumble about the name, it was destined to go away in any case):</p>
<pre><code>#[derive(Debug, PartialEq)]
pub enum Lexeme {
    String(String),
    Scalar(String),
    OBrace,
    CBrace,
    OBracket,
    CBracket,
    Comma,
    Colon,
}
</code></pre>
<p>If anything, it made the code uglier as there were now two paradigms sitting in the code side by side: typed values and "scalars" that I had to handle the old way:</p>
<pre><code>match lexeme {
    Lexeme::OBracket => Event::StartArray,
    Lexeme::OBrace => Event::StartMap,
    Lexeme::CBracket => Event::EndArray,
    Lexeme::CBrace => Event::EndMap,
    Lexeme::String(s) => Event::String(try!(unescape(s))),
    // The Ugliness boundary :-)
    Lexeme::Scalar(ref s) if s == "null" => Event::Null,
    Lexeme::Scalar(ref s) if s == "true" => Event::Boolean(true),
    Lexeme::Scalar(ref s) if s == "false" => Event::Boolean(false),
    Lexeme::Scalar(s) => {
        Event::Number(try!(s.parse().map_err(|_| Error::Unknown(s))))
    },
    _ => unreachable!(),
}
</code></pre>
<p>Next, the string un-escaping business was moved entirely into the lexer. Even though it was a pretty much verbatim move of a bunch of code from one module to another, it made it obvious that I was actually processing escaped characters <em>twice</em>: first simply to correctly find a terminating <code>"</code> and then to decode all those escapes into raw characters. This proved to be a good optimization opportunity later. It never ceases to amaze me how such simple refactorings sometimes give you a much better insight! Do not ignore them.</p>
<p>Finally, I <a href="https://github.com/isagalaev/ijson-rust/commit/e99526c3415c2cb14b36b2e3c276c071dd99558f">split <code>Lexeme::Scalar</code> into honest numbers, booleans and the null</a>. The code got more readable and more idiomatic all over, and there was much rejoicing!</p>
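<p>For reference, the fully typed lexeme might look something like this — a sketch based on the description above, not the exact definition from the linked commit:</p>
<pre><code>// A sketch of the Lexeme enum after the split: "Scalar" is gone,
// replaced by honest typed variants.
#[derive(Debug, PartialEq)]
pub enum Lexeme {
    String(String),
    Number(f64),
    Boolean(bool),
    Null,
    OBrace,
    CBrace,
    OBracket,
    CBracket,
    Comma,
    Colon,
}
</code></pre>
<p>With this, every match arm in the parser becomes a direct mapping with no string comparisons left.</p>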
<h2>Bumps along the road</h2>
<p>During all those refactorings I had to constantly fiddle with error definitions (of which <a href="http://softwaremaniacs.org/blog/2015/08/26/ijson-in-rust-errors/en/">I wasn't a fan</a>, to begin with). Changing wrapped error types and types of error parameters — all this really fun stuff, you know… This is the price of using a strongly-typed language: having the knowledge codified in two forms, type declarations and the code itself.</p>
<h2>Performance</h2>
<p>I didn't do this entire exercise just to feel better about more idiomatic code (though that'd be reason enough). It was actually the first step in the ongoing performance optimization endeavour that I promised last time.</p>
<p>Cutting right to the chase, this change gave me a <strong>1.5x</strong> performance gain on my 18MB test JSON. Still a far cry though from my reference point, yajl:</p>
<table>
<tr><td>Before refactoring</td><td>0.290 secs</td></tr>
<tr><td>After refactoring</td><td>0.193 secs</td></tr>
<tr><td>yajl</td><td>0.051 secs</td></tr>
</table>
<p>As far as I understand, this gain can be attributed entirely to removing allocations of temporary strings for single-character lexemes. Though I didn't investigate it properly at that point.</p>
<p>See you next time for more optimization fun!</p>
<h1>ijson in Rust: errors</h1>
<p><a href="https://softwaremaniacs.org/blog/2015/08/26/ijson-in-rust-errors/">Permalink</a></p>
<p>While I fully expected to have difficulty switching from the paradigm of raising and catching exceptions to checking return values, I wasn't ready for Rust requiring so much code to implement it properly (i.e., by <a href="http://blog.burntsushi.net/rust-error-handling/">the bible</a>).</p>
<p>So this is the one where I complain…</p>
<p><a name=more></a></p>
<h2>Non-complaint</h2>
<p>I won't be talking about exceptions vs. return values per se. For a language that won't let you omit cases in a <code>match</code> and where type safety is paramount, it totally makes sense to make programmers deal with errors explicitly. Even if it's just saying "drop everything on the floor right here", it's done with an explicit call to <code>panic!</code> or <code>unwrap()</code>, so you can go over them later with a simple text search and replace them with something more sensible.</p>
<p>So if you're coming to it from a dynamic language, like I am, my best advice is to not get upset every time you have to stop and reset the pipeline inside your brain to think about every little unwelcome <code>Result::Err</code> that just refuses to go away. Get used to it :-)</p>
<p>As a result, my code changed <strong>a lot</strong> in a <a href="https://github.com/isagalaev/ijson-rust/compare/builder...errors">myriad of places</a> (not even counting a whole new "errors" module). And it prompted me to consider enforcing tighter invariants on the module boundaries. For example, I now see that instead of dispatching lexemes as <code>Vec<u8></code> and leaving the handling of potential UTF-8 conversion errors to multiple consumers, it's better to contain the conversion within the lexer module and only handle this kind of error in one place (I haven't done it yet).</p>
<h2>Boilerplate</h2>
<p>To the bad part, then…</p>
<p>It is imperative that a library should wrap all the different kinds of errors that might occur within it into an overarching library-specific error type. Implementing it in Rust is straightforward but very laborious.</p>
<p>This is my Error type:</p>
<pre><code>pub enum Error {
    IO(io::Error),
    Utf8(str::Utf8Error),
    Unterminated,
    Escape(String),
    Unexpected(String),
    MoreLexemes,
    Unmatched(char),
    AdditionalData,
}
</code></pre>
<p>To make it actually useful I had to:</p>
<ul>
<li>For all eight variants write a line converting it into a string for display purposes (<code>Display::fmt</code>). All of those lines are unsurprisingly similar looking.</li>
<li>Associate with all of them a short textual description that is slightly different from the one above for no apparent reason.</li>
<li>For the two first variants that wrap lower level errors I had to explicitly write logic saying that their wrapped errors are in fact their immediate causes.</li>
<li>For the same two lower level errors I had to explicitly state that they are convertible into my <code>Error</code> type using those two first variants. That means a separate single-method 4-line <code>impl</code> for each.</li>
</ul>
<p>This took <a href="https://github.com/isagalaev/ijson-rust/blob/errors/src/errors.rs#L14-L75">62 lines</a> of mostly boilerplate and repetitive code.</p>
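<p>To give a flavour of those 62 lines, here's a condensed sketch covering just the two wrapped variants (the exact strings and shapes here are mine; the real errors.rs handles all eight variants):</p>
<pre><code>use std::{fmt, io, str};

#[derive(Debug)]
pub enum Error {
    IO(io::Error),
    Utf8(str::Utf8Error),
    Unterminated,
    // ... the remaining variants
}

// one unsurprisingly similar-looking line per variant
impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            Error::IO(ref e) => write!(f, "I/O error: {}", e),
            Error::Utf8(ref e) => write!(f, "UTF-8 error: {}", e),
            Error::Unterminated => write!(f, "unterminated string"),
        }
    }
}

// a separate single-method impl per wrapped error type, so they
// convert into Error automatically (e.g. inside try!)
impl From<io::Error> for Error {
    fn from(e: io::Error) -> Error {
        Error::IO(e)
    }
}

impl From<str::Utf8Error> for Error {
    fn from(e: str::Utf8Error) -> Error {
        Error::Utf8(e)
    }
}
</code></pre>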
<p>I do feel though that all of this not only should, but <em>could</em> be implemented as some heavily magical <code>#[derive(Error)]</code> macro, at least to an extent. Might be a good project in itself…</p>
<h2>There is no <code>try!</code></h2>
<p>The <a href="https://doc.rust-lang.org/std/macro.try!.html"><code>try!</code></a> macro goes a long way towards relieving you of the burden of doing the obvious thing most of the time you encounter an unexpected error, namely returning it immediately up the stack:</p>
<pre><code>fn foo() -> Result<T> {
    let x = try!(bar()); // checks if bar() resulted in an error and `return`s it, if yes
    // work with an unwrapped value of x safely
}
</code></pre>
<p>However, since it expands into code containing <code>return Result::Err(...)</code> it <em>only</em> works inside a function that returns <code>Result</code>.</p>
<p>Alas, the core method of Iterator — <code>next()</code> — is defined to return a different type, <code>Option</code>. Which means that you can't use <code>try!</code> if you're implementing an iterator. So I had to write my own local variety — <a href="https://github.com/isagalaev/ijson-rust/blob/master/src/errors.rs#L5"><code>itry!</code></a>.</p>
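<p>Such a macro might look like this (a sketch modeled on <code>try!</code>, not necessarily the exact <code>itry!</code> linked above), together with a toy iterator that uses it:</p>
<pre><code>// Inside Iterator::next an error has to come back as Some(Err(...)),
// not Err(...), which is the one thing try! can't do.
macro_rules! itry {
    ($e:expr) => {
        match $e {
            Ok(value) => value,
            Err(error) => return Some(Err(::std::convert::From::from(error))),
        }
    };
}

// a toy iterator parsing integers out of a list of strings
struct Numbers<'a> {
    source: std::slice::Iter<'a, &'a str>,
}

impl<'a> Iterator for Numbers<'a> {
    type Item = Result<i32, std::num::ParseIntError>;
    fn next(&mut self) -> Option<Self::Item> {
        let s = self.source.next()?;
        Some(Ok(itry!(s.parse::<i32>())))
    }
}
</code></pre>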
<h2>Stopping iterators</h2>
<p>Another problem with iterators is that the process doesn't automatically stop upon receiving an error code from the latest iteration. This seems natural to me although I have to say that the good folks at the Rust users forum are convincing me that <a href="https://users.rust-lang.org/t/handling-errors-from-iterators/2551/15">my case is probably not general</a>.</p>
<p>Anyway, I know that my iterators certainly <em>should</em> stop upon the error so I had to implement a <a href="https://github.com/isagalaev/ijson-rust/blob/errors/src/errors.rs#L79-L106">simple wrapper</a> to watch the iterator for errors and start returning <code>None</code>s from then on.</p>
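<p>The idea behind such a wrapper fits in a few lines (a simplified sketch; the names are mine, not the actual ijson-rust code):</p>
<pre><code>// Wraps an iterator of Results; after the first Err it yields only None.
struct StopOnError<I> {
    inner: I,
    errored: bool,
}

impl<I, T, E> Iterator for StopOnError<I>
    where I: Iterator<Item = Result<T, E>>
{
    type Item = Result<T, E>;
    fn next(&mut self) -> Option<Self::Item> {
        if self.errored {
            return None;
        }
        let item = self.inner.next();
        if let Some(Err(_)) = item {
            self.errored = true; // remember the error, stop from now on
        }
        item
    }
}
</code></pre>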
<h1>ijson in Rust: object builder</h1>
<p><a href="https://softwaremaniacs.org/blog/2015/07/09/ijson-in-rust-builder/">Permalink</a></p>
<p>Object builder is what makes a parser actually useful: it takes a sequence of raw parser events and creates strings, numbers, arrays, maps, etc. With this, <a href="http://softwaremaniacs.org/blog/category/ijson-in-rust/en/">ijson</a> is <a href="https://github.com/isagalaev/ijson-rust/tree/builder">functionally complete</a>.</p>
<p><a name=more></a></p>
<h2>Filtering</h2>
<p>Ijson can filter parser events based on a path down JSON hierarchy. For example, in a file like this:</p>
<pre><code>[
    {
        "name": "John",
        "friends": [ ... ]
    },
    {
        "name": "Mary",
        "friends": [ ... ]
    }
    ... gazillion more records ...
]
</code></pre>
<p>… you would get events with their corresponding paths:</p>
<pre><code>"" StartArray
"item" StartMap
"item" Key("name")
"item.name" String("John")
"item" Key("friends")
"item.friends" StartArray
"item.friends.item" ...
"item.friends" EndArray
"item" EndMap
... ...
"" EndArray
</code></pre>
<p>In Python I implemented this by simply pairing each event with its path string in a tuple <code>(path, event)</code>. In Rust though it <em>feels wrong</em>. The language makes you very conscious of every memory allocation you make, so before even having run a single benchmark I already worry about creating a new throwaway String instance for every event.</p>
<p>Instead, my filtering parser now accepts a target path, <a href="https://github.com/isagalaev/ijson-rust/blob/builder/src/builder.rs#L95">splits it and keeps it as a vector of strings</a> which it then <a href="https://github.com/isagalaev/ijson-rust/blob/builder/src/builder.rs#L28">compares with a running path stack</a> which it maintains through the whole iterating process. Maintaining a path stack — also a vector of strings — still feels slow but at least I don't join those strings constantly for the sole purpose of comparing.</p>
<p>By the way, I was pleasantly surprised to find two handy functions in Rust's stdlib:</p>
<ul>
<li>
<p><code>String::split_terminator</code> which works better than the regular <code>String::split</code> for empty strings, as I want an empty vector in this case:</p>
<pre><code>"".split(".") // [""]
"".split_terminator(".") // []
</code></pre>
</li>
<li>
<p><code>Vec::starts_with</code> which has the same semantics as <code>String::starts_with</code> but compares values in a vector. Python doesn't have it, so I somewhat hastily implemented it only to find it in the docs after it was done. Oh well :-)</p>
</li>
</ul>
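<p>Both helpers are easy to demonstrate on dotted paths like the ones above (the concrete values here are illustrative):</p>
<pre><code>// split_terminator gives an empty vector for an empty path:
let target: Vec<&str> = "item.friends".split_terminator('.').collect();
assert_eq!(target, vec!["item", "friends"]);
let empty: Vec<&str> = "".split_terminator('.').collect();
assert!(empty.is_empty());

// starts_with compares vector elements, like its String namesake:
let stack = vec!["item", "friends", "item"];
assert!(stack.starts_with(&target));
assert!(!stack.starts_with(&["other"]));
</code></pre>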
<h2>Building objects</h2>
<p>By now I've flexed my instincts enough so I could write the builder function recursively. It might not seem like a big achievement but I still remember the times just a few weeks ago when I just couldn't persuade the borrow checker to let me do something very similar while I was writing the parser! Now I can't even remember what the problem was. Something silly, for sure…</p>
<p>The <a href="https://github.com/isagalaev/ijson-rust/blob/builder/src/builder.rs#L58">function itself</a> is short but convoluted with <em>slightly</em> ugly differences between handling array and maps (the latter even has the <a href="https://github.com/isagalaev/ijson-rust/blob/builder/src/builder.rs#L69"><code>unreachable!</code> kludge</a> to satisfy the compiler).</p>
<h2>Magical unwrapping</h2>
<p>There's a general problem with deserializing any stream of bytes in a statically typed language: what type should a hypothetical <code>parse_json(blob)</code> return? The answer is, it depends on whatever is in the "blob" and you don't know that in advance.</p>
<p>As far as I know there are two ways of dealing with it:</p>
<ul>
<li>
<p>Wrap all possible value types in a tagged union and confine yourself to tedious unwrapping values on every access: <code>value.as_array().get(0).as_map().get("key").as_int()</code>.</p>
</li>
<li>
<p>Provide a schema for every format you expect from the wire and let some tool generate typed code deserializing bytes into native values of known types.</p>
</li>
</ul>
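<p>The first option might look something like this — a hypothetical sketch, not the actual ijson-rust definitions:</p>
<pre><code>use std::collections::HashMap;

// a tagged union over everything a JSON document can contain
#[derive(Debug, PartialEq)]
enum Value {
    Null,
    Boolean(bool),
    String(String),
    Number(f64),
    Array(Vec<Value>),
    Map(HashMap<String, Value>),
}

// the tedious unwrapping accessors the consumer calls on every access
impl Value {
    fn as_array(&self) -> Option<&Vec<Value>> {
        match *self {
            Value::Array(ref v) => Some(v),
            _ => None,
        }
    }
    fn as_f64(&self) -> Option<f64> {
        match *self {
            Value::Number(n) => Some(n),
            _ => None,
        }
    }
}
</code></pre>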
<p>Since I'm writing a generic JSON parser I went ahead with <a href="https://github.com/isagalaev/ijson-rust/blob/44384c31b4f289d92b425ad62c6b9c63511c00b2/src/builder.rs#L8">wrapped values</a>, leaving unwrapping to a consumer of the library. But then I've found a magical (if badly named) library — <a href="https://doc.rust-lang.org/rustc-serialize/rustc_serialize/index.html">rustc-serialize</a> that can <em>automatically</em> unwrap JSON values into an arbitrarily complex native type:</p>
<pre><code>#[derive(RustcDecodable)]
struct Person {
    name: String,
    friends: Vec<String>,
}

let f = File::open("people.json").unwrap();
let json = Parser::new(f).items("item").next().unwrap();
let result: Person = decode(json).unwrap(); // ← magic happens here
</code></pre>
<p>Let me make it clear: it doesn't just unwrap the top-level <em>struct</em>, it does it all the way down, so in <code>friends</code> you get a real vector of strings, not a json-vector of json-strings. Isn't that cool?!</p>
<p>Magic consists of two parts:</p>
<ul>
<li><code>derive(RustcDecodable)</code> is some macro-thingie that generates code specific to this particular struct that unwraps a JSON value of the same structure. </li>
<li><code>decode(json)</code> is a generic function that works for all decodable types defined in the code, and Rust automatically picks up the right implementation knowing that the result is assigned to a <code>Person</code> variable.</li>
</ul>
<p>Come to think of it, this is in fact the very same "schema + codegen" option with the schema being described directly in Rust and code being generated by the macro system instead of relying on some pesky cross-language IDL and stub-generating build scripts. (Yeah, I still remember Microsoft COM and CORBA :-) )</p>
<h2>Splendours and miseries of traits</h2>
<p>To expose the builder interface I decided to exercise the power of Rust's traits. Instead of hard-glueing the <code>items(prefix)</code> method to <code>Parser</code> I wanted it to work for any type that is an iterator of parser events:</p>
<pre><code>parser.items(""); // Parser itself
parser.prefix("root").items(""); // my own prefixed wrapper around parser
parser.filter(predicate).items(""); // Rust's stdlib Filter type
</code></pre>
<p>In a language that couples interfaces with type definitions (e.g. Java) the last line wouldn't be possible as <code>filter(..)</code> is something declared in the stdlib and it has no idea about my local interfaces.</p>
<p>In a duck-typed language (quack! quack!) it would work by asking an object at run time to turn itself into an iterator and treating whatever it would yield as events. No guarantees of any kind, but very flexible and with no declarations necessary.</p>
<p>Here's where Rust's splendour comes in: you can describe your trait (a.k.a. interface) generically so it will be applicable to any type meeting your conditions, no matter where it is defined:</p>
<pre><code>pub trait Builder where Self: Sized + Iterator<Item=Event> {
    fn items(self, prefix: &str) -> Items<Prefix<Self>> { ... }
}
</code></pre>
<p><code>Self</code> here denotes the type of an object that this trait can be glued onto. We don't specify any base type for that, instead we describe a condition: <code>Sized</code> and <code>Iterator<Item=Event></code> are the traits that this type must have in order to accept a Builder trait. So this literally says that the Builder trait is applicable to any type that is an iterator of parser events (forget about <code>Sized</code> for now.)</p>
<p>This isn't enough, however. A trait itself is only a description of an interface, and usually it needs a separate implementation for every type you want it to work with. However my trait is different: it doesn't really need to know anything about the concrete type of <code>Self</code>, it has all its methods already implemented using features provided by the Iterator trait. Still, even in the case where there's nothing to implement I had to explicitly tell Rust to consider the Builder trait implemented for any type it's implementable for:</p>
<pre><code>impl<T> Builder for T where T: Sized + Iterator<Item=Event> {}
</code></pre>
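<p>Here's a self-contained miniature of the same pattern — a toy event type and a trait with a default method — showing that the blanket impl really does attach the trait to stdlib adapters too:</p>
<pre><code>#[derive(Debug, PartialEq)]
enum Event {
    Null,
    Boolean(bool),
}

// an extension trait: all methods have default bodies built on Iterator
trait Builder: Sized + Iterator<Item = Event> {
    fn count_events(self) -> usize {
        self.count()
    }
}

// the one-line blanket impl that "turns on" the trait for every
// iterator of events, including types defined in the stdlib
impl<T> Builder for T where T: Sized + Iterator<Item = Event> {}
</code></pre>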
<p>All those repeating impls, angle brackets and types feel like boilerplate. And even though it seems like a small price for great flexibility, the hard part is actually finding how these things are supposed to be done. It usually means looking at other code that does something similar. Or bothering other people :-)</p>
<p>Another problem is that the origin of trait methods is completely undiscoverable using the code alone, because you have to import the trait, not individual methods:</p>
<pre><code>use ijson::parser::Builder;
parser.items(""); // Where did items come from? No idea...
</code></pre>
<p>Without help from some clever IDE you're left with guessing and reading docs for all the traits you've got imported in the file.</p>
<p>To be honest, I'd prefer a pure functional interface to all this machinery, so that <code>items()</code>, <code>filter()</code>, <code>prefix()</code> would be stand-alone functions without the need to describe traits grouping them together. But method chaining seems to be idiomatic in Rust so I decided to stick with it.</p>
<h2>One last wart</h2>
<p>Turns out there are no macros for initializing maps of any kind! While you can easily initialize a vector:</p>
<pre><code>let v = vec![1, 2, 3];
</code></pre>
<p>… a map is going to bore you to death before you even get to the third element:</p>
<pre><code>let mut m = HashMap::new();
m.insert("key", "value");
m.insert("key2", "value2");
// ...
</code></pre>
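<p>Nothing stops you from writing such a macro yourself, though. Here's a minimal sketch (<code>hashmap!</code> is my name for it, not a stdlib macro):</p>
<pre><code>// expands a `key => value` list into new() plus a series of insert()s
macro_rules! hashmap {
    ($($k:expr => $v:expr),* $(,)?) => {{
        let mut m = ::std::collections::HashMap::new();
        $( m.insert($k, $v); )*
        m
    }};
}
</code></pre>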
<p>On the other hand, you're much more likely to define custom structs instead of relying on ad-hoc maps.</p>
<h2>What's next</h2>
<p>First, I want to replace all the <code>unwrap</code>s and <code>panic!</code>s with the proper Rustian error handling. Expect some rants!</p>
<p>And then I want to spend some time optimizing performance. Running some quick tests showed that my horribly unoptimized code is only 4 times slower than <a href="https://gist.github.com/isagalaev/523a93f837976c9b3682">C using yajl</a>. I've been expecting much worse, to be honest!</p>
<h1>ijson in Rust: unescape</h1>
<p><a href="https://softwaremaniacs.org/blog/2015/05/28/ijson-in-rust-unescape/">Permalink</a></p>
<p>Today marks a milestone: with implementation of string unescaping my <a href="http://softwaremaniacs.org/blog/category/ijson-in-rust/en/">json parser</a> actually produces entirely correct output! Which doesn't necessarily mean it's easy to use or particularly fast yet. But one step at a time :-)</p>
<p><a name=more></a></p>
<p>The code came out rather small, here's the whole function (<a href="https://github.com/isagalaev/ijson-rust/blob/unescape/src/main.rs#L128">source</a>):</p>
<pre><code>fn unescape(s: &str) -> String {
    let mut result = String::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(ch) = chars.next() {
        result.push(
            if ch != '\\' {
                ch
            } else {
                match chars.next() {
                    Some('u') => {
                        let value = chars.by_ref().take(4).fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap());
                        char::from_u32(value).unwrap()
                    }
                    Some('b') => '\x08',
                    Some('f') => '\x0c',
                    Some('n') => '\n',
                    Some('r') => '\r',
                    Some('t') => '\t',
                    Some(ch) => ch,
                    _ => panic!("Malformed escape"),
                }
            }
        );
    }
    result
}
</code></pre>
<h2>Likes</h2>
<p>Luckily I can make an educated guess about how much memory my resulting string would occupy and allocate it at once with <code>String::with_capacity()</code>. It works because <code>s.len()</code> gives me the length of a UTF-8 string <em>in bytes</em>, so my output is guaranteed to be equal or smaller than the source, because:</p>
<ul>
<li>raw UTF-8 characters are left intact</li>
<li><code>\n</code>, <code>\t</code>, etc. are translated into one byte from two</li>
<li><code>\uXXXX</code> becomes a UTF-8 sequence which occupies no more than the original 6 bytes</li>
</ul>
<p>Look ma, no re-allocations!</p>
<h2>Char by char iteration</h2>
<p>I seriously don't like having to <code>result.push</code> every single character even for strings containing no \-escapes whatsoever (which is the vast majority of strings in real-world JSON). I'd like to be able to walk through a source string and either a) copy chunks between <code>\</code> in bulk or b) if there's none found simply return the source slice converting it to an owned string with <code>to_owned()</code>. But I wasn't yet able to figure out how to approach that.</p>
<p>By the way, I find <code>while let Some(ch) = chars.next()</code> rather brilliant! It loops as long as the iterator returns something that can be destructured into a usable value and handily binds the latter to a local var.</p>
<p>Also, XMPPwocky at #rust IRC channel suggested "to write something on top of a <code>Reader</code>" and "specifically something over a <code>Cursor<Vec<u8>></code>, actually". Though that was prompted by an entirely different discussion.</p>
<h2>Non-obvious <code>.by_ref()</code></h2>
<p>There's this long line in the middle that converts four bytes after <code>\u</code> into a corresponding char:</p>
<pre><code>let value = chars.by_ref().take(4).fold(0, |acc, c| acc * 16 + c.to_digit(16).unwrap());
</code></pre>
<p>What happened without <code>by_ref()</code> was this line <em>stole ownership</em> of <code>chars</code> from the outer <code>while</code> loop, and Rust didn't let me use <code>chars</code> anywhere else.</p>
<p class=note><small>If you aren't familiar with the concept of ownership in Rust, head over to <a href="http://doc.rust-lang.org/book/ownership.html">the official explanation</a>.</small></p>
<p>That was rather surprising because my gut feeling is (or was) that <code>.take(4)</code> is hardly any different than calling <code>.next()</code> four times in a loop, and yet the latter leaves the original iterator alone with its owner.</p>
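<p>The effect is easy to reproduce in isolation: <code>take()</code> consumes its iterator by value, and <code>by_ref()</code> hands it a mutable borrow instead:</p>
<pre><code>let s = "abcdef";
let mut chars = s.chars();
// take(3) works on a by_ref() borrow, not on `chars` itself:
let first: String = chars.by_ref().take(3).collect();
assert_eq!(first, "abc");
// `chars` is still ours and picks up right where take() stopped:
assert_eq!(chars.next(), Some('d'));
</code></pre>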
<h2>Hex conversion</h2>
<p>You may notice that I convert hex numbers into chars manually with <code>.fold()</code> (aka "reduce" in other languages) even though Rust has <a href="http://doc.rust-lang.org/std/primitive.u32.html#method.from_str_radix"><code>from_str_radix(16)</code></a> for that. I <a href="https://github.com/isagalaev/ijson-rust/blob/d310bbb3c9cca53155da1f5af26ff3877e9c475f/src/main.rs#L137-L138">used it at first</a> but I had to use a separate <code>&str</code> which I was only able to get by allocating a temporary String. I didn't like an extra allocation so I resorted to the manual way which, frankly, isn't all that bad.</p>
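<p>Both approaches side by side, on four illustrative hex digits — the fold gets the same answer without the temporary String:</p>
<pre><code>let digits = ['0', '4', '1', 'f'];

// manual fold, no allocation:
let manual = digits.iter().fold(0u32, |acc, c| acc * 16 + c.to_digit(16).unwrap());

// from_str_radix needs a &str, hence the temporary String here:
let s: String = digits.iter().cloned().collect();
let parsed = u32::from_str_radix(&s, 16).unwrap();

assert_eq!(manual, parsed);
assert_eq!(manual, 0x041f);
</code></pre>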
<h1>ijson in Rust: the parser</h1>
<p><a href="https://softwaremaniacs.org/blog/2015/05/21/ijson-in-rust-parser/">Permalink</a></p>
<p>It took me a while but I've finally implemented a <a href="https://github.com/isagalaev/ijson-rust">working parser</a> on top of the lexer in my little <a href="http://softwaremaniacs.org/blog/2015/04/15/ijson-in-rust/en/">Rust learning project</a>. I learned a lot and feel much more comfortable with the language by now. In the meantime I even managed to get out to a Rust Seattle meetup, meet new folks and share an idea about doing some coding together in the future. Let's see how it'll work out.</p>
<p><a name=more></a></p>
<p><a name=yield></a></p>
<h2>Power of <code>yield</code></h2>
<p>First, a digression. It's not about Rust at all, but I'll get to my point eventually, I promise!</p>
<p>When I was coding the <a href="https://github.com/isagalaev/ijson/blob/master/ijson/backends/python.py">same problem in Python</a> I didn't fully appreciate the expressive power of generators, I simply used them because it seemed to be the most natural way. Have a look:</p>
<pre><code>def parse_value(lexer, symbol=None):
    # ...
    elif symbol == '[':
        yield from parse_array(lexer)
    elif symbol == '{':
        yield from parse_object(lexer)
    # ...

def parse_array(lexer):
    yield ('start_array', None)
    symbol = next(lexer)
    if symbol != ']':
        while True:
            yield from parse_value(lexer, symbol)
            symbol = next(lexer)
            if symbol == ']':
                break
            if symbol != ',':
                raise UnexpectedSymbol(symbol)
            symbol = next(lexer)
    yield ('end_array', None)
</code></pre>
<p>Parsing JSON (or almost anything, for that matter) requires keeping a state describing where in the structure you are now: are you expecting a scalar value, or an object key, or a comma, etc. Also, since arrays and objects can be nested you have to keep track of them opening and closing in the correct order in some sort of stack.</p>
<p>Magic of the <code>yield</code> keyword lets you leave a function and then return to the same place, implicitly giving you both the state and the stack for free:</p>
<ul>
<li>
<p>The state is represented by an execution point. For example, after you yielded a <code>'start_array'</code> event, the next iteration will continue from the same place, ready to check for a closing bracket or the first value in the array. In other words current state is described by the last executed <code>yield</code>.</p>
</li>
<li>
<p>The stack is represented by, well, your runtime's call stack: you call <code>parse_array</code> from <code>parse_value</code> and you'll be back whenever that nested array finishes parsing. No need to check for wrong values or an empty stack.</p>
</li>
</ul>
<p>With both of those facilities out of the way the code simply represents the grammar in the natural order. All of that thanks to the semantics of <code>yield</code>.</p>
<h2>Going the hard, explicit, low-level way</h2>
<p>Rust doesn't have <code>yield</code>. Which means iteration is implemented by repeatedly calling the <code>next()</code> function, and you have to explicitly keep both the state and the stack in the iterator object between the calls.</p>
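<p>As a toy illustration of what "keeping the state explicitly" means, here's an iterator that emits start/end events around items, with the state machine spelled out by hand (this is my simplified example, not the parser's actual states):</p>
<pre><code>// which event comes next lives in `state` instead of an execution point
#[derive(Clone, Copy)]
enum State {
    Start,
    Items(usize),
    Done,
}

struct Events {
    state: State,
    count: usize, // how many items to emit between start and end
}

impl Iterator for Events {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        match self.state {
            State::Start => {
                self.state = State::Items(0);
                Some("start_array".to_owned())
            }
            State::Items(i) if i < self.count => {
                self.state = State::Items(i + 1);
                Some(format!("item{}", i))
            }
            State::Items(_) => {
                self.state = State::Done;
                Some("end_array".to_owned())
            }
            State::Done => None,
        }
    }
}
</code></pre>
<p>Every <code>yield</code> in the Python version becomes a state transition plus a <code>return</code> here, which is exactly the bookkeeping <code>yield</code> did for free.</p>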
<p>This is where I spent some number of days trying different approaches, figuring out which states I need, whether I need a loop processing non-significant lexemes like <code>:</code> and <code>,</code> or whether a recursive call to <code>next()</code> would do the job, stuff like that. It was hard, partly because of my unfamiliarity with the language and partly because I'm definitely not the best algorithmist out there. But ultimately there's always a price you pay in productivity when working in a typed language, especially one with a strict policy on using pointers.</p>
<p>In Python, I tend to work top-down, starting with roughly sketching the whole algorithm making it just barely runnable to see if it works at all as early as possible. I rely on the language letting me be sloppy with types and error handling and leaving whole parts of the code essentially non-working if I'm not going to run them just yet.</p>
<p>In Rust, the compiler doesn't want to hear your pleas and promises that you're going to clean up the mess later: <em>everything</em> must be tidied up and compiled, period. Want to play with adding a flag to one of the states? Sure, just go ahead and update the definition a couple of screens up and the initialization a couple of screens down. Want to see if you can call that code recursively? Well, this is <code>next(&mut self)</code> — a function taking a <em>mutable</em> reference and you can't have more than one, ever. So no, you can't call that one recursively; you'll probably have to extract that part of the logic into another function and go through some amount of yak shaving making sure it's pure and doesn't want a mutable <code>self</code>. At which point it doesn't look like a quick check anymore.</p>
<p>Constant context switching between thinking about overall architecture and implementation details, such as reference herding, is the hardest part of Rust for me right now. I think it's unavoidable, even though I'm getting better at it :-)</p>
<p>It's all not in vain, of course. Types <em>do</em> help in reasoning about code. If you see a function taking an immutable reference you know it won't change it without looking through its code. You also know it won't suddenly become non-pure later on.</p>
<h2>Enums</h2>
<p>Enough whining, though. If there's one thing that I'm completely in love with right now, it's <a href="http://doc.rust-lang.org/book/enums.html">enums</a>! What's so exciting about <em>enums</em>, you ask? Well, first of all, they're misleadingly named. They are really <a href="http://www.wikiwand.com/en/Tagged_union">tagged unions</a> that represent a value coming in several variants. Here's an example.</p>
<p>As my parser goes through a JSON it yields <em>events</em>, like a start of an object, a key, a number, a boolean, an end of an object, etc. You want to know two things about an event: what it is and, for some of them, the actual value associated with it. Here's how the type looks in Rust:</p>
<pre><code>enum Event {
Null,
Boolean(bool),
String(String),
Key(String),
Number(f64),
StartArray,
EndArray,
StartMap,
EndMap,
}
</code></pre>
<p>The wonderful part is that when processing this you get safe typed values without doing any casting:</p>
<pre><code>match event {
    Boolean(value) => ..., // `value` is bool
    Number(value) => ...,  // `value` is f64
    StartArray => ...,     // you don't need no values here, so you're not getting any
    // ...
}
</code></pre>
<p>Neat, right!? And <a href="https://doc.rust-lang.org/book/patterns.html">pattern matching</a> can be way more elaborate, by the way.</p>
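<p>For a taste, here's a sketch of a fancier <code>match</code> on a trimmed-down copy of that <code>Event</code> enum, with a guard narrowing a variant further and a <code>ref</code> binding borrowing a string instead of moving it:</p>

```rust
enum Event {
    Null,
    Boolean(bool),
    Number(f64),
    Key(String),
}

fn describe(event: &Event) -> String {
    match event {
        // A guard can split one variant into several cases.
        &Event::Number(n) if n < 0.0 => format!("negative number {}", n),
        &Event::Number(n) => format!("number {}", n),
        // `ref` borrows the String out of the enum instead of moving it.
        &Event::Key(ref name) => format!("key {:?}", name),
        &Event::Boolean(b) => format!("boolean {}", b),
        &Event::Null => "null".to_string(),
    }
}

fn main() {
    assert_eq!(describe(&Event::Number(-1.5)), "negative number -1.5");
    assert_eq!(describe(&Event::Key("id".to_string())), "key \"id\"");
}
```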
<p>What you can't do though is to simply check the value of an enum with an if:</p>
<pre><code>if self.state == State::Closed { ... }
// error: binary operation `==` cannot be applied to type `State` [E0369]
</code></pre>
<p>This tripped me up pretty severely at one point when I organized my whole logic around checking for specific states here and there. For example, an object key in JSON is not awfully different from any scalar value: you're parsing a string and then just call it "key" instead of "string". Nope, didn't let me do that, no sir. Had to reshape all that state handling into one big <code>match</code> with very similar looking parts.</p>
<p>But Rust was right after all. Because <a href="https://github.com/isagalaev/ijson-rust/blob/parser/src/main.rs#L218">treating an object key</a> is in fact <em>completely</em> different from <a href="https://github.com/isagalaev/ijson-rust/blob/parser/src/main.rs#L184">treating scalar values</a> if you take into account the dynamics of states and the stack. They won't even share the string parsing code because keys don't need handling of backslash escapes.</p>
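<p>To be fair to Rust, the <code>==</code> above fails only because the enum doesn't implement the <code>PartialEq</code> trait; deriving it makes the comparison compile. A minimal sketch (this <code>State</code> is a made-up miniature, not my real one):</p>

```rust
// Deriving PartialEq is what makes `==` available on the enum.
#[derive(PartialEq)]
enum State {
    Event(bool),
    Closed,
}

fn is_closed(state: &State) -> bool {
    *state == State::Closed
}

fn main() {
    assert!(is_closed(&State::Closed));
    assert!(!is_closed(&State::Event(true)));
}
```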
<h2><code>if let</code></h2>
<p>Rust extensively uses two enum types throughout the standard library: <a href="https://doc.rust-lang.org/std/option/enum.Option.html">Option</a> and <a href="https://doc.rust-lang.org/std/result/enum.Result.html">Result</a>. I was under the impression that the inability to use <code>if</code> meant that whenever you work with a function returning any of those you <em>have</em> to handle it with <code>match</code> and lose any dreams of composability.</p>
<p>However a few days ago I stumbled upon an excellent article "<a href="http://blog.burntsushi.net/rust-error-handling/">Error Handling in Rust</a>" which I wholeheartedly recommend to any Rust beginner. From it I learned about <code>if let</code>: it follows the same rules for matching as <code>match</code>, so you can write this:</p>
<pre><code>if let Event::Number(value) = event {
// handle only the case when `event` is a Number
}
</code></pre>
<p>This is actually an <em>assignment</em>, so you have <code>=</code> instead of <code>==</code> and an rvalue on the right. It can look even weirder with parameter-less enum variants:</p>
<pre><code>if let State::Closed = state {
// - Did you just assign a variable to a value, Bob?
// - Shut up and handle your closed state.
}
</code></pre>
<p>However what I miss is something like <code>if not let</code>, basically an else-clause of that if. Say, I want to pop a value from the stack and make sure it's the one I expect (and the stack is not empty, of course). Currently I do this:</p>
<pre><code>match self.stack.pop() {
Some(b'[') => (),
_ => panic!("Unmatched ]"),
}
</code></pre>
<p>All this business with the do-nothing thingie <code>()</code> and the default case <code>_ =></code> is not exactly pretty. What I really want is this:</p>
<pre><code>if not let Some(b'[') = self.stack.pop() {
panic!("Unmatched ]") // or something more sensible than panic! when I get to it
}
</code></pre>
<p>But it's nitpicking, of course :-)</p>
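<p>In the meantime, an <code>if let</code> with an empty body and an <code>else</code> clause gets reasonably close to that wish. A sketch against a stand-alone stack (returning a <code>Result</code> instead of panicking, just to show the shape):</p>

```rust
// A made-up stack check: empty `if let` body, `else` handles the mismatch.
fn check_close(stack: &mut Vec<u8>) -> Result<(), String> {
    if let Some(b'[') = stack.pop() {
        // matched the expected opening bracket: nothing else to do
    } else {
        return Err("Unmatched ]".to_string());
    }
    Ok(())
}

fn main() {
    let mut stack = vec![b'{', b'['];
    assert!(check_close(&mut stack).is_ok());
    assert!(check_close(&mut stack).is_err()); // now a `{` is on top
}
```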
<h2>Peekable lexer</h2>
<p>It occurred to me at one point that my Lexer, being an iterator, lacks the ability to tell me which lexeme would come next without actually consuming it. In other words, it wasn't "peekable". Without this, for example, I had to call out to the <a href="https://github.com/isagalaev/ijson-rust/blob/44f8ef5056abe18523ba44db5ff73fc64fd18384/src/main.rs#L210">entire non-pure process of handling another parser state</a> when, say, I got a closing brace <code>}</code> while expecting an object key.</p>
<p>So I went ahead and replaced an Iterator trait on the Lexer with a <a href="https://github.com/isagalaev/ijson-rust/commit/1e74241e80e927c2617e93b788e32fc7b299382c">passable custom implementation</a> sporting methods <code>lookup()</code> and <code>consume()</code> (learning about <a href="https://doc.rust-lang.org/std/option/enum.Option.html#method.take"><code>Option.take()</code></a> along the way).</p>
<p>Well, turns out they already have this thing, <a href="http://doc.rust-lang.org/core/iter/struct.Peekable.html">right there in the standard library</a>, working on top of any basic Iterator. Now my parser holds a <code>Peekable<Lexer></code> initialized simply with <code>lexer(file).peekable()</code>, and I can <code>self.lexer.peek()</code> as well as <code>self.lexer.next()</code>. Great! Love deleting custom code!</p>
<p class=note><small>By the way, <a href="https://pythonhosted.org/more-itertools/api.html#more_itertools.peekable">you can have <code>peekable</code> in Python</a> too, just not from the standard library.</small></p>
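<p>Here's the idea in miniature on a plain byte iterator (the real thing wraps the Lexer, of course):</p>

```rust
fn main() {
    // Any Iterator can be wrapped: .peekable() comes from the standard library.
    let mut lexemes = "{}".bytes().peekable();

    // peek() looks at the next byte without consuming it...
    assert_eq!(lexemes.peek(), Some(&b'{'));
    // ...so the subsequent next() still returns that same byte.
    assert_eq!(lexemes.next(), Some(b'{'));
    assert_eq!(lexemes.next(), Some(b'}'));
    assert_eq!(lexemes.next(), None);
}
```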
<h2>TODO</h2>
<p>The library is shaping up nicely, here's what's next:</p>
<ul>
<li>
<p>Processing backslash escapes in strings (hate them!)</p>
</li>
<li>
<p>Proper error handling. All those <code>panic!</code>s strewn around the code are no good, as Andrew Gallant <a href="http://blog.burntsushi.net/rust-error-handling/">tells us</a>. It's okay only when you're just starting.</p>
</li>
<li>
<p>Benchmarking. I'm dying to see how it compares to <a href="https://lloyd.github.io/yajl/">yajl</a>.</p>
</li>
<li>
<p>Tests. Weep all you TDD fan boys, I'm writing tests after my code! Because I first want it to work at all before I make sure it works for everyone else. For now a specially crafted <a href="https://github.com/isagalaev/ijson-rust/blob/parser/test.json">test.json</a> would do just fine.</p>
</li>
<li>
<p>Implementing the rest of ijson functionality, like <a href="https://github.com/isagalaev/ijson/blob/master/README.rst">prefixed events and an object builder for <code>items()</code></a>. I really do intend to make it a usable library.</p>
</li>
</ul>
<h2>Call for help</h2>
<p>At this point I'd really love some code review and expert advice from fellow Rustaceans. If you have time and want to help, please feel free to <a href="https://github.com/isagalaev/ijson-rust">file pull requests</a> or just leave comments here.</p>
<p>Thank you!</p>
<h1>ijson in Rust</h1>
<p><small>2016-02-11T22:17:07.557000-08:00 · https://softwaremaniacs.org/blog/2015/04/15/ijson-in-rust/</small></p>
<p>I had this idea for a while to learn a few modern languages by porting my <a href="https://github.com/isagalaev/ijson">iterative JSON parser</a> written in Python as a test case. It only makes sense because, unlike most tutorials, it provides you with a <em>real</em> real-world problem and in the end you might also get a useful piece of code.</p>
<p>I started with <a href="http://www.rust-lang.org/">Rust</a>, and I already have plans to do the same with Go and Clojure afterwards. I won't be giving you any introduction to Rust though, as there's <a href="https://www.google.com/?q=rust+language+review">a lot of those around the Web</a>. I'll try to share what I didn't find in those.</p>
<p><a name=more></a></p>
<h2>Resources</h2>
<ul>
<li>The <a href="http://doc.rust-lang.org/book/">online book</a> is a very good starting material, it gives a wide shallow overview of the language principles and provides pointers on where to go next.</li>
<li>The <a href="http://doc.rust-lang.org/std/index.html">API docs</a> are essential but are hard to navigate for a beginner because Rust tends to implement everything in myriads of small interfaces. You can't simply have a flat list of everything you can do with, say, a String. Instead, you get the top level of a <a href="http://doc.rust-lang.org/std/str/">non-obvious hierarchy of features</a>. One of the ways around that is asking Google things like "rust convert int to str" or "rust filter sequence".</li>
<li>When Google doesn't help you've got a <a href="http://users.rust-lang.org/">user forum</a> and the IRC channel #rust on irc.mozilla.org. Both are very much alive and haven't yet failed me a single time!</li>
</ul>
<h2>Lexer</h2>
<p>After a few days of fumbling around, feeling incredibly dense and switching from condemning the language to praising it every 15 minutes I've got a <a href="https://github.com/isagalaev/ijson-rust/blob/4195be607cbf3f1068d8cd0d9c35952461be1390/src/main.rs">working JSON lexer</a>. It's still in the playground mode: short, clumsy and not structured in any meaningful way. Following are some notes on the language.</p>
<h2>Complexity</h2>
<p>The amount of control you have over things is staggering. Which is another way of saying that the language is rather complicated. From the get-go you worry about the difference between strings and string slices (pointers into strings), values on the heap ("boxed") vs. values on the stack and dynamic vs. static dispatch for calling methods of traits. Feels very opposite to what I'm used to in Python, but on the other hand I <em>did</em> try to come out of my comfort zone :-)</p>
<p>Here's a little taste of that. The Lexer owns a fixed buffer of bytes within which it searches for lexemes and returns them one by one as steps of an iteration. So my first idea was to define an iterator of pointers ("slices" in Rust parlance) into that buffer to avoid copying each lexeme into its own separate object:</p>
<pre><code>impl Iterator for Lexer {
type Item = &[u8]; // a "pointer" to an array of unsigned bytes
// ...
}
</code></pre>
<p>This turns out to be <em>impossible</em>, because Rust wants to know the lifetime of pointers but in this case it simply can't tell how a yielded pointer is related to the lifetime of the Lexer's internal buffer. It doesn't know who created that Lexer object or how; it's not guaranteed to be the same block of code that now iterates over it. Since you can't have a pointer to something in limbo, you have to construct a dynamic, <em>ownable</em> vector of bytes and return it from an iteration step, so a consumer would hold onto it independently of the source buffer:</p>
<pre><code>impl Iterator for Lexer {
type Item = Vec<u8>; // a growable vector of bytes
fn next(&mut self) -> Option<Vec<u8>> { // don't mind the Option<> part
let mut result = vec![];
// ....
result.extend(self.buf[start..self.pos].iter().cloned()); // more on that later…
Some(result)
}
}
</code></pre>
<p>By the way, this is the kind of learning experience that only comes with a <em>real</em> real-world task. Tutorials tend to avoid this kind of messiness.</p>
<p>About that <code>.iter().cloned()</code> thing… Turns out there are quite a few subtly different ways of pushing an array of bytes into a vector:</p>
<ul>
<li><code>vector.push_all()</code> is the easiest one but it complains that it's being deprecated in favor of <code>.extend()</code>;</li>
<li><code>vector.extend()</code> wants an iterator, which you have to create explicitly with <code>.iter()</code>; it yields pointers to bytes instead of bytes, so you have to explicitly dereference and copy them with <code>.cloned()</code>;</li>
<li><code>vector.write()</code> does accept an array of bytes, but(!) since it's an implementation of an I/O protocol it might return an error that Rust won't let you ignore silently, even though it can't really happen here.</li>
</ul>
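<p>Side by side, the two non-deprecated variants look like this (a sketch; I use <code>write_all()</code> rather than <code>write()</code> to sidestep the partial-write case):</p>

```rust
use std::io::Write;

fn main() {
    let buf = [1u8, 2, 3, 4, 5];
    let (start, pos) = (1, 4);

    // The iterator route: .iter() yields &u8, .cloned() turns them into u8.
    let mut via_extend: Vec<u8> = vec![];
    via_extend.extend(buf[start..pos].iter().cloned());
    assert_eq!(via_extend, vec![2, 3, 4]);

    // The Write route: takes the slice directly, but returns an io::Result
    // that the compiler won't let you drop silently.
    let mut via_write: Vec<u8> = vec![];
    via_write.write_all(&buf[start..pos]).unwrap();
    assert_eq!(via_write, via_extend);
}
```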
<p>Compare this to something like <code>vector.extend(iterable)</code> in Python where <code>iterable</code> can be either an iterator or something that can become an iterator — the language doesn't care. It's an excellent example of what people mean when talking about "better productivity of dynamic languages". (Which comes at the price of performance, of course — I'm not trying to pretend that there's one single true way here.)</p>
<h2>Polymorphism</h2>
<p>Rust solves the <a href="http://c2.com/cgi/wiki?ExpressionProblem">Expression problem</a> with "traits". Traits are almost the same thing as interfaces in Java except that objects themselves don't claim to adhere to interfaces. Instead, you first define your data structures and then implement traits <em>for them</em> separately (Clojurists would recognize their Protocols here). This separation allows you to bind your traits to external objects and vice-versa. It also means that methods tend to group into rather fine-grained bundles. It does feel kind of right but might complicate maintenance and is harder to document.</p>
<p>What's really refreshing though and what feels <em>definitely</em> right is that there's no classical object inheritance. Begone rigid class hierarchies!</p>
<h2>Error handling</h2>
<p>Rust uses the idea of return values as <a href="http://en.wikipedia.org/wiki/Tagged_union">tagged unions</a> representing both a successful result and an error and leaves it to the caller to analyze it. Thankfully, Rust enforces error checking: you simply cannot access a successful value implicitly without indicating how you want to deal with an error.</p>
<p>You either have to manually match all possible variants of a result (the compiler won't let you omit the <code>Err</code> part):</p>
<pre><code>let result = str::from_utf8(bytes);
match result {
    Err(e) => ..., // deal with a decoding error `e`
    Ok(s) => ...,  // deal with a string `s`
}
</code></pre>
<p>Or you can use various helpers:</p>
<pre><code>// Turn Ok/Error thing into a subtly different form <YourType>/None, losing the error
let s = str::from_utf8(bytes).ok();
// Insist further on having a non-empty value defaulting to a complete program stop
// (panic!) if it's not available
let s = str::from_utf8(bytes).ok().unwrap();
// Obtain a final value of your type doing an early return from the function
// in case of an error
let s = try!(str::from_utf8(bytes));
</code></pre>
<p>To be honest, I'm not a fan of this approach and am personally very much content with traditional exceptions. But that calls for a whole other blog post (already drafted). For now I tend to use <code>panic!()</code> for every error and not bother but I will have to replace them all eventually with something more in line with the Rust philosophy.</p>
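<p>For the record, here's what that eventual shape might look like on a little made-up helper: the explicit <code>match</code> is exactly the early return that <code>try!</code> expands to, written out by hand.</p>

```rust
use std::str;

// A sketch, not the library's actual API: decode a lexeme's bytes into
// an owned String, handing the error to the caller instead of panicking.
fn lexeme_to_string(bytes: &[u8]) -> Result<String, str::Utf8Error> {
    let s = match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(e) => return Err(e), // the early return try! would generate
    };
    Ok(s.to_string())
}

fn main() {
    assert_eq!(lexeme_to_string(b"true").unwrap(), "true");
    assert!(lexeme_to_string(&[0xff]).is_err()); // 0xff is not valid UTF-8
}
```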
<p>Speaking of exceptions, tracebacks (or "stack traces") in Rust are displayed backwards, as <a href="http://yellerapp.com/posts/2015-01-22-upside-down-stacktraces.html">in most languages</a>. The problem is aggravated by the Rust compiler having <em>very</em> verbose error messages and trying to display as many of them as possible in one go. I understand how it makes sense if what you're compiling is a huge browser engine but I'm really tired of scrolling my terminal window back and forth. I miss Python :-(</p>
<h2>Modularity</h2>
<p>Modularity in Rust isn't bound to the file system and uses somewhat misleading terminology:</p>
<ul>
<li>
<p>A <strong>crate</strong> is what I'd actually call a module or a package. It's something that compiles into a single library (or an executable). Its source can consist of any number of files located anywhere; they're all explicitly pulled into the crate from its root source file.</p>
</li>
<li>
<p>A <strong>module</strong> is simply an arbitrary namespace to contain some stuff you want grouped for some reason. A file can define several modules, and they can be nested.</p>
</li>
</ul>
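<p>A quick sketch of the module part: one file, several nested namespaces, no file system involved (the names here are made up):</p>

```rust
// Paths mirror the nesting of `mod` blocks, not the directory layout.
mod lexer {
    pub fn name() -> &'static str {
        "lexer"
    }

    // Modules nest to arbitrary depth within the same file.
    pub mod tokens {
        pub fn name() -> &'static str {
            "lexer::tokens"
        }
    }
}

fn main() {
    assert_eq!(lexer::name(), "lexer");
    assert_eq!(lexer::tokens::name(), "lexer::tokens");
}
```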
<p>What's important is that Rust comes bundled with <code>cargo</code> — an official tool that does building, packaging and installation. And there's an official packages repository. I strongly believe a language cannot be considered modern without one (sorry, C++!)</p>