The object builder is what makes a parser actually useful: it consumes a sequence of raw parser events and creates strings, numbers, arrays, maps, etc. With it, ijson is functionally complete.
Filtering
Ijson can filter parser events based on a path down the JSON hierarchy. For example, in a file like this:
[
    {
        "name": "John",
        "friends": [ ... ]
    },
    {
        "name": "Mary",
        "friends": [ ... ]
    }
    ... gazillion more records ...
]
… you would get events with their corresponding paths:
"" StartArray
"item" StartMap
"item" Key("name")
"item.name" String("John")
"item" Key("friends")
"item.friends" StartArray
"item.friends.item" ...
"item.friends" EndArray
"item" EndMap
... ...
"" EndArray
In Python I implemented this by simply pairing each event with its path string in a tuple (path, event). In Rust, though, this feels wrong. The language makes you very conscious of every memory allocation, so before even running a single benchmark I already worried about creating a new throwaway String instance for every event.
Instead, my filtering parser now accepts a target path, splits it, and keeps it as a vector of strings, which it then compares with a running path stack that it maintains throughout the whole iteration. Maintaining a path stack (also a vector of strings) still feels slow, but at least I don't join those strings constantly for the sole purpose of comparing them.
By the way, I was pleasantly surprised to find two handy functions in Rust's stdlib:
- String::split_terminator, which works better than the regular String::split for empty strings, as I want an empty vector in this case:

    "".split(".")            // [""]
    "".split_terminator(".") // []

- Vec::starts_with, which has the same semantics as String::starts_with but compares values in a vector. Python doesn't have it, so I somewhat hastily implemented it myself, only to find it in the docs after it was done. Oh well :-)
Building objects
By now I've flexed my instincts enough that I could write the builder function recursively. It might not seem like a big achievement, but I still remember the times just a few weeks ago when I couldn't persuade the borrow checker to let me do something very similar while writing the parser! Now I can't even remember what the problem was. Something silly, for sure…
The function itself is short but convoluted, with slightly ugly differences between handling arrays and maps (the latter even needs an unreachable!
kludge to satisfy the compiler).
Magical unwrapping
There's a general problem with deserializing any stream of bytes in a statically typed language: what type should a hypothetical parse_json(blob)
return? The answer is: it depends on whatever is in the "blob", and you don't know that in advance.
As far as I know, there are two ways of dealing with it:

- Wrap all possible value types in a tagged union and resign yourself to tediously unwrapping values on every access: value.as_array().get(0).as_map().get("key").as_int().

- Provide a schema for every format you expect from the wire and let some tool generate typed code that deserializes bytes into native values of known types.
Since I'm writing a generic JSON parser, I went ahead with wrapped values, leaving the unwrapping to the consumer of the library. But then I found a magical (if badly named) library — rustc-serialize — that can automatically unwrap JSON values into an arbitrarily complex native type:
#[derive(RustcDecodable)]
struct Person {
name: String,
friends: Vec<String>,
}
let f = File::open("people.json").unwrap();
let json = Parser::new(f).items("item").next().unwrap();
let result: Person = decode(json).unwrap(); // ← magic happens here
Let me make it clear: it doesn't just unwrap the top-level struct, it does it all the way down, so in friends
you get a real vector of strings, not a JSON vector of JSON strings. Isn't that cool?!
The magic consists of two parts:

- derive(RustcDecodable) is some macro-thingie that generates code specific to this particular struct, unwrapping a JSON value of the same structure.

- decode(json) is a generic function that works for all decodable types defined in the code; Rust automatically picks the right implementation knowing that the result is assigned to a Person variable.
Come to think of it, this is in fact the very same "schema + codegen" option, with the schema described directly in Rust and the code generated by the macro system instead of relying on some pesky cross-language IDL and stub-generating build scripts. (Yeah, I still remember Microsoft COM and CORBA :-) )
Splendours and miseries of traits
To expose the builder interface I decided to exercise the power of Rust's traits. Instead of hard-gluing the items(prefix)
method to Parser
, I wanted it to work for any type that is an iterator of parser events:
parser.items(""); // Parser itself
parser.prefix("root").items(""); // my own preifxed wrapper around parser
parser.filter(predicate).items(""); // Rust's stdlib Filter type
In a language that couples interfaces with type definitions (e.g. Java) the last line wouldn't be possible, as filter(..)
is declared in the stdlib, which has no idea about my local interfaces.
In a duck-typed language (quack! quack!) it would work by asking an object at run time to turn itself into an iterator and treating whatever it would yield as events. No guarantees of any kind, but very flexible and with no declarations necessary.
Here's where Rust's splendour comes in: you can describe your trait (a.k.a. interface) generically so it will be applicable to any type meeting your conditions, no matter where it is defined:
pub trait Builder where Self: Sized + Iterator<Item=Event> {
fn items(self, prefix: &str) -> Items<Prefix<Self>> { ... }
}
Self here denotes the type of the object that this trait can be glued onto. We don't specify any base type for it; instead we describe a condition: Sized and Iterator<Item=Event> are the traits this type must implement in order to accept the Builder trait. So this literally says that the Builder trait is applicable to any type that is an iterator of parser events (forget about Sized for now).
This isn't enough, however. A trait itself is only a description of an interface, and usually it needs a separate implementation for every type you want it to work with. My trait is different, though: it doesn't need to know anything about the concrete type of Self
, since all its methods are already implemented in terms of the Iterator trait. Still, even when there's nothing left to implement, I had to explicitly tell Rust to consider the Builder trait implemented for every type it's implementable for:
impl<T> Builder for T where T: Sized + Iterator<Item=Event> {}
All those repeating impls, angle brackets and types feel like boilerplate. And even though it seems like a small price for great flexibility, the hard part is actually finding how these things are supposed to be done. It usually means looking at other code that does something similar. Or bothering other people :-)
Another problem is that the origin of trait methods is completely undiscoverable using the code alone, because you have to import the trait, not individual methods:
use ijson::parser::Builder;
parser.items(""); // Where did items come from? No idea...
Without help from some clever IDE you're left with guessing and reading docs for all the traits you've got imported in the file.
To be honest, I'd prefer a pure functional interface to all this machinery, so that items()
, filter()
and prefix()
would be stand-alone functions without the need to describe traits grouping them together. But method chaining seems to be idiomatic in Rust, so I decided to stick with it.
One last wart
Turns out there are no macros for initializing maps of any kind! While you can easily initialize a vector:
let v = vec![1, 2, 3];
… a map is going to bore you to death before you even get to the third element:
let mut m = HashMap::new();
m.insert("key", "value");
m.insert("key2", "value2");
// ...
On the other hand, you're much more likely to define custom structs instead of relying on ad-hoc maps.
What's next
First, I want to replace all the unwrap
s and panic!
s with proper Rustian error handling. Expect some rants!
And then I want to spend some time optimizing performance. Some quick tests showed that my horribly unoptimized code is only 4 times slower than C using yajl. I was expecting much worse, to be honest!