ijson on PyPy, Episode 3: New parsing

It's a funny thing when, after neglecting your project for a year, you get a question on whether it's orphaned and then suddenly find yourself hacking on it for a few days straight… Knowing that your work is needed and appreciated is the greatest motivator!

Anyway… The news I wanted to share is that you can now use ijson with different parsing backends and there's a sizable speed-up when running under PyPy.

Backends

Originally ijson was a ctypes wrapper around yajl, which some time ago reached its next major version and introduced incompatible API changes. One option was to simply switch ijson to the new yajl 2.x API, but I wanted to keep it working on current Ubuntu systems, which only ship yajl 1.x. Instead I refactored the library to have several backends so it can support both versions of yajl. The backend system also neatly accommodates the experimental pure Python parser that used to live in a separate branch, lost and forgotten.

To use a specific backend you import it explicitly:

import ijson.backends.yajl as ijson
ijson.parse(...)

You can also still just import ijson, which is meant to pick the best backend for the current environment by trying "yajl2", "yajl" and "python" in turn. This, however, is not yet implemented, so import ijson simply defaults to yajl 1.x (and fails on Ubuntu 12.10 Beta, which ships yajl2 by default).
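A fallback of that kind could be sketched roughly as follows. This is only an illustration, not ijson's code: the helper name first_available is made up, the module paths for "yajl2" and "python" are assumed by analogy with ijson.backends.yajl, and catching OSError (for a ctypes wrapper that cannot load its shared library) is an assumption as well.

```python
import importlib


def first_available(module_names):
    """Return the first module in module_names that imports cleanly."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except (ImportError, OSError):
            # ImportError: the module is missing.
            # OSError: a ctypes wrapper failed to load its shared library
            # (assumed failure mode, not confirmed for ijson).
            continue
    raise ImportError("no usable backend among: " + ", ".join(module_names))


# Hypothetical preference order, following the text above.
BACKENDS = [
    "ijson.backends.yajl2",
    "ijson.backends.yajl",
    "ijson.backends.python",
]
```

With ijson installed, first_available(BACKENDS) would return whichever backend module loads first, and could then be used in place of the explicit import.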

Speed

Turning the old pure Python branch into a backend inspired me to rerun some performance tests that I did a year ago. Since this time I used a larger data sample and a modified test script, the results aren't directly comparable to the old ones. I was interested in one thing in particular: how the pure Python parser running under PyPy compares to the yajl-based parser running under CPython, the latter being the most obvious setup currently.

A year ago they were on par. Now, running the same old code under the new PyPy 1.9 turns out to be significantly faster:

CPython/yajl1	0.74 sec
PyPy/python	0.47 sec

Then I spent some quality time with the parser:

unifying and simplifying lexing, getting rid of rewinds and of reading words like "false" in two passes
rewriting lexing of strings to not stumble over every backslash, which helped greatly for \u-encoded non-English text
trimming the internal buffer less frequently, which saves the memory copy that each trim incurs
and then simplifying lexing some more
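To give a feel for what single-pass lexing without rewinds can look like, here is a minimal sketch. It is not the parser's actual code: the pattern and the function name lex are made up for illustration, and it assumes the whole text is already in memory, whereas the real parser reads its input in chunks.

```python
import re

# Hypothetical token pattern with three alternatives:
#   1. a JSON string, where a backslash and the character after it are
#      consumed as a pair, so an escaped quote never ends the string early
#   2. a run of word or number characters ("false", "-1.5e3", ...)
#   3. any other single non-space character (braces, colons, commas)
TOKEN_RE = re.compile(r'"(?:[^"\\]|\\.)*"|[a-zA-Z0-9.+-]+|\S')


def lex(text):
    """Yield raw lexemes from a complete JSON text in a single pass."""
    for match in TOKEN_RE.finditer(text):
        yield match.group()
```

Words like "false" come out as one lexeme in one step, and strings are matched whole no matter how many backslashes they contain.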

The main result is that the code is now much simpler and, under PyPy, faster too, grinding through those 20000 objects in about half the time the C library takes:

CPython/yajl1	0.74 sec
PyPy/python	0.38 sec

Yay!

An obligatory disclaimer: ijson is not going to become the fastest way to parse JSON in Python, since iterative parsing will always be at a disadvantage compared to the traditional approach of loading all data into memory. However, ijson does scale better when you're doing a lot of parsing in parallel, because each parser process needs a smaller, constant amount of memory.

Acknowledgments

I'd like to express my sincere gratitude to Alexander Saltanov for bringing me back to working on ijson and to Douglas Crockford for creating a syntax that is so simple to parse.
