It's a funny thing when, after neglecting your project for a year, you get a question on whether it's orphaned and then suddenly find yourself hacking on it for a few days straight… Knowing that your work is needed and appreciated is the greatest motivator!
Anyway… The news I wanted to share is that you can now use ijson with different parsing backends and there's a sizable speed-up when running under PyPy.
Originally ijson was a ctypes wrapper around yajl, which some time ago reached its next major version, introducing incompatible API changes. One possible thing to do was to simply switch ijson to the new yajl 2.x API, but I wanted to keep it working on current Ubuntu systems, which only ship with yajl 1.x. Instead I refactored the library to have several backends so it can support both versions of yajl. The backend system has also neatly accommodated the experimental pure Python parser that used to live in a separate branch, lost and forgotten.
To use a specific backend you import it explicitly:
    import ijson.backends.yajl as ijson
    ijson.parse(...)
Also you can still just

    import ijson

which should intelligently find the best backend for the current environment, going through "yajl2", "yajl" and "python". This, however, is not yet implemented, so

    import ijson

just defaults to yajl 1.x (and fails at it on Ubuntu 12.10 Beta, which has yajl2 by default).
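Since the automatic selection isn't there yet, here is a rough sketch of what such a fallback could look like. It is not ijson's actual code: the function name find_backend is made up for illustration, the "ijson.backends." module prefix is inferred from the explicit import shown earlier, and it assumes that a backend which can't be used fails with ImportError when imported.

```python
import importlib

# Candidate backends, best first. The names come from the post; the
# "ijson.backends" package prefix is inferred from the explicit import.
BACKENDS = ("yajl2", "yajl", "python")


def find_backend(names=BACKENDS, package="ijson.backends"):
    """Return the first backend module that imports cleanly (hypothetical helper)."""
    for name in names:
        try:
            return importlib.import_module(package + "." + name)
        except ImportError:
            # Assumption: an unusable backend raises ImportError on import,
            # so we simply move on to the next candidate.
            continue
    raise ImportError("no usable backend among: " + ", ".join(names))
```

With something like this in place, calling code could do `ijson = find_backend()` and then use `ijson.parse(...)` the same way regardless of which backend was picked.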
Tweaking the old pure Python branch into a backend inspired me to rerun some performance tests that I did a year ago. Since this time I used a larger data sample and a modified test script, the results aren't directly comparable to the old ones. I was interested in one thing in particular: how the pure Python parser running under PyPy compares to the yajl-based parser running under CPython, the latter being the most obvious setup currently.
A year ago they were on par. Now, running the same old code under the new PyPy 1.9 turns out to be significantly faster:
Then I spent some quality time with the parser:
- unified and simplified lexing, getting rid of rewinds and of reading words like "false" in two passes
- rewrote lexing of strings to not stumble over every backslash, which helped greatly for \u-encoded non-English text
- trimmed the internal buffer less frequently, avoiding the memory copy that each trim incurs
- and then simplified lexing some more
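To illustrate the string lexing point, here is a minimal sketch of one way to avoid stopping at every backslash. It is not the parser's actual code, and find_string_end is a hypothetical name: the idea is to jump straight from one double quote to the next and only then check whether that quote is escaped.

```python
def find_string_end(buf, start):
    """Return the index of the closing quote of the JSON string whose
    opening quote is at buf[start], or -1 if the buffer ends first."""
    pos = start + 1
    while True:
        end = buf.find('"', pos)  # jump directly to the next quote
        if end == -1:
            return -1  # string continues past the buffered data
        # Count the backslashes immediately before the quote.
        backslashes = 0
        i = end - 1
        while i > start and buf[i] == "\\":
            backslashes += 1
            i -= 1
        if backslashes % 2 == 0:
            return end  # an even count means the quote is not escaped
        pos = end + 1  # escaped quote, keep searching
```

A string full of \u escapes contains no quotes until its very end, so the scan covers it in a single find call instead of pausing at each backslash.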
The main result is that the code is now much simpler, and under PyPy it is faster too, grinding through those 20000 objects in almost half the time the C library takes:
I'd like to express my sincere gratitude to Alexander Saltanov for bringing me back to working on ijson and to Douglas Crockford for creating a syntax that is so simple to parse.