I happen to follow Alex Gaynor on Twitter, and his ravings about the speed and general awesomeness of PyPy inspired me to run a small experiment. I have an iterative JSON parser, ijson, which is a ctypes-based wrapper around the C library yajl by Lloyd Hilaiel. I don't know of any pure Python iterative JSON parsers, so I decided to implement one and see how it compares to yajl in speed under PyPy.
It's worth noting that I intentionally left out all the popular choices for parsing JSON in Python, like the stdlib json and cjson, and focused solely on iterative parsing.
All test results in this article are based on this 5MB JSON array of 10000 objects.
I use an ad-hoc script, testjson.py, that can both generate the test sample and parse it:
- Generate a 10000-item array: `python testjson.py 10000 > test.json`
- Measure parsing time: `time python testjson.py < test.json`
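The script itself isn't reproduced here, but its generation half might be sketched roughly like this (the object shape and field names are my assumptions for illustration, not the actual testjson.py):

```python
import json
import sys

def generate(n):
    # Build a JSON array of n small objects; the exact shape of each
    # object here is made up for illustration.
    items = [{"id": i, "name": "item %d" % i, "value": i * 1.5}
             for i in range(n)]
    return json.dumps(items)

if __name__ == '__main__' and len(sys.argv) > 1:
    # "python testjson.py 10000 > test.json" writes the sample to stdout.
    sys.stdout.write(generate(int(sys.argv[1])))
```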
Tests are run on 64-bit Ubuntu under Python 2.7 and PyPy 1.6.
First, I have to confess that I don't know how to write parsers in general. I successfully slept through most of my university course on formal languages, so please don't kill me for this code!
The parser implementation consists of everything from the beginning of parse.py up to basic_parse, which, along with everything below it, is ijson's public interface. The core parser function is parse_value, a Python iterator that reads an input stream and yields events like:
```
('start_map', None)
('map_key', 'array')
('start_array', None)
('number', 1)
...
```
parse_value recognizes scalar JSON values (numbers, strings, and the boolean and null constants) and delegates parsing of container values to parse_object and parse_array. The latter two call parse_value recursively for their members.
The naive thing about this code is that it reads the input stream byte by byte: you can spot read(1) all over the place. From the beginning I suspected this might be really slow, but first I needed a correctly working parser that I could speed up later.
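The actual code lives in parse.py; to illustrate the general shape of such a parser, here is a minimal sketch in the same byte-at-a-time style (simplified on purpose: no string escapes, no exponents, no error handling, and the method names are mine, not ijson's):

```python
import io

class ByteParser:
    """Naive event-yielding JSON parser that calls read(1) for every
    single character. A simplified sketch: no string escapes,
    exponents, or error handling."""

    def __init__(self, f):
        self.f = f
        self.held = ''  # one-character pushback buffer

    def next(self):
        if self.held:
            c, self.held = self.held, ''
            return c
        return self.f.read(1)

    def next_nonspace(self):
        c = self.next()
        while c.isspace():
            c = self.next()
        return c

    def read_string(self):
        # The opening quote has already been consumed.
        chars = []
        c = self.next()
        while c != '"':
            chars.append(c)
            c = self.next()
        return ''.join(chars)

    def parse_value(self, c=None):
        c = c or self.next_nonspace()
        if c == '{':
            yield ('start_map', None)
            c = self.next_nonspace()
            while c != '}':
                yield ('map_key', self.read_string())
                self.next_nonspace()             # skip the ':'
                for event in self.parse_value():
                    yield event
                c = self.next_nonspace()         # ',' or '}'
                if c == ',':
                    c = self.next_nonspace()
            yield ('end_map', None)
        elif c == '[':
            yield ('start_array', None)
            c = self.next_nonspace()
            while c != ']':
                for event in self.parse_value(c):
                    yield event
                c = self.next_nonspace()
                if c == ',':
                    c = self.next_nonspace()
            yield ('end_array', None)
        elif c == '"':
            yield ('string', self.read_string())
        elif c in '-0123456789':
            num = c
            c = self.next()
            while c and c in '0123456789.':
                num += c
                c = self.next()
            self.held = c                        # push back the terminator
            yield ('number', float(num) if '.' in num else int(num))
        else:
            word = c
            while word not in ('true', 'false', 'null'):
                word += self.next()
            if word == 'null':
                yield ('null', None)
            else:
                yield ('boolean', word == 'true')
```

Even this toy version makes the problem visible: every character costs a Python-level method call before any parsing happens.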
Here are the results of running this code under Python and PyPy, compared to the original yajl-based code running under Python:

| Parser | Time |
| --- | --- |
| Original yajl wrapper | 0.47 sec |
To test the hypothesis that byte-by-byte reading is slow, I tried pre-loading the whole file into memory using StringIO, and this indeed made a huge difference:
| Configuration | Time |
| --- | --- |
| Python, reading from StringIO | 3.05 sec |
| PyPy, reading from StringIO | 1.67 sec |
Here, parsing speed under PyPy is actually pretty good, compared even to C code. But loading everything into memory is, of course, not a solution, since it kills the idea of iterative parsing.
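The pre-loading trick amounts to a one-line wrapper around the input stream (a sketch; the post targets Python 2, where this would be StringIO or cStringIO, while io.StringIO is the modern equivalent):

```python
import io

def preload(f):
    # Slurp the whole stream into memory once; every subsequent
    # read(1) then hits an in-memory buffer instead of the real file.
    return io.StringIO(f.read())
```

This keeps the parser code unchanged while moving all I/O to a single up-front read, which is exactly why it can't work for truly iterative parsing of large or endless streams.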
So I decided to implement some buffering inside the parser.
The code is here.
The difference is mostly in the Reader class, which wraps an input stream and reads it in chunks. It stores the current chunk in self.buffer and walks through it, keeping the current parsing position in self.pos. When the position counter reaches the end of the buffer, the buffer is replaced with a new chunk and the counter is reset.
Reader also implements two higher-level functions: nextchar(), which seeks the first non-whitespace character, and readuntil(), which reads everything up to some terminator. Both are implemented using regexps.
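Reconstructed from that description, the Reader might look roughly like this (a sketch: the chunk size, regexps, and exact method signatures are assumptions, not the actual code):

```python
import io
import re

class Reader:
    """Buffered reader sketch: pulls the stream in chunks and scans the
    current chunk with regexps. Names mirror the description above;
    details are assumed."""
    BUFSIZE = 16 * 1024
    NONSPACE = re.compile(r'\S')

    def __init__(self, f):
        self.f = f
        self.buffer = ''
        self.pos = 0

    def _fill(self):
        # Replace an exhausted buffer with the next chunk, resetting pos.
        if self.pos >= len(self.buffer):
            self.buffer = self.f.read(self.BUFSIZE)
            self.pos = 0
        return bool(self.buffer)

    def nextchar(self):
        # Seek the first non-whitespace character.
        while self._fill():
            match = self.NONSPACE.search(self.buffer, self.pos)
            if match:
                self.pos = match.end()
                return match.group()
            self.pos = len(self.buffer)
        return ''

    def readuntil(self, terminator):
        # Read everything up to (not including) a terminator regexp.
        result = []
        while self._fill():
            match = terminator.search(self.buffer, self.pos)
            if match:
                result.append(self.buffer[self.pos:match.start()])
                self.pos = match.end()
                return ''.join(result)
            result.append(self.buffer[self.pos:])
            self.pos = len(self.buffer)
        return ''.join(result)
```

The key point is that the regexp engine, not a Python loop, does the character scanning within each chunk.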
The result surprised me a lot!
| Configuration | Time |
| --- | --- |
| Python, buffering in bytearray | 1.10 sec |
| PyPy, buffering in bytearray | 89.50 sec (!!!) |
While a 3x speed-up under Python was pretty much expected, I really have no idea why it slows down so drastically under PyPy…
Update. On the advice of Andrey Popp I replaced the bytearray() buffer with a string and got results about 20% faster on Python and almost as fast on PyPy. So it appears PyPy has a thoroughly inefficient implementation of bytearray.
| Configuration | Time |
| --- | --- |
| Python, buffering in string | 0.84 sec |
| PyPy, buffering in string | 1.30 sec |
At this point I stopped and decided to get some advice.
Here are all the results again in one table for convenience:
| Implementation | Python 2.7 | PyPy 1.6 |
| --- | --- | --- |
| Original yajl wrapper | 0.47 sec | |
| Direct reading from input stream | | |
| Reading from StringIO | 3.05 sec | 1.67 sec |
| Internal buffering in bytearray with regexp search | 1.10 sec | 89.50 sec |
| Internal buffering in string with regexp search | 0.84 sec | 1.30 sec |
Right now I only have some vague ideas:

- Redesign the parser as an object that accepts chunks of data and calls callback functions. Maybe PyPy's JIT doesn't like iterators for some reason?
- Replace the regexp search for characters with a sequential in-Python search.
- Work with buffers at a lower level. I suspect the Python code loses much of its speed copying bytes in memory with things like `return ''.join(result)`. However, this idea goes dangerously close to the point where the Python code is so low-level that it's better to just write it in C.
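As an example of the second idea, a sequential in-Python replacement for the regexp-based whitespace search could look like this hypothetical helper (not code from the post):

```python
def nextchar_seq(buffer, pos):
    # Sequential scan for the first non-whitespace character,
    # returning (char, new_pos); '' means the buffer is exhausted.
    while pos < len(buffer) and buffer[pos] in ' \t\n\r':
        pos += 1
    if pos < len(buffer):
        return buffer[pos], pos + 1
    return '', pos
```

A plain per-character loop like this is exactly the kind of code a tracing JIT is supposed to compile well, which is what makes the experiment worth running on PyPy even though it would be a clear loss on CPython.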