I happen to follow Alex Gaynor on Twitter, and his ravings on the speed and general awesomeness of PyPy have inspired me to run a small experiment. I have this iterative JSON parser — ijson — which is a ctypes-based wrapper around the C library yajl by Lloyd Hilaiel. I don't know of any pure Python iterative JSON parsers, so I decided to implement one and see how it compares to yajl in speed under PyPy.
It's worth noting that I intentionally left out all the popular choices for parsing JSON in Python, like the stdlib json and cjson, and focused solely on iterative parsing.
Test case
All test results in this article are based on this 5MB JSON array of 10000 objects.
I have an ad-hoc script, testjson.py, that can be used both to generate the test sample and to parse it:
- Generate a 10000-item array:
python testjson.py 10000 > test.json
- Measure parsing time:
time python testjson.py < test.json
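As a rough illustration (this is not the actual testjson.py; the generated object layout and the basic_parse import are assumptions), such a harness boils down to something like this:

```python
# Sketch of a generate/parse harness (illustrative, not the real testjson.py).
import sys
import json

def generate(count):
    # Write a JSON array of `count` small objects to stdout.
    sys.stdout.write('[')
    for i in range(count):
        if i:
            sys.stdout.write(', ')
        sys.stdout.write(json.dumps({'id': i, 'name': 'item %d' % i, 'value': i * 0.5}))
    sys.stdout.write(']\n')

def parse(f):
    # Drive the iterative parser and simply count the events it yields.
    from parse import basic_parse  # the parser discussed below
    return sum(1 for event in basic_parse(f))

if __name__ == '__main__':
    if len(sys.argv) > 1:
        generate(int(sys.argv[1]))   # python testjson.py 10000 > test.json
    else:
        print(parse(sys.stdin))      # time python testjson.py < test.json
```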
Tests are run on 64-bit Ubuntu, on Python 2.7 and PyPy 1.6.
Naive approach
First, I have to confess that I don't know how to write parsers in general. I successfully slept through most of my university course on formal languages, so please don't kill me for this code!
The parser implementation consists of everything from the beginning of parse.py up to basic_parse, which, along with everything below it, is ijson's public interface. The core parser function is parse_value, which is a Python iterator reading an input stream and yielding events like:
('start_map', None)
('map_key', 'array')
('start_array', None)
('number', 1)
...
parse_value recognizes scalar JSON values (numbers, boolean and null constants) and delegates parsing of more complex data to parse_string, parse_object and parse_array. The latter two also call parse_value recursively.
The naive thing about this code is that it reads the input stream byte by byte: you can spot read(1) all over the place. From the beginning I suspected it might be really slow, but first I needed a correctly working parser that I could speed up later.
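To make the byte-by-byte approach concrete, here is a heavily condensed, runnable sketch in the same spirit (this is not the actual parse.py: string escapes and float numbers are left out, and the one-character pushback is just a shortcut around the lookahead problem):

```python
# Condensed sketch of the naive approach: every character comes from a
# separate read(1) call, and the parser yields (event, value) tuples.

class OneCharReader(object):
    """read(1)-based reader with a single character of pushback."""
    def __init__(self, f):
        self.f = f
        self.pushed = ''

    def read1(self):
        if self.pushed:
            char, self.pushed = self.pushed, ''
            return char
        return self.f.read(1)

    def unread(self, char):
        self.pushed = char

    def nonspace(self):
        # Skip whitespace one character at a time.
        char = self.read1()
        while char.isspace():
            char = self.read1()
        return char

def parse_string(r):
    # The opening '"' is already consumed; no escape handling.
    chars = []
    char = r.read1()
    while char != '"':
        chars.append(char)
        char = r.read1()
    return ''.join(chars)

def parse_value(r):
    char = r.nonspace()
    if char == 'n':                              # null
        for _ in 'ull':
            r.read1()
        yield ('null', None)
    elif char in ('t', 'f'):                     # true / false
        for _ in ('rue' if char == 't' else 'alse'):
            r.read1()
        yield ('boolean', char == 't')
    elif char == '"':
        yield ('string', parse_string(r))
    elif char == '[':
        yield ('start_array', None)
        char = r.nonspace()
        while char != ']':
            r.unread(char)
            for event in parse_value(r):
                yield event
            char = r.nonspace()                  # ',' between items or ']'
            if char == ',':
                char = r.nonspace()
        yield ('end_array', None)
    elif char == '{':
        yield ('start_map', None)
        char = r.nonspace()
        while char != '}':
            yield ('map_key', parse_string(r))   # char was the opening '"'
            r.nonspace()                         # the ':' separator
            for event in parse_value(r):
                yield event
            char = r.nonspace()                  # ',' between pairs or '}'
            if char == ',':
                char = r.nonspace()
        yield ('end_map', None)
    else:                                        # integer number
        digits = char
        char = r.read1()
        while char.isdigit():
            digits += char
            char = r.read1()
        r.unread(char)                           # give back the delimiter
        yield ('number', int(digits))
```

Running it over the test file is then just a matter of iterating over parse_value(OneCharReader(open('test.json'))).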
Here are the results of running this code on Python and PyPy, compared to the original yajl-based code running on Python:
| Setup | Time |
|---|---|
| Original yajl wrapper | 0.47 sec |
| Python | 3.49 sec |
| PyPy | 13.34 sec |
To test the hypothesis that byte-by-byte reading is slow, I tried to pre-load the whole file into memory using StringIO, and this indeed made a huge difference:
| Setup | Time |
|---|---|
| Python, reading from StringIO | 3.05 sec |
| PyPy, reading from StringIO | 1.67 sec |
Here, parsing speed under PyPy is actually pretty good, even compared to the C code. But loading everything into memory is of course not a solution, since it kills the idea of iterative parsing.
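For reference, the pre-loading experiment amounts to something like this (a sketch, assuming basic_parse accepts any file-like object):

```python
# Pre-load the whole file into memory and feed the in-memory buffer to the
# same parser (defeats iterative parsing, but isolates the I/O cost).
from StringIO import StringIO   # Python 2; io.StringIO on Python 3

from parse import basic_parse

with open('test.json') as f:
    data = StringIO(f.read())   # the entire 5MB file now lives in memory

for event in basic_parse(data):
    pass
```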
So I decided to implement some buffering inside the parser.
Buffering parser
The code is here.
The difference is mostly in the Reader class, which wraps an input stream and reads it in chunks. It stores the current chunk in self.buffer and walks through it, keeping the current parsing position in self.pos. When the position counter reaches the end of the buffer, the buffer is replaced with a new chunk and the counter is reset.
The Reader also implements two higher-level functions: nextchar(), which seeks the first non-whitespace character, and readuntil(), which reads everything up to some terminator. Both are implemented using regexps.
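A condensed sketch of the idea (not the actual Reader from the buffered branch; the chunk size and regexps are illustrative, and for brevity the buffer here is a plain string rather than the bytearray used in this first version):

```python
import re

BUFSIZE = 16 * 1024
NONSPACE = re.compile(r'\S')

class Reader(object):
    def __init__(self, f):
        self.f = f
        self.buffer = f.read(BUFSIZE)
        self.pos = 0

    def refill(self):
        # Drop the consumed part of the buffer and pull in the next chunk.
        chunk = self.f.read(BUFSIZE)
        self.buffer = self.buffer[self.pos:] + chunk
        self.pos = 0
        return len(chunk) > 0

    def nextchar(self):
        # First non-whitespace character at or after the current position.
        while True:
            match = NONSPACE.search(self.buffer, self.pos)
            if match:
                self.pos = match.end()
                return match.group()
            if not self.refill():
                raise EOFError('unexpected end of input')

    def readuntil(self, pattern):
        # Everything from the current position up to (not including) the
        # first match of `pattern`, e.g. re.compile(r'["\\]') for strings.
        result = []
        while True:
            match = pattern.search(self.buffer, self.pos)
            if match:
                result.append(self.buffer[self.pos:match.start()])
                self.pos = match.end()
                return ''.join(result), match.group()
            result.append(self.buffer[self.pos:])
            self.pos = len(self.buffer)
            if not self.refill():
                raise EOFError('unexpected end of input')
```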
The result surprised me a lot!
| Setup | Time |
|---|---|
| Python, buffering in bytearray | 1.10 sec |
| PyPy, buffering in bytearray | 89.50 sec (!!!) |
While the 3x speed-up under Python was pretty much expected, I really have no idea why it slows down so drastically under PyPy…
Update. On the advice of Andrey Popp I replaced the bytearray() buffer with a string and got results about 20% faster on Python, and on PyPy almost as fast as on Python. So it appears PyPy has some thoroughly inefficient implementation of bytearray():
| Setup | Time |
|---|---|
| Python, buffering in string | 0.84 sec |
| PyPy, buffering in string | 1.30 sec |
And at this point I stopped and decided to get some advice.
What's next
Here are all the results again in one table for convenience:
| Approach | Python | PyPy |
|---|---|---|
| Original yajl wrapper | 0.47 sec | |
| Direct reading from input stream | 3.49 sec | 13.34 sec |
| Reading from StringIO | 3.05 sec | 1.67 sec |
| Internal buffering in bytearray with regexp search | 1.10 sec | 89.50 sec |
| Internal buffering in string with regexp search | 0.84 sec | 1.30 sec |
Right now I only have some vague ideas:
- Redesign the parser as an object accepting chunks of data and calling callback functions (see the sketch after this list). Maybe PyPy's JIT doesn't like iterators for some reason?
- Replace the regexp search for characters with a sequential in-Python search.
- Work with buffers on a lower level. I suspect that the Python code loses much speed on copying bytes in memory with things like return ''.join(result). However, this idea goes dangerously close to the point where the Python code is so low-level that it's better to just write it in C.
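For the first idea, the rough shape would be something like the following sketch. Only the chunk-feeding plumbing is shown; the event extraction itself (where parse_value's logic would move) is left as a placeholder:

```python
# Sketch of a push-style parser: the caller feeds chunks, the parser keeps
# its own buffer and state, and invokes a callback per event.
class PushParser(object):
    def __init__(self, callback):
        self.callback = callback    # called as callback(event, value)
        self.buffer = ''
        self.state = 'value'        # whatever state machine the parser needs

    def feed(self, chunk):
        self.buffer += chunk
        # Extract as many complete events as the buffered data allows;
        # leftover characters wait for the next feed() call.
        consumed = self._extract_events()
        self.buffer = self.buffer[consumed:]

    def _extract_events(self):
        # Placeholder: walk self.buffer, call self.callback(event, value)
        # for every complete event found, and return the number of
        # characters consumed.
        return 0

def parse_push(f, callback, bufsize=16 * 1024):
    parser = PushParser(callback)
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            break
        parser.feed(chunk)
```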
Thoughts?
Comments: 18
Exactly the same story with me.
Good old Russian free education;)
Education system is not to blame. It was entirely my fault…
Why use a mutable byte array for the buffer when you're reallocating it even in nextchar()?
PyPy has its own, not-so-fast regexp engine. There were also some troubles with stdio and buffering; I don't know if they were fixed in 1.6 or later.
P.S. PyPy folks are nice people and like good benchmarks, especially when PyPy is that slow. You can tell them about such drastic results :)
Andrey Popp:
Good point. I used it in early iterations when the buffer was mutable. I'm not sure, though, that an immutable structure (a string?) will necessarily be faster. Gotta test…
Ran the tests and updated the article. It's 20% faster on Python and completely removes this horrible drag on PyPy making it just a little slower than Python. Which means we gotta go deeper :-).
nekto0n:
Where's the best place to contact them? IRC? Mailing list? Some Convore board maybe? :-)
You can always start with irc.freenode.net #pypy :)
#pypy on freenode is good. We usually sit there and respond to questions.
Regarding this line: I seem to recall that PyPy needs more time for start-up/initialization than CPython. So it's better to measure the time not with an external utility (time) but directly in the script itself, with something like the following, or with timeit.
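A minimal illustration of that idea (assuming the parser is imported from parse.py):

```python
# Time only the parsing itself, so the interpreter's start-up cost is excluded.
import time
from parse import basic_parse

with open('test.json') as f:
    start = time.time()
    for event in basic_parse(f):
        pass
    print('parsing took %.2f sec' % (time.time() - start))
```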
You can start from bugs.pypy.org; they consider such benchmarks to be bugs.
Actually, I don't think you need to implement input stream buffering yourself; Python's io module already has such functionality (see docs at http://docs.python.org/library/io.html).
But if I needed to implement it in my own code, I would rather allocate a fixed-size buffer once — self.buffer = bytearray(BUFSIZE) — and then fill it with data from the stream (using file.readinto) when it is exhausted, roughly as sketched below.
While the things we're talking about don't relate specifically to PyPy or CPython, it's still interesting why PyPy got such a dramatic performance boost after eliminating the bytearray buffering. Could it be because of frequent allocations/deallocations of a mutable data structure? Can we measure the impact of GC on runtime?
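A small sketch of what that suggestion could look like (illustrative; BUFSIZE and the surrounding loop are assumptions):

```python
import io

# Fixed-size buffer allocated once and refilled in place with readinto(),
# instead of creating a new buffer object for every chunk.
BUFSIZE = 16 * 1024

def iter_chunks(f):
    buf = bytearray(BUFSIZE)
    while True:
        n = f.readinto(buf)         # fills buf in place, returns bytes read
        if not n:
            break
        yield buf[:n]               # only the filled part is meaningful

with io.open('test.json', 'rb') as f:   # readinto needs a binary-mode file
    total = sum(len(chunk) for chunk in iter_chunks(f))
    print('read %d bytes' % total)
```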
I need my own buffer because I run regexp searches on it for operations like "find the first non-whitespace character in the stream". Having said that, thanks for pointing me to readinto(), I was hoping something like that existed :-).
Did you end up trying pypy.rlib.rstring.StringBuilder as a buffer?
No, I didn't use anything PyPy specific. I plan to revisit this whole thing more closely at some point in the future and use it as a sample project to dive deeper into inner workings of PyPy.
Also, FWIW, I was reading through ijson's code in the master branch and it appears that the buffering code was refactored out completely (if I'm reading it correctly). I plan to benchmark against recent ijson myself as well, since for my use case JSON input tends to be even more gigantic than the one here. :)
The master branch doesn't have any of the work described here. It's a straight wrapper around yajl, which has its own buffering. If you plan to run it on PyPy, it's the buffered branch you want.
This will always happen with PyPy. Look at the difference you had using a byte buffer on PyPy vs CPython. PyPy attempts to abstract away the need for a formal language like C underlying your Python code. Because of this, all operations in PyPy get processed much more, at a higher level. Knowing this, the slow result of the PyPy byte buffer makes sense, because CPython is interacting with the bytes "closer to the metal", whereas PyPy is processing the bytes more. The strings come out about the same, though, because unlike bytes (which are primitive types) strings are language-specific objects. Therefore both PyPy and CPython have to wrap every byte read from the stream into a Python string object.
Bottom line: PyPy makes it easier to understand and work with Python libraries because everything is Pythonic and not hidden away in compiled C on the back end. However, the trade-off is that Python is much, much slower than a compiled language like C.