I happen to follow Alex Gaynor on Twitter, and his ravings about the speed and general awesomeness of PyPy have inspired me to run a small experiment. I have an iterative JSON parser, ijson, which is a ctypes-based wrapper around the C library yajl by Lloyd Hilaiel. I don't know of any pure Python iterative JSON parsers, so I decided to implement one and see how it compares to yajl in speed under PyPy.

It's worth noting that I intentionally left out all the popular choices for parsing JSON in Python, like the stdlib json and cjson, and focused solely on iterative parsing.

Test case

All test results in this article are based on this 5MB JSON array of 10000 objects.

I use an ad-hoc script, testjson.py, that can be used both to generate the test sample and to parse it:
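
The script itself isn't reproduced in this article, but a minimal stand-in with the same interface might look like this (the shape of the generated objects is made up for the sketch, and the parsing half uses ijson's basic_parse described below):

import sys
import json
import random

def generate(count=10000):
    # dump a JSON array of `count` objects to stdout;
    # the object shape here is arbitrary, the real sample may differ
    data = [{'id': i, 'name': 'item %d' % i, 'value': random.random()}
            for i in range(count)]
    json.dump(data, sys.stdout)

def parse():
    # count the objects by watching for start_map events
    import ijson
    count = 0
    for event, value in ijson.basic_parse(sys.stdin):
        if event == 'start_map':
            count += 1
    print('parsed %d objects' % count)

if __name__ == '__main__':
    if sys.argv[1:] == ['generate']:
        generate()
    else:
        parse()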

Tests are run on 64-bit Ubuntu, under Python 2.7 and PyPy 1.6.

Naive approach

First, I have to confess that I don't know how to write parsers in general. I successfully slept through most of my University course on formal languages, so please don't kill me for this code!

The parser implementation consists of everything from the beginning of parse.py up to basic_parse, which, along with everything below it, is ijson's public interface. The core parser function is parse_value, a Python iterator that reads an input stream and yields events like:

('start_map', None)
('map_key', 'array')
('start_array', None)
('number', 1)
...

parse_value recognizes scalar JSON values (numbers, boolean and null constants) and delegates parsing of more complex data to parse_string, parse_object and parse_array. The latter two also call parse_value recursively.

The naive thing about this code is that it reads the input stream byte by byte: you can spot read(1) all over the place. From the beginning I suspected it might be really slow, but first I needed a correctly working parser that I could speed up later.
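
To make the shape of it concrete, here's a condensed sketch in the same spirit (not the actual ijson code: string escapes, error reporting and proper number parsing are all left out), with read(1) doing all the work:

class Chars(object):
    # One-character lookahead over a file-like object,
    # pulling bytes with read(1) just like the naive parser does
    def __init__(self, f):
        self.f = f
        self.head = f.read(1)

    def peek(self):
        return self.head

    def next(self):
        char, self.head = self.head, self.f.read(1)
        return char

    def skip_whitespace(self):
        while self.head.isspace():
            self.next()

def parse_string(c):
    assert c.next() == '"'
    result = []
    while c.peek() != '"':  # escapes like \" are ignored in this sketch
        result.append(c.next())
    c.next()                # consume the closing quote
    return ''.join(result)

def parse_scalar(c):
    result = []
    while c.peek() not in ',]} \t\r\n':  # '' (EOF) also stops the loop
        result.append(c.next())
    token = ''.join(result)
    if token == 'null':
        return ('null', None)
    if token in ('true', 'false'):
        return ('boolean', token == 'true')
    return ('number', float(token))     # ints come out as floats here

def parse_value(c):
    c.skip_whitespace()
    char = c.peek()
    if char == '"':
        yield ('string', parse_string(c))
    elif char == '{':
        c.next()
        yield ('start_map', None)
        while True:
            c.skip_whitespace()
            if c.peek() == '}':
                c.next()
                break
            yield ('map_key', parse_string(c))
            c.skip_whitespace()
            assert c.next() == ':'
            for event in parse_value(c):
                yield event
            c.skip_whitespace()
            if c.peek() == ',':
                c.next()
        yield ('end_map', None)
    elif char == '[':
        c.next()
        yield ('start_array', None)
        while True:
            c.skip_whitespace()
            if c.peek() == ']':
                c.next()
                break
            for event in parse_value(c):
                yield event
            c.skip_whitespace()
            if c.peek() == ',':
                c.next()
        yield ('end_array', None)
    else:
        yield parse_scalar(c)

if __name__ == '__main__':
    from StringIO import StringIO  # Python 2, matching the setup above
    for event in parse_value(Chars(StringIO('{"array": [1, 2], "x": null}'))):
        print(event)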

Here are the results of running this code on Python and PyPy, compared to the original yajl-based code running on Python:

Original yajl wrapper: 0.47 sec
Python: 3.49 sec
PyPy: 13.34 sec

To test the hypothesis of slow byte-by-byte reading, I tried pre-loading the whole file into memory using StringIO, and this indeed made a huge difference:

Python, reading from StringIO: 3.05 sec
PyPy, reading from StringIO: 1.67 sec

Here, parsing speed under PyPy is actually pretty good even compared to C code. But loading everything into memory is of course not a solution, since it kills the idea of iterative parsing.
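
For reference, the pre-loading trick amounts to a one-line change in the test driver, along these lines (reusing the hypothetical Chars wrapper from the sketch above):

import sys
from StringIO import StringIO  # Python 2, as used in these tests

# Slurp the whole document into memory first and parse from there:
# read(1) now hits an in-memory buffer instead of a real file
data = StringIO(sys.stdin.read())
for event in parse_value(Chars(data)):
    pass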

So I decided to implement some buffering inside the parser.

Buffering parser

The code is here.

The difference is mostly in the Reader class that wraps an input stream and reads it in chunks. It stores the current chunk in self.buffer and walks through it, keeping the current parsing position in self.pos. When the position counter reaches the end of the buffer, the Reader replaces the buffer with a new chunk and resets the counter.

The Reader also implements two higher-level functions: nextchar(), which seeks the first non-whitespace character, and readuntil(), which reads everything up to some terminator. Both are implemented using regexps.
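
A sketch of what such a Reader might look like (not ijson's actual code: the ensure() helper, the NON_WS name and the buffer size are inventions of this sketch, error handling is crude, and a terminator split across a chunk boundary is ignored; Python 2's re searches a bytearray directly):

import re

NON_WS = re.compile(r'\S')  # matches the first non-whitespace character

class Reader(object):
    def __init__(self, f, buf_size=16 * 1024):
        self.f = f
        self.buf_size = buf_size
        self.buffer = bytearray()
        self.pos = 0

    def ensure(self):
        # replace an exhausted chunk with a new one and reset the counter
        if self.pos >= len(self.buffer):
            self.buffer = bytearray(self.f.read(self.buf_size))
            self.pos = 0
        return len(self.buffer) > 0

    def nextchar(self):
        # seek the first non-whitespace character
        while self.ensure():
            match = NON_WS.search(self.buffer, self.pos)
            if match:
                self.pos = match.end()
                return match.group()
            self.pos = len(self.buffer)
        raise EOFError('unexpected end of input')

    def readuntil(self, terminator):
        # read everything up to a terminator given as a compiled regexp
        result = bytearray()
        while self.ensure():
            match = terminator.search(self.buffer, self.pos)
            if match:
                result += self.buffer[self.pos:match.start()]
                self.pos = match.end()
                return bytes(result)
            result += self.buffer[self.pos:]
            self.pos = len(self.buffer)
        raise EOFError('unexpected end of input')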

The result surprised me a lot!

Python, buffering in bytearray: 1.10 sec
PyPy, buffering in bytearray: 89.50 sec (!!!)

While a 3x speed-up under Python was pretty much expected, I really have no idea why it slows down so drastically under PyPy…

Update. On advice from Andrey Popp I replaced the bytearray() buffer with a string: the Python results got about 20% faster, and PyPy became almost as fast as Python. So it appears PyPy has some thoroughly inefficient implementation of bytearray():

Python, buffering in string: 0.84 sec
PyPy, buffering in string: 1.30 sec
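
In terms of the hypothetical Reader sketched above, the whole update is a change of the buffer's type: the refill rebinds self.buffer to the immutable string returned by read() instead of copying it into a bytearray, and nextchar()/readuntil() keep working unchanged since regexps search plain strings just as well:

    def ensure(self):
        # str variant: no mutable buffer at all, each refill simply
        # rebinds self.buffer (initialized to '') to the next chunk
        if self.pos >= len(self.buffer):
            self.buffer = self.f.read(self.buf_size)
            self.pos = 0
        return len(self.buffer) > 0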

And at this moment I've stopped and decided to get some advice.

What's next

Here are all the results again in one table for convenience:

Original yajl wrapper: 0.47 sec

Direct reading from input stream:
    Python: 3.49 sec
    PyPy: 13.34 sec

Reading from StringIO:
    Python: 3.05 sec
    PyPy: 1.67 sec

Internal buffering in bytearray with regexp search:
    Python: 1.10 sec
    PyPy: 89.50 sec

Internal buffering in string with regexp search:
    Python: 0.84 sec
    PyPy: 1.30 sec

Right now I only have some vague ideas.

Thoughts?

Comments: 18

  1. dark-est

    I successfully slept through most of my University course on formal languages.

    Same shit here.

    Good old Russian free education ;)

  2. Ivan Sagalaev

    The education system is not to blame. It was entirely my fault…

  3. Andrey Popp

    Why use a mutable byte array for the buffer when you're reallocating it anyway, even in nextchar()?

  4. nekto0n

    PyPy has its own, not so fast, regexp engine. There were also some troubles with stdio and buffering; I don't know if they were fixed in 1.6 or later.

    P.S. The PyPy folks are nice people and like good benchmarks, especially when PyPy is that slow. You could tell them about such drastic results :)

  5. Ivan Sagalaev

    Andrey Popp:

    Why use a mutable byte array for the buffer when you're reallocating it anyway, even in nextchar()?

    Good point. I used it in early iterations when the buffer was mutable. I'm not sure, though, that an immutable structure (a string?) will necessarily be faster. Gotta test…


    Ran the tests and updated the article. It's 20% faster on Python and completely removes this horrible drag on PyPy, making it just a little slower than Python. Which means we gotta go deeper :-).

  6. Ivan Sagalaev

    nekto0n:

    P.S. The PyPy folks are nice people and like good benchmarks, especially when PyPy is that slow. You could tell them about such drastic results :)

    Where's the best place to contact them? IRC? Mailing list? Maybe some Convore board? :-)

  7. nekto0n

    Where's the best place to contact them?

    You can always start with irc.freenode.net #pypy :)

  8. fijal

    #pypy on freenode is good. We usually sit there and respond to questions.

  9. seriyPS
    time python testjson.py < test.json
    

    Regarding this line: I've heard that PyPy needs more time to start up and initialize than CPython. So it's better to measure the time not with an external utility (time) but from inside the script itself, something like

    import sys
    import time
    import ijson

    count = 0
    start = time.time()
    for event in ijson.parse(sys.stdin):
        if event[1] == 'start_map':
            count += 1
    print time.time() - start
    

    or using timeit

  10. hailtjuppa

    You can start with bugs.pypy.org; they treat such benchmarks as bugs.

  11. Andrey Popp

    Good point. I used it in early iterations when the buffer was mutable. I'm not sure, though, that an immutable structure (a string?) will necessarily be faster. Gotta test…

    Actually, I don't think you need to implement input stream buffering yourself; Python's io module already has such functionality (see the docs at http://docs.python.org/library/io.html).

    But if I needed to implement it in my own code, I'd rather allocate a fixed-size buffer once (self.buffer = bytearray(BUFSIZE)) and then fill it with data from the stream (using file.readinto) when it's exhausted.
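
    Something like this, roughly (BUF_SIZE, f and process() are placeholders):

    BUF_SIZE = 16 * 1024
    buf = bytearray(BUF_SIZE)        # allocated once, reused for every chunk
    while True:
        length = f.readinto(buf)     # fill the same buffer in place
        if not length:
            break
        process(buf[:length])        # hand only the filled part downstream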

    While the things we're talking about don't relate specifically to PyPy or CPython, it's still interesting why PyPy got such a dramatic performance boost after eliminating bytearray for buffering. Can it be because of frequent allocations/deallocations of a mutable data structure? Can we measure the impact of GC on runtime?

  12. Ivan Sagalaev

    I need my own buffer because I run regexp searches on it for operations like "find the first non-whitespace symbol in the stream". And having said that, thanks for pointing me to readinto(), I was hoping something like it existed :-).

  13. Today I've come upon a very interesting development in the story of optimizing the pure Python version of ijson. The state of things as I left them yesterday was like this:

  14. nanonyme

    Did you end up trying pypy.rlib.rstring.StringBuilder as a buffer?

  15. Ivan Sagalaev

    No, I didn't use anything PyPy-specific. I plan to revisit this whole thing more closely at some point in the future and use it as a sample project to dive deeper into the inner workings of PyPy.

  16. nanonyme

    Also, FWIW, I was reading through ijson's code in the master branch, and it appears that the buffering code was refactored out completely (if I'm reading it correctly). I plan to benchmark against recent ijson myself as well, since in my use case the JSON input tends to be even more gigantic than the one here. :)

  17. Ivan Sagalaev

    The master branch doesn't have any of the work described here; it's a straight wrapper around yajl, which has its own buffering. If you plan to run it on PyPy, it's the buffered branch you want.

  18. Jamie

    This will always happen with PyPy. Look at the difference you had using a byte buffer on PyPy vs CPython. PyPy attempts to abstract away the need for a formal language like C underlying your Python code. Because of this, all operations in PyPy get processed much more, at a higher level. Knowing this, the slow result of the PyPy byte buffer makes sense: CPython interacts with the bytes "closer to the metal", whereas PyPy processes the bytes more. The strings come out about the same, though, because unlike bytes (which are primitive types) strings are language-specific objects, so both PyPy and CPython have to wrap every byte read from the stream into a Python string object.

    Bottom line: PyPy makes it easier to understand and work with Python libraries, because everything is Pythonic and not hidden away in compiled C on the back end. However, the trade-off is that Python is much, much slower than a compiled language like C.
