Today I've come upon a very interesting development in the story of optimizing the pure Python version of ijson. The things as I left them yesterday were like this:
Original yajl wrapper | 0.47 sec |
CPython | 0.84 sec |
PyPy | 1.30 sec |
These are the times for parsing a JSON array of 10000 objects. The parser reads the input stream into plain str buffers and uses regexps to search for lexemes.
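To illustrate the regexp-based approach, here is a simplified sketch of what scanning a buffer for lexemes can look like (my own illustration, not the actual ijson lexer; it skips details like \uXXXX escapes):

```python
import re

# One pattern covering the lexeme classes a JSON scanner cares about:
# punctuation, literals, numbers and strings (simplified).
LEXEME = re.compile(r'[{}\[\],:]|true|false|null|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|"(?:[^"\\]|\\.)*"')

def lexemes(buf):
    # Scan the buffer, collecting lexemes and skipping whitespace in between.
    out = []
    pos = 0
    while True:
        match = LEXEME.search(buf, pos)
        if match is None:
            break
        out.append(match.group())
        pos = match.end()
    return out
```

The appeal of this scheme is that the inner loop of skipping characters happens inside the regexp engine, in C, rather than in interpreted Python.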
Low-hanging fruit
While tweaking various parameters I soon found out that simply shrinking the buffer size from 64K to 4K gave some win under PyPy (though not under CPython): it averaged around 1.10 sec.
It's not much but it came cheap.
Warming it up
Then I noticed something more interesting. On a sample of 50000 items, as opposed to 10000, exactly the same code suddenly became faster on PyPy than on CPython! This got me thinking, and my first hypothesis was that it was caused by PyPy's JIT. I don't know much about JITing but I do know that it detects "hot" code paths at runtime and compiles them into fast native code on-the-fly (to me it sounds like it employs some pretty illegal magic). This means that code under a JIT needs some warm-up time to get up to speed. So it seemed that given more data, PyPy had more time to gather statistics and eventually gave better results.
I rewrote my testing script to do separate warm-up runs, parsing three times in a row, and only then measure:
import sys
from datetime import datetime

import ijson

def parse(filename):
    count = 0
    for event in ijson.parse(open(filename)):
        if event[1] == 'start_map':
            count += 1
    return count

# ...

print 'Warming up...'
for i in range(3):
    parse(sys.argv[1])
    print '.'

start = datetime.now()
print 'Objects:', parse(sys.argv[1])
print 'Time:', datetime.now() - start
Here's the result:
Original yajl wrapper | 0.48 sec |
CPython | 0.89 sec |
PyPy | 0.47 sec |
Now those who wanted some easy "proof" that PyPy was cool may go and boast that it's faster than C!
Some thoughts
Reality is of course a bit more complicated.
I use my own yajl wrapper as a reference only because it's the closest thing to test against. But it's completely different code: it does more things and I'm sure it covers more corner cases. Also, ctypes inevitably hinders the performance of the C library itself.
But what we can safely assert is that in the real world it is possible to iteratively parse JSON using pure Python code under PyPy in time comparable to a popular C implementation.
What's even more interesting is that PyPy's JIT adds another dimension to the comparison. You can't simply say that some code is slower or faster on PyPy than on CPython, because depending on whether you run it as a command-line script or as part of a long-lived web server process, the results may differ completely. This can't be simplified away; it's just another thing you have to be aware of as an engineer.
What's next
- Andrey Popp suggested playing with reusing a single allocated buffer instead of reallocating a new one each time.
- I still have in mind to replace regexp lexing with sequential search within a string.
- I want to isolate the performance problems of bytearray into a simple test case and file a bug in PyPy.
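For the second item, a sequential search could replace the regexp roughly along these lines (a hypothetical helper for illustration, not the actual ijson code):

```python
def readuntil_seq(buf, pos, terminators):
    # Scan character by character instead of using a compiled regexp;
    # stop at the first character found in `terminators`.
    for i in range(pos, len(buf)):
        if buf[i] in terminators:
            return buf[pos:i], i
    return buf[pos:], len(buf)
```

For example, scanning a string body would pass '"\\' as the terminators. Whether this beats the regexp engine depends heavily on the interpreter: under CPython the per-character loop is usually slower, but a JIT may compile it into a tight native loop.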
Comments: 7
You can do parsing with an FSM to tokenize the input and then use these tokens to reassemble the content of the JSON file; however, I'm not sure which implementation would be faster.
BTW, how is it implemented in yajl?
Actually, I don't even have a proper lexing-parsing pipeline. What I mean by "lexing" is "looking for the first interesting symbol in a stream". JSON is simple enough that it ended up being just two functions: nextchar() returns the first non-whitespace symbol, and readuntil() returns everything up to a terminator: ["\\] for strings and [^0-9.] for numbers. I have a suspicion that a proper pipeline might indeed be slower… As for yajl, I didn't look at it in detail, but it looks like it is written properly, with lexing and parsing.
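A rough sketch of what those two helpers might look like, reconstructed from the description above (not ijson's actual code):

```python
import re

# Terminator patterns as described above.
STRING_TERMINATOR = re.compile(r'["\\]')
NUMBER_TERMINATOR = re.compile(r'[^0-9.]')
NON_WHITESPACE = re.compile(r'\S')

def nextchar(buf, pos):
    # Index of the first non-whitespace symbol at or after pos, or -1.
    match = NON_WHITESPACE.search(buf, pos)
    return match.start() if match else -1

def readuntil(buf, pos, terminator):
    # Everything from pos up to (but not including) the first
    # terminator match, plus the position where scanning stopped.
    match = terminator.search(buf, pos)
    end = match.start() if match else len(buf)
    return buf[pos:end], end
```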
What about running yajl under PyPy? Their ctypes should be faster than CPython's ctypes, interesting to see if there is any difference in your case.
In the Java world, JVM warm-up is a common performance testing technique. An even worse thing about incorrect performance testing in a JIT environment is that your results can include the runtime compilation time itself.
A proper pipeline implemented in a compiled language should be faster than any regexp implementation, I think.
Seems so :) The code isn't exactly rocket science; it could be fun to try porting it to Python.
BTW, your engine parses GitHub URLs kinda oddly.