Today I've come upon a very interesting development in the story of optimizing the pure Python version of ijson. The state of things, as I left them yesterday, was like this:
| Original yajl wrapper | 0.47 sec |
These are the times for parsing a JSON array of 10000 objects. The parser reads the input stream into plain str buffers and uses regexps to search for lexemes.
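This isn't ijson's actual code, but the scheme is roughly this (the lexeme pattern, the names and the buffer-refill logic here are my own assumptions for illustration): read the stream in fixed-size chunks into a str buffer and let a compiled regexp locate the next lexeme, refilling the buffer when a match might be cut off at its end.

```python
import re

# Hypothetical lexeme pattern: strings, numbers, literals, punctuation.
# ijson's real pattern differs; this only illustrates the scheme.
LEXEME_RE = re.compile(
    r'"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?|true|false|null|[\[\]{},:]')

BUF_SIZE = 4 * 1024  # the 4K buffer that turned out to work well under PyPy

def lexemes(f, buf_size=BUF_SIZE):
    buf = f.read(buf_size)
    pos = 0
    while True:
        match = LEXEME_RE.search(buf, pos)
        # A match that runs up to the end of the buffer may be truncated
        # (e.g. half a number), so read more data and retry before yielding.
        if match is None or match.end() == len(buf):
            chunk = f.read(buf_size)
            if not chunk:
                if match is not None:
                    yield match.group()
                return
            buf = buf[pos:] + chunk
            pos = 0
            continue
        yield match.group()
        pos = match.end()
```

Shrinking `buf_size` changes how often the regexp scans a freshly concatenated buffer, which is exactly the knob discussed below.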
While tweaking various parameters I soon found out that simply shrinking the buffer size from 64K to 4K gave some win under PyPy (though not under CPython): it averaged around 1.10 sec.
It's not much but it came cheap.
Warming it up
Then I noticed something more interesting. On a sample of 50000 items, as opposed to 10000, exactly the same code suddenly became faster on PyPy than on CPython! This got me thinking, and my first hypothesis was that it was caused by PyPy's JIT. I don't know much about JITing, but I do know that it detects "hot" code paths at runtime and compiles them into fast native code on the fly (to me it sounds like it employs some pretty illegal magic). This means that code under a JIT needs some warm-up time to get up to speed. So it seemed that, given more data, PyPy had more time to gather statistics and eventually gave better results.
I rewrote my testing script to do separate warm-up runs, parsing the file three times in a row, and only then measure.
```python
def parse(filename):
    count = 0
    for event in ijson.parse(open(filename)):
        if event == 'start_map':
            count += 1
    return count

# ...

print 'Warming up...'
for i in range(3):
    parse(sys.argv[1])
    print '.'

start = datetime.now()
print 'Objects:', parse(sys.argv[1])
print 'Time:', datetime.now() - start
```
Here's the result:
| Original yajl wrapper | 0.48 sec |
Now those who wanted some easy "proof" that PyPy was cool may go and boast that it's faster than C!
Reality is of course a bit more complicated.
I use my own yajl wrapper as a reference only because it's the closest thing to test against. But it's entirely different code: it does more things, and I'm sure it covers more corner cases. Also, going through ctypes inevitably hinders the performance of the C library itself.
But what we can safely assert is that in the real world it is possible to iteratively parse JSON using pure Python code under PyPy in time comparable to a popular C implementation.
What's even more interesting is that PyPy's JIT adds another dimension to the comparison. You cannot simply say whether some code is slower or faster on PyPy than on CPython, because depending on whether you run it as a command-line script or as part of a long-lived web server process, the results may differ completely. It cannot be simplified away; it's just another thing you have to be aware of as an engineer.
- Andrey Popp suggested playing with reusing a single allocated buffer instead of reallocating a new one each time.
- I still have in mind replacing regexp lexing with sequential search within a string.
- I want to isolate the performance problems of bytearray into a simple test case and file a bug in PyPy.
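The sequential-search idea from the second bullet could look something like this (my own guess at the approach, not ijson's eventual code): skip whitespace by hand, dispatch on the first character, and use str.find to locate the end of the lexeme, with no regexp machinery involved.

```python
def next_lexeme(buf, pos):
    """Return (lexeme, new_position), or (None, pos) when buf is exhausted.

    A simplified sequential-search lexer; handling of lexemes cut off at
    the end of the buffer is omitted for brevity.
    """
    # Skip whitespace between lexemes.
    while pos < len(buf) and buf[pos] in ' \t\r\n':
        pos += 1
    if pos >= len(buf):
        return None, pos
    char = buf[pos]
    if char in '[]{},:':
        return char, pos + 1
    if char == '"':
        end = pos
        while True:  # find the closing quote, skipping escaped ones
            end = buf.find('"', end + 1)
            if buf[end - 1] != '\\':
                return buf[pos:end + 1], end + 1
    # Numbers, true, false, null: scan up to the next terminator.
    end = pos
    while end < len(buf) and buf[end] not in ' \t\r\n[]{},:':
        end += 1
    return buf[pos:end], end
```

Whether this actually beats a compiled regexp under PyPy's JIT is exactly the kind of thing that has to be measured, not assumed.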