Today I've come upon a very interesting development in the story of optimizing the pure Python version of ijson. The things as I left them yesterday were like this:
Original yajl wrapper | 0.47 sec |
CPython | 0.84 sec |
PyPy | 1.30 sec |
These are the times for parsing a JSON array of 10000 objects. The parser reads the input stream into plain str buffers and uses regexps to search for lexemes.
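To illustrate the regexp-based approach, here is a simplified sketch of what scanning a buffer for lexemes can look like (my own illustration, not the actual ijson lexer; it skips details like \uXXXX escapes):

```python
import re

# One pattern covering the lexeme classes a JSON scanner cares about:
# punctuation, literals, numbers and strings (simplified).
LEXEME = re.compile(r'[{}\[\],:]|true|false|null|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|"(?:[^"\\]|\\.)*"')

def lexemes(buf):
    # Scan the buffer, collecting lexemes and skipping whitespace in between.
    out = []
    pos = 0
    while True:
        match = LEXEME.search(buf, pos)
        if match is None:
            break
        out.append(match.group())
        pos = match.end()
    return out
```

The appeal of this scheme is that the inner loop of skipping characters happens inside the regexp engine, in C, rather than in interpreted Python.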
Low-hanging fruit
While tweaking various parameters I soon found out that simply shrinking the buffer size from 64K to 4K gave some win under PyPy (though not under CPython): it averaged around 1.10 sec.
It's not much but it came cheap.
Warming it up
Then I noticed something more interesting. On a sample of 50000 items, as opposed to 10000, exactly the same code suddenly became faster on PyPy than on CPython! This got me thinking, and my first hypothesis was that it was caused by PyPy's JIT. I don't know much about JITing but I do know that it detects "hot" code paths at runtime and compiles them into fast native code on-the-fly (to me it sounds like it employs some pretty illegal magic). This means that code under a JIT needs some warm-up time to get up to speed. So it seemed that given more data, PyPy had more time to gather statistics and eventually gave better results.
I rewrote my testing script to do separate warm-up runs, parsing three times in a row, and only then measure:
import sys
from datetime import datetime

import ijson

def parse(filename):
    count = 0
    for event in ijson.parse(open(filename)):
        if event[1] == 'start_map':
            count += 1
    return count

# ...

print 'Warming up...'
for i in range(3):
    parse(sys.argv[1])
    print '.'

start = datetime.now()
print 'Objects:', parse(sys.argv[1])
print 'Time:', datetime.now() - start
Here's the result:
Original yajl wrapper | 0.48 sec |
CPython | 0.89 sec |
PyPy | 0.47 sec |
Now those who wanted some easy "proof" that PyPy was cool may go and boast that it's faster than C!
Some thoughts
Reality is of course a bit more complicated.
I use my own yajl wrapper as a reference only because it's the closest thing to test against. But it's completely different code: it does more things and I'm sure it covers more corner cases. Also, ctypes inevitably hinders the performance of the C library itself.
But what we can safely assert is that in the real world it is possible to iteratively parse JSON using pure Python code under PyPy in time comparable to a popular C implementation.
What's even more interesting is that PyPy's JIT adds another dimension to the comparison. You can't simply say that some code is slower or faster on PyPy than on CPython, because depending on whether you run it as a command-line script or as part of a long-lived web server process, the results may differ completely. This can't be simplified away; it's just another thing you have to be aware of as an engineer.
What's next
- Andrey Popp suggested playing with reusing a single allocated buffer instead of reallocating a new one each time.
- I still have in mind to replace regexp lexing with sequential search within a string.
- I want to isolate the performance problems of bytearray into a simple test case and file a bug in PyPy.
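For the second item, a sequential search could replace the regexp roughly along these lines (a hypothetical helper for illustration, not the actual ijson code):

```python
def readuntil_seq(buf, pos, terminators):
    # Scan character by character instead of using a compiled regexp;
    # stop at the first character found in `terminators`.
    for i in range(pos, len(buf)):
        if buf[i] in terminators:
            return buf[pos:i], i
    return buf[pos:], len(buf)
```

For example, scanning a string body would pass '"\\' as the terminators. Whether this beats the regexp engine depends heavily on the interpreter: under CPython the per-character loop is usually slower, but a JIT may compile it into a tight native loop.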
Comments: 7
You can do parsing with an FSM to tokenize the input and then use these tokens to reassemble the content of the JSON file; however, I'm not sure which implementation would be faster.
BTW, how is it implemented in yajl?
Actually, I don't even have a proper lexing-parsing pipeline. What I mean by "lexing" is "looking for the first interesting symbol in a stream". JSON is simple enough that it ended up being just two functions: nextchar() returns the first non-whitespace symbol, and readuntil() returns everything up to a terminator: ["\\] for strings and [^0-9.] for numbers. I have a suspicion that a proper pipeline might indeed be slower… As for yajl, I didn't look at it in detail, but it looks like it is written properly, with lexing and parsing.
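A rough sketch of what those two helpers might look like, reconstructed from the description above (not ijson's actual code):

```python
import re

# Terminator patterns as described above.
STRING_TERMINATOR = re.compile(r'["\\]')
NUMBER_TERMINATOR = re.compile(r'[^0-9.]')
NON_WHITESPACE = re.compile(r'\S')

def nextchar(buf, pos):
    # Index of the first non-whitespace symbol at or after pos, or -1.
    match = NON_WHITESPACE.search(buf, pos)
    return match.start() if match else -1

def readuntil(buf, pos, terminator):
    # Everything from pos up to (but not including) the first
    # terminator match, plus the position where scanning stopped.
    match = terminator.search(buf, pos)
    end = match.start() if match else len(buf)
    return buf[pos:end], end
```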
What about running yajl under PyPy? Their ctypes should be faster than CPython's ctypes, interesting to see if there is any difference in your case.
In the Java world, JVM warm-up is a common performance testing technique. An even worse thing about incorrect performance testing in a JIT environment is that your results can include the runtime compilation time itself.
A proper pipeline implemented in a compiled language should be faster than any regexp implementation, I think.
Seems so :) The code isn't exactly rocket science; it could be fun to try porting it to Python.
BTW, your engine parses GitHub URLs kinda oddly.