Malcolm Tredinnick asked me to translate into English my Friday post about our experience of launching a Django project under high load. So here it is; I hope it will be useful :-)
I believe some context is also required. Yandex is the biggest Russian search engine and service portal, and it started looking into building its services with Django last year. I work there as the lead architect of an interesting project called "Where everyone goes". It's a social network for people organizing to spend some leisure time together: movies, art galleries, concerts, clubs, etc. It is by far our biggest Django project at Yandex. Last Thursday we crashed under a stream of traffic coming from a teaser on our mothership's index page. Now read on for the rest of the "war story"!
As many attentive readers of this blog noticed :-), the morning after we launched the teaser our wonderful service showed a strange face and refused to work until the teaser was removed. So yesterday and today we were trying to figure out what went wrong, and it looks like we've succeeded :-). I'm now in the interesting mental state of wondering why everything was so bad and hopeless on Thursday when now it all seems so simple and obvious... However, only the next teaser can prove us right.
First of all, some numbers for the sake of statistics. Though, as you'll see later, they don't mean that much.
We're running on a cluster of 4 machines (CPU Xeon, 4 cores, 2.3 GHz), each with lighttpd, Django and memcached. Behind them is a single DB server running MySQL. We handled our traffic well through the night and most of the morning, at about 55 requests/second per host. Then, between 10 and 11 a.m., things started to go downhill as the traffic grew to more than 300 requests/second per host, after which we've... oh, well :-(
Throughout Thursday the many participants in the process had many different ideas about the cause. But by midday today one of them had crystallized, and now I'm pretty sure it was the main reason for such bad performance.
The problem was with sessions (not with sessions per se, but with the way we used them). These are the perfectly standard sessions that ship with Django. And Django's sessions are generally made smart: although they are stored in the DB, they don't touch it at all unless your application writes something into them. Unfortunately, we used them in an interesting manner. We stored one-off user messages there, shown once and then removed. For that we had this code in a context processor:
```python
messages = request.session['messages']
request.session['messages'] = []  # ← a killer line!!!
```
The second line clears the messages by effectively writing to the session. And since it sits in a context processor, it was executed on every request.
Well, not exactly on every request, but only on those coming from a new user. However, since we had a teaser on Yandex's index page, practically all of our users were new.
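For comparison, here is a sketch of a write-avoiding variant of such a context processor (the names are illustrative, not our exact code): the session is only written on the rare requests that actually have pending messages, so ordinary requests never hit the DB.

```python
# A context processor that reads one-off messages from the session but
# only writes back (and thus touches the DB) when there was actually
# something to clear. Illustrative sketch, not our production code.
def messages(request):
    msgs = request.session.get('messages', [])
    if msgs:
        # The session write happens only when messages existed,
        # i.e. almost never, instead of on every request.
        request.session['messages'] = []
    return {'messages': msgs}
```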
One may, nevertheless, suppose that even at a rate of 1200 requests/second MySQL on good hardware should handle the task. But two particular features of the session table didn't let that happen:
If I understand it right, InnoDB physically stores table rows in primary-key order (the table is clustered on its primary key). This is evidently why lookups by key are fast. In the usual case of an auto-incremented integer key this causes no problems: a new record is appended at the end, right where it belongs anyway. But our key is random, so each new record was inserted somewhere in the middle, forcing the table to be restructured. The table is stored in pages, and such "chaotic" writes lead to page splits and fragmentation, which slows writing down more and more. Our admins say that at the worst moments a single write to the session table took about 6-7 seconds!
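A toy model (not InnoDB itself, just the idea) shows why random keys hurt so much: with rows kept in key order, sequential keys always append at the end, while random keys almost always land in the middle, where existing rows have to be shifted or split.

```python
import bisect
import random

def middle_inserts(keys):
    """Count inserts that do NOT land at the current end of key-ordered
    storage, i.e. that would force moving/splitting existing rows."""
    data = []
    middle = 0
    for key in keys:
        if bisect.bisect(data, key) != len(data):
            middle += 1          # insert falls somewhere in the middle
        bisect.insort(data, key)
    return middle

# Auto-increment style keys: every insert is a cheap append.
print(middle_inserts(range(1000)))  # → 0

# Random keys (like session hashes): nearly every insert lands mid-table.
random.seed(1)
print(middle_inserts(random.sample(range(10**6), 1000)))
```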
Finally, at some point new requests started arriving faster than we could handle them. All attempts to revive the service were therefore futile: the further we went, the slower it ran. This is what I meant by saying that the performance numbers don't mean much: it's useless to measure speed with such an anchor attached.
So we took two lessons from this (surely pretty obvious to many):

- don't write to the database on (almost) every request;
- don't use a random value as the primary key of an InnoDB table that takes heavy writes.
By the way... The bitter irony of the story is that we don't actually use this messages subsystem. The service was killed by a feature that didn't exist.
We brought the service back to life by moving sessions into an in-memory table (and also by removing the teaser :-) ). But the in-memory table didn't let sessions live long either: it quickly outgrew its storage limit, and in the evening the service crashed again. So we moved sessions back to disk. And now we've simply gotten rid of that "killer line" altogether.
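Later Django versions can also keep session data out of the relational tables entirely by backing sessions with the cache. A sketch of the settings for a memcached setup like ours (the address and engine choice are illustrative):

```python
# settings.py sketch: store sessions in memcached instead of the DB.
# 'cached_db' would be the variant that keeps a DB copy for persistence.
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'
```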
But the overloaded database let us see other problems. They aren't nearly as severe as the first one, but we'll fix them anyway, because sooner or later we'd run into them as our audience grows.
The problems we found were:
The "dog-pile" effect. Our index page consists mostly of cached blocks that are produced by relatively heavy queries. When a cache record goes stale, the next request triggers its regeneration. But until that regeneration completes, all subsequent requests execute the same heavy queries for the same regeneration. If the regeneration takes long enough and the requests are numerous enough, they create extra load that slows the regeneration down further, making the situation progressively worse.
We can't yet read from DB replicas, which would otherwise have relieved the master, busy as it was writing sessions, and would have let us hold out a little longer :-).
And one of the queries on the index page involved a join of four tables, one of which is the largest table in the database :-).
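The dog-pile effect described above is usually mitigated by making sure only one client regenerates a stale entry while everyone else keeps serving the stale value. Here is a sketch of that idea (not what we run in production) using an `add()`-based lock; the dictionary-backed cache class is just a stand-in for memcached-style `get`/`set`/`add`/`delete` semantics.

```python
import time

class DictCache:
    """Minimal in-process stand-in for a memcached-style cache."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        value, expires = self._data.get(key, (None, 0))
        return value if expires > time.time() else None
    def set(self, key, value, timeout):
        self._data[key] = (value, time.time() + timeout)
    def add(self, key, value, timeout):
        # Set only if absent; atomic in real memcached, which is what
        # makes it usable as a lock.
        if self.get(key) is None:
            self.set(key, value, timeout)
            return True
        return False
    def delete(self, key):
        self._data.pop(key, None)

cache = DictCache()

def cached(key, timeout, compute, grace=60):
    """Return a cached value, letting only one caller regenerate it.

    Entries are stored with a grace period past their soft expiry, so
    that while one caller recomputes, others serve the stale value
    instead of dog-piling onto the same heavy queries.
    """
    entry = cache.get(key)
    now = time.time()
    if entry is not None:
        value, soft_expiry = entry
        if now < soft_expiry:
            return value              # still fresh
        if not cache.add(key + ':lock', 1, 30):
            return value              # stale, but someone else is
                                      # already regenerating it
    # We hold the lock, or the cache was cold (a cold cache can still
    # stampede; a real implementation would lock here too).
    value = compute()
    cache.set(key, (value, now + timeout), timeout + grace)
    cache.delete(key + ':lock')
    return value
```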
Yesterday everything looked very bad: we were limited by the performance of database writes, which in the world of relational DBMSs means we can't scale out (i.e. by adding machines). We would have to rewrite things and artificially partition the database somehow, and that is hard and painful. But today, just by removing the session writes, we managed to turn the situation upside down. On a test stand of one frontend and one DB backend we can load the frontend up to a load average of 80 while the DB's load average stays at 2-2.5. That means the database is no longer our bottleneck: we can just add frontends, and I don't think we'll overload the database any time soon. We'll try to evaluate this more precisely on Tuesday, when we plan to stress-test a system of two frontends and one database.
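As for reading from replicas, the capability we were missing: Django later grew first-class support for this in the form of database routers. A hedged sketch of the idea (the alias names are illustrative and must match the `DATABASES` setting):

```python
import random

class ReplicaRouter:
    """Route reads to replicas and writes to the master.

    Illustrative sketch: 'replica1', 'replica2' and 'default' are
    assumed database aliases, not anything from our actual setup.
    """
    def db_for_read(self, model, **hints):
        return random.choice(['replica1', 'replica2'])

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        return True  # all aliases point at the same data
```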
To address these second-tier problems we plan, in particular, to:

- prevent the dog-pile effect on the heavy cached blocks, so that a stale entry is regenerated by only one process;
- set up reading from DB replicas to take load off the master;
- optimize the index-page query that joins four tables.
Watch for new episodes!