Wednesday, August 6, 2008
Tonight, just a few minutes after midnight, my server monitoring went haywire and sounded the "red alert" - which roughly means "the server is dead and doesn't respond to anything". And it truly was dead... well, almost: after waiting for about 50 seconds I could SSH into it and do the usual checks, starting with a look at "top" (which is sort of the Linux version of the Windows Task Manager). And top said that there were about 600 active processes - roughly half of them were Java virtual machines, the other half were apache2 threads - all waiting to get their slice of processing time, which caused the load on the server to shoot up to 163 (that's the highest load I've ever seen on a single machine that was not completely dead).
At first I thought someone was actively DoSing me, throwing crappy data uploads at the server faster than they could be checked and thrown away. But a quick check revealed that those masses of uploads were actually valid, and that they were coming from completely different IP addresses. So it was either a botnet, or... Blizzard! I suddenly remembered that there was some strange text on the login screen yesterday, speaking of "extended maintenance downtime" starting at - you guessed it - midnight. Usually maintenance downtime starts at 3 AM, so there were probably still quite a few people playing tonight when the servers went down. The sudden disconnect of thousands of people who had the automatic update feature of the MobMapUpdater activated then caused a massive surge of uploads exactly at midnight. So it was some kind of DDoS attack in the end ;-)
Maybe it wasn't such a good idea to spawn a new JVM for every single upload, each of which only exists for a split second to run some structural verification checks on the uploaded XML data. Those JVM startups probably produce quite a bit of overhead - resulting in hundreds of processes waiting for execution time when many uploads arrive simultaneously. I might have to see if I can devise a solution that scales better, maybe by creating a permanently running verification service that gets called by the PHP upload script - but then again, this is usually not a problem when uploads are evenly distributed over the evening.
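To make the "permanently running verification service" idea a bit more concrete, here is a rough sketch of what its core could look like. Everything in it is an assumption for illustration only: the class name UploadVerifier, the pool size of 4, and the check itself, which merely tests XML well-formedness as a stand-in for the real structural checks. How the PHP upload script would talk to it (local socket, HTTP, a queue directory) is left open.

```java
import java.io.StringReader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: one long-lived JVM verifies all uploads,
// instead of one short-lived JVM per upload.
final class UploadVerifier {
    // Assumed cap on parallel checks. During a surge, extra uploads wait
    // in the pool's queue instead of becoming hundreds of processes.
    private static final int WORKERS = 4;

    private static final ExecutorService POOL =
        Executors.newFixedThreadPool(WORKERS, runnable -> {
            Thread worker = new Thread(runnable, "upload-verifier");
            worker.setDaemon(true); // do not keep the JVM alive on shutdown
            return worker;
        });

    // Stand-in for the real structural checks: only tests well-formedness.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Uploads are untrusted, so refuse DOCTYPE declarations.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal parse errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false; // any parser or configuration error counts as invalid
        }
    }

    // Queues one upload for verification on the shared worker pool.
    static Future<Boolean> submit(String xml) {
        Callable<Boolean> check = () -> isWellFormed(xml);
        return POOL.submit(check);
    }

    public static void main(String[] args) throws Exception {
        Future<Boolean> good = submit("<upload><mob id=\"1\"/></upload>"); // well-formed
        Future<Boolean> bad = submit("<upload><mob></upload>");            // unclosed <mob>
        System.out.println("good upload valid: " + good.get());
        System.out.println("bad upload valid: " + bad.get());
    }
}
```

The point of the fixed-size pool is that the JVM startup cost is paid once, and a midnight surge turns into a queue of pending checks rather than a wall of processes fighting for CPU time.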