Starting Thursday 28 February to March 6 we had performance issues with our master databases. These problems caused data delivery delays for some of our API:s. During the delivery outage we updated our blog, but didn't go into specifics about the errors. In this post we will try to give a more detailed description and what we've done to avoid similar failures in the future.
For our data products, website search and our dynamic widgets, we use a MySQL backend to store all blog posts. After we have fetched a blog and found the posts, these are downloaded and put into a job queue. The documents are then inserted in the master databases.
We have two master databases and four slave servers. All new documents are written to the main databases and then replicated to the slave servers.
On the afternoon of 28 February we got a Nagios warning telling us that the queue was growing, we looked at the system and saw that we couldn't keep up with the all new blogposts being fetched. As we had noticed a similar decrease in the data insertion (without it affecting customers) the previous week, we monitored the system and waited to see how it would handle the load. Since we couldn't detect any hotspots in the system we started looking at the insertion service if there was some locking, but we couldn't find anything.
During the following days we continued debugging and noticed delays in the INSERT queries on the MySQL servers.
Graph showing queued jobs on one of our Kestrel nodes. Each item contains several blog posts.
ZFS & MySQL + dirty pages flush.
The master database servers are running MySQL with a ZFS file system, the databases are InnoDB and are storing a lot of rows (billions) in multiple sharded tables.
MySQL regularly flushes dirty pages to disk, to keep the memory clean.
ZFS has several space groups called metaslabs, each metaslab has a space map that describes that metaslab's free space. The space map is a log of allocations and deallocations. When ZFS needs to allocate space for a metaslab it loads the space log into an in-memory AVL tree, it then allocates new space by appending an allocation. During this process it also looks for allocations and deallocation that cancel each other out, to free more space in that metaslab.
What we saw was that InnoDB had big troubles flushing out the dirty pages to disk, but the disks weren't getting much I/O. After some dtracing we noticed that ZFS was spending a lot of time walking its AVL-trees looking for space.
Our theory is that our ZFS pool had gotten very fragmented and it couldn't handle allocating new space at the rate that InnoDB wanted. This caused extremely slow insert/update performance.
As we started to understand what was happening, we tweaked InnoDB to minimize the I/O during writes, and tried to optimize our ZFS pool. Since most InnoDB fs calls are synchronous, it had to wait until the filesystem had written the records to disk before returning. Due to the I/O congestion we looked into using SSDs as a separate log device in ZFS. The ZIL (ZFS Intent Log) is used to log changes to keep the filesystem consistent in case of a crash. ZFS has great support for moving the ZIL to separate disks to improve write performance.
We installed a pair of mirrored SSDs in each master server, and used these for the ZIL. This immensely improved performance for writes and updates to our MySQL server which gave it enough headroom to purge the dirty pages. After the upgrade we started up the blog insert services and monitored InnoDB dirty pages closely. The solution seemed solid and all was well ... or was it?
After a while we noticed that one of our master MySQL servers were significantly slower than the other. Some more debugging showed that one of the SSD:s was faulty. After changing that SSD to a new one the insertion speed increased to a more normal rate. And after some more close monitoring of MySQL we felt confident that we now had fixed the problem.
Since our database server upgrade we're confident that the system can handle this load without any similar problems.
We started a project just before Christmas to replace all these database servers with a new storage solution that will be more performant and easier to monitor. It will allow us to more easily handle the increased load and enable product development.
We're terribly sorry for the inconvenience that this may have caused our customers. We're working hard to improve our routines and infrastructure to minimize the occurrence of similar events in the future.