Delays in data delivery
Incident Report for Twingly
Postmortem

From Thursday 28 February to 6 March we had performance issues with our master databases. These problems caused data delivery delays for some of our APIs. During the outage we posted updates on our blog, but didn't go into specifics about the errors. In this post we will give a more detailed description of what happened and what we've done to avoid similar failures in the future.

Background

For our data products, the website search and our dynamic widgets, we use a MySQL backend to store all blog posts. After we have fetched a blog and found its posts, these are downloaded and put into a job queue. The documents are then inserted into the master databases.
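
To illustrate that last step, here is a minimal sketch of a queue consumer writing posts into sharded tables. The payload format, shard count and table names are hypothetical, not our actual implementation:

```python
# Hypothetical sketch of the insert step: take a job containing several blog
# posts and write each post to the shard its blog maps to.
import hashlib
import json

import pymysql  # assumed MySQL client library

NUM_SHARDS = 16  # hypothetical shard count


def shard_for(blog_id):
    # A stable hash of the blog id decides which sharded table gets the post.
    digest = hashlib.md5(str(blog_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def insert_job(conn, job_payload):
    posts = json.loads(job_payload)  # one queue item holds several posts
    with conn.cursor() as cur:
        for post in posts:
            table = "posts_%02d" % shard_for(post["blog_id"])
            cur.execute(
                "INSERT INTO " + table + " (blog_id, url, published_at, body) "
                "VALUES (%s, %s, %s, %s)",
                (post["blog_id"], post["url"], post["published_at"], post["body"]),
            )
    conn.commit()
```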

We have two master databases and four slave servers. All new documents are written to the master databases and then replicated to the slave servers.

Outage

On the afternoon of 28 February we got a Nagios warning telling us that the queue was growing. We looked at the system and saw that we couldn't keep up with all the new blog posts being fetched. As we had noticed a similar decrease in insertion speed (without it affecting customers) the previous week, we monitored the system and waited to see how it would handle the load. Since we couldn't detect any hotspots in the system, we started looking at the insertion service to see if there was some locking, but we couldn't find anything.

During the following days we continued debugging and noticed delays in the INSERT queries on the MySQL servers.

[Graph: queued jobs on one of our Kestrel nodes. Each item contains several blog posts.]

Cause

ZFS & MySQL + dirty page flushing.

The master database servers run MySQL on a ZFS file system. The databases use InnoDB and store a lot of rows (billions) in multiple sharded tables.

MySQL regularly flushes dirty pages (modified pages in the InnoDB buffer pool) to disk, to keep memory usage in check.
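
This behaviour can be observed through the InnoDB status counters. A rough sketch of the kind of check we keep an eye on; hostname and credentials are placeholders:

```python
# Watch the InnoDB dirty-page ratio over time using the standard
# Innodb_buffer_pool_pages_* status counters.
import time

import pymysql


def dirty_page_ratio(conn):
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%'")
        status = dict(cur.fetchall())
    dirty = int(status["Innodb_buffer_pool_pages_dirty"])
    total = int(status["Innodb_buffer_pool_pages_total"])
    return dirty / float(total)


conn = pymysql.connect(host="db-master", user="monitor", password="...")
while True:
    print("dirty pages: %.1f%%" % (100 * dirty_page_ratio(conn)))
    time.sleep(60)
```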

ZFS divides storage into regions called metaslabs, and each metaslab has a space map that describes its free space. The space map is a log of allocations and deallocations. When ZFS needs to allocate space from a metaslab, it loads the space map into an in-memory AVL tree and then allocates new space by appending an allocation to the log. During this process it also looks for allocations and deallocations that cancel each other out, to condense the space map.
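
A deliberately simplified sketch of the idea (real ZFS keeps range-based AVL trees keyed on offset and size; this toy version just tracks block ids and is not ZFS code):

```python
# Toy model of a space map: an append-only log of allocations and frees that
# is replayed into an in-memory structure when the metaslab is loaded.
ALLOC, FREE = "A", "F"


def load_space_map(log):
    # Replay the log; a FREE cancels out an earlier ALLOC for the same block.
    allocated = set()
    for op, block in log:
        if op == ALLOC:
            allocated.add(block)
        else:
            allocated.discard(block)
    return allocated


def allocate(log, allocated, block):
    # New allocations are recorded by appending to the log, never by rewriting it.
    log.append((ALLOC, block))
    allocated.add(block)
```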

What we saw was that InnoDB had serious trouble flushing the dirty pages to disk, yet the disks weren't seeing much I/O. After some DTrace profiling we noticed that ZFS was spending a lot of time walking its AVL trees looking for free space.

Our theory is that our ZFS pool had become very fragmented and couldn't allocate new space at the rate InnoDB needed. This caused extremely slow insert/update performance.

As we started to understand what was happening, we tweaked InnoDB to minimize the I/O during writes and tried to optimize our ZFS pool. Since most of InnoDB's filesystem calls are synchronous, InnoDB has to wait until the filesystem has written the records to disk before each call returns. Because of the I/O congestion we looked into using SSDs as a separate log device in ZFS. The ZIL (ZFS Intent Log) records changes to keep the filesystem consistent in case of a crash, and ZFS has good support for moving the ZIL to separate disks to improve synchronous write performance.
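
This post doesn't list the exact InnoDB settings we changed, so the snippet below is an assumption showing the kind of runtime tweaks that reduce write I/O pressure; hostname, credentials and values are placeholders:

```python
# Hypothetical example of relaxing InnoDB's write I/O pressure at runtime.
# These are common knobs for this kind of tuning, not necessarily the ones we used.
import pymysql

tweaks = {
    # Flush the redo log to disk once per second instead of on every commit.
    "innodb_flush_log_at_trx_commit": 2,
    # Let a larger share of the buffer pool stay dirty before flushing gets aggressive.
    "innodb_max_dirty_pages_pct": 90,
}

conn = pymysql.connect(host="db-master", user="admin", password="...")
with conn.cursor() as cur:
    for name, value in tweaks.items():
        # SET GLOBAL only affects the running server; my.cnf must be updated separately.
        cur.execute("SET GLOBAL {} = {}".format(name, int(value)))
```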

Solution

We installed a pair of mirrored SSDs in each master server and used these for the ZIL. This immensely improved write and update performance on our MySQL servers, which gave them enough headroom to flush the dirty pages. After the upgrade we started the blog insert services again and monitored the InnoDB dirty pages closely. The solution seemed solid and all was well... or was it?
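
For reference, adding a mirrored SSD pair as a separate log device is a single zpool operation; the pool and device names below are placeholders for our actual hardware:

```python
# Sketch: attach a mirrored pair of SSDs as a dedicated ZFS intent log (ZIL) device.
import subprocess

POOL = "mysqlpool"            # placeholder pool name
SSDS = ["c2t0d0", "c2t1d0"]   # placeholder SSD device names

# Equivalent to running: zpool add mysqlpool log mirror c2t0d0 c2t1d0
subprocess.check_call(["zpool", "add", POOL, "log", "mirror"] + SSDS)

# "zpool status mysqlpool" should now list the mirror under a separate "logs" section.
```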

SSDs

After a while we noticed that one of our master MySQL servers was significantly slower than the other. Some more debugging showed that one of the SSDs was faulty. After replacing the faulty SSD, the insertion speed increased to a more normal rate, and after some further close monitoring of MySQL we felt confident that we had finally fixed the problem.

The future

After the database server upgrade, we're confident that the system can handle this load without any similar problems.

We started a project just before Christmas to replace all these database servers with a new storage solution that will be more performant and easier to monitor. It will allow us to more easily handle the increased load and enable product development.

We're terribly sorry for the inconvenience that this may have caused our customers. We're working hard to improve our routines and infrastructure to minimize the occurrence of similar events in the future.

Posted Aug 07, 2013 - 11:06 CEST

Resolved
Everything is back to normal and all documents that were queued up during the week are now indexed and available. We're terribly sorry for this incident and we will post a technical summary of what caused it.
Posted Mar 08, 2013 - 11:38 CET
Update
Still inserting content from the queue, but we’re seeing much better performance. There is still a small lag in the data.
Posted Mar 07, 2013 - 08:05 CET
Monitoring
Both master databases are now running with good performance, and we are in the process of inserting all queued content.
Posted Mar 06, 2013 - 23:33 CET
Update
There is still a lag, but one of the two master databases is running at full speed. We are going to do the same upgrade on the other master database around 11:00.
Posted Mar 06, 2013 - 10:46 CET
Update
Master database is up and running and we’re inserting all documents in the queue. Still a lag in the data.
Posted Mar 06, 2013 - 08:37 CET
Update
We are restarting the master database for some performance tweaks; no new content is being delivered at the moment.
Posted Mar 05, 2013 - 17:12 CET
Update
The master database is being populated with new content, but there's still a big lag.
Posted Mar 05, 2013 - 11:03 CET
Update
As of an hour ago we're slowly returning to normal operations. New content is being delivered, and all services should be back to normal speed in 12-24 hours.
Posted Mar 05, 2013 - 10:32 CET
Update
The new master databases are now online and being updated with documents.
Posted Mar 04, 2013 - 19:33 CET
Identified
We’re rolling out a new version of the master database.
Posted Mar 04, 2013 - 18:47 CET
Update
We're experiencing performance issues again. We're shutting down the master database for more maintenance; no new data will be delivered.
Posted Mar 04, 2013 - 18:11 CET
Update
The master database is now online and we're inserting all queued documents.
Posted Mar 04, 2013 - 14:24 CET
Update
Still in maintenance mode.
Posted Mar 04, 2013 - 13:37 CET
Update
We are shutting down the master database for maintenance. Twingly Search and the APIs will continue to work, but no new content will be available.
Posted Mar 04, 2013 - 12:05 CET
Investigating
We're experiencing delays in data delivery. No data is lost, but there is a delay of up to 72 hours before data is delivered through the API.

Starting this Friday, the performance of the main data storage started to degrade, which resulted in new documents being queued up while waiting to be saved.

We’ll be updating this post as we debug this problem.
Posted Mar 04, 2013 - 11:09 CET