Blog data delays

Incident Report for Twingly

Postmortem

We have been in the process of upgrading our main MySQL servers from FreeBSD 10.3 to 11.1. First we started with our secondary server, which was verified to work as expected in its current role (replication and read-only queries). There was a slight hiccup after the upgrade completed, in that our database ZFS pool wasn't mounted automatically. This turned out to be more problematic than we initially thought.

On 2018-04-24 we set out to switch to the upgraded server as our primary MySQL server, thereby handling all the writes.

The switch itself went fine, but shortly afterwards our monitoring alerted us to an unusual number of documents on our dead-letter queue (DLQ). We quickly found that the cause was exceptions thrown by the service that writes documents to MySQL.

Further investigation showed that the new MySQL primary wasn't configured with the correct sql_mode, causing writes to fail. This stemmed from the earlier issue with the unmounted ZFS pool. When FreeBSD started MySQL on that boot, it could not find the MySQL datadir, so it ran mysql_install_db, which created a MySQL options file with an sql_mode value different from the default value.
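A drift like this can be caught by comparing the sql_mode the server reports (for example from `SELECT @@GLOBAL.sql_mode`) against the value we expect. The following is a minimal sketch of such a comparison, not the check we actually run; the function names and the example mode values are assumptions chosen for illustration.

```python
from __future__ import annotations


def parse_sql_mode(value: str) -> frozenset[str]:
    """Split a comma-separated sql_mode string into a set of upper-case flags."""
    return frozenset(
        flag.strip().upper() for flag in value.split(",") if flag.strip()
    )


def sql_mode_drift(expected: str, actual: str) -> dict[str, list[str]]:
    """Report which flags are missing from, or unexpected in, the actual mode.

    Order and letter case of the flags are ignored.
    """
    want = parse_sql_mode(expected)
    have = parse_sql_mode(actual)
    return {
        "missing": sorted(want - have),
        "unexpected": sorted(have - want),
    }


# Illustrative values only; the incident report does not state the real ones.
drift = sql_mode_drift(
    expected="NO_ENGINE_SUBSTITUTION",
    actual="NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES",
)
print(drift)  # {'missing': [], 'unexpected': ['STRICT_TRANS_TABLES']}
```

An empty result in both lists means the server runs with the expected mode; anything else could block a primary switch before write traffic is moved.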

The issue did not manifest itself in our integration tests; it only surfaced once production write traffic hit the database.

To ensure this does not happen again, we have explicitly defined sql_mode in our ordinary options file, and added tasks to our deployment playbooks to ensure that only our config file exists.
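The "only our config file exists" check can be sketched roughly as follows. This is an illustration under stated assumptions rather than the playbook task itself: the function name is hypothetical, and the list of candidate locations must be supplied by the caller, since the places MySQL searches for options files depend on the platform and the build.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


def stray_option_files(candidates: Iterable[str], managed: str) -> list[str]:
    """Return candidate options files that exist but are not the managed one.

    candidates: paths where MySQL might pick up an options file (assumed input).
    managed:    the single options file that deployment is supposed to own.
    """
    managed_path = Path(managed).resolve()
    return sorted(
        str(path)
        for path in map(Path, candidates)
        if path.is_file() and path.resolve() != managed_path
    )


# Usage sketch with placeholder paths (assumptions, not our real layout):
# strays = stray_option_files(
#     ["/usr/local/etc/mysql/my.cnf", "/usr/local/my.cnf", "/etc/my.cnf"],
#     managed="/usr/local/etc/mysql/my.cnf",
# )
# if strays:
#     raise SystemExit(f"unexpected MySQL options files: {strays}")
```

Failing the deployment when the list is non-empty makes a leftover file, such as one generated by mysql_install_db, visible before the server takes writes.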

Posted May 11, 2018 - 12:19 CEST

Resolved

The issue has been resolved and the blog data ingestion system is running at full speed.

Posted Apr 24, 2018 - 19:10 CEST

Identified

There are currently some delays in our blog data delivery. We are working on solving the issue and will update this incident as soon as we know more.

Posted Apr 24, 2018 - 18:08 CEST

This incident affected: Blog Search API v3, Blog LiveFeed API v5, and Blog Data collection.