Tuesday, September 20, 2016

Lessons from Deploying MySQL GTID at Scale (at Facebook)

Global Transaction ID (GTID) is one of the most compelling new features of MySQL 5.6. It provides major benefits in failover, point-in-time backup recovery, and hierarchical replication, and it's a prerequisite for crash-safe multi-threaded replication. Over the course of the last few months, we enabled GTID on every production MySQL instance at Facebook. In the process, we learned a great deal about deployment and operational use of the feature. We plan to open source many of our server-side fixes via WebScaleSQL, as we believe others in the scale community can learn from this and benefit from the work we've done.

Background

Traditional MySQL replication is based on relative coordinates — each replica keeps track of its position with respect to its current master's binary log files. GTID enhances this setup by assigning a unique identifier to every transaction, and each MySQL server keeps track of which transactions it has already executed. This permits "auto-positioning," the ability for a replica to be pointed at a master instance without needing to specify a binlog filename or position in the CHANGE MASTER statement.
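For illustration, the difference shows up directly in the CHANGE MASTER statement; the hostname, file name, and position below are placeholders:

    -- Traditional replication: the replica must be told an exact binlog coordinate.
    CHANGE MASTER TO
        MASTER_HOST = 'master.example.com',       -- placeholder hostname
        MASTER_LOG_FILE = 'binary-logs.000123',   -- placeholder file name
        MASTER_LOG_POS = 45678;                   -- placeholder position

    -- GTID auto-positioning: the servers work out the correct position themselves.
    CHANGE MASTER TO
        MASTER_HOST = 'master.example.com',
        MASTER_AUTO_POSITION = 1;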

Auto-positioning makes failover simpler, faster, and less error-prone. It becomes trivial to get replicas in sync after a master failure, without requiring an external tool such as Master High Availability (MHA). Planned master promotions also become easier, as it is no longer necessary to stop all replicas at the same position first. Database administrators need not worry about manually specifying incorrect positions; even in the case of human error, the server is now smart enough to ignore transactions it has already executed.

By permitting replicas to be repointed to masters at different levels of the hierarchy, GTID greatly simplifies complex replication topologies, including hierarchical replication (slaves of slaves). Since a GTID-enabled binlog stream can safely be taken from any member of a replica set, as well as replayed without requiring relative positions, the feature also eases binlog backup and recovery.

Additionally, by combining GTID with semi-synchronous replication, we have implemented automation to safely recover crashed masters as replicas. When a master crashes, we can detect this and promote a replica within 30 seconds without losing data. Later, if the original master was able to be recovered and our automation detects its data is consistent, GTID allows us to repoint it to the new master instead of having to kick off a copy operation to replace it.
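A simplified version of that consistency check can be expressed with MySQL's built-in GTID set functions; the GTID set shown for the new master is a placeholder:

    -- Run on the recovered master: has everything it executed also reached the
    -- new master? GTID_SUBSET(a, b) returns 1 when set a is contained in set b.
    SELECT GTID_SUBSET(
        @@global.gtid_executed,                         -- recovered master's set
        'b9f4bc41-55dc-11e4-96c4-0800278655a6:1-500'    -- placeholder: new master's set
    ) AS safe_to_repoint;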

Deployment

GTID in MySQL 5.6 is extremely difficult to deploy to an existing large-scale environment. The gtid_mode variable is not dynamic, and it must also match between master and replicas. The official deployment plan in the MySQL documentation involves making a master read_only, shutting down all MySQL instances in the replica set simultaneously at the same position, enabling the gtid_mode variable in my.cnf, and then starting all of the instances. This process is fundamentally incompatible with the notion of high availability, making it nonviable for production use at scale.

We worked around this problem in fb-mysql by relaxing the constraint that gtid_mode must match between master and replicas. We permit a gtid_mode=ON replica to have a gtid_mode=OFF master, and we suppress assignment of GTIDs to transactions on servers with read_only=ON (i.e., replicas). This permits a high-availability deployment strategy as follows, for each replica set:
  1. On each replica, one at a time, restart MySQL to enable gtid_mode. Afterward, we are in a state where every replica has gtid_mode=ON, but the master still has gtid_mode=OFF.
  2. Perform a master promotion as normal, repointing the replicas and original master to a new master. The original master's replication will intentionally break when started, since it still has gtid_mode=OFF.
  3. Restart the original master to enable gtid_mode. It will now be able to replicate from the new master, and the entire replica set now has gtid_mode=ON.
With sufficient safeguards and validation logic, it is safe to run this rollout process across a large number of replica sets at a time. During the peak of the deployment, we were running our rollout script on up to hundreds of replica sets simultaneously.
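As a rough sketch, here is what step 1 and the post-promotion repoint look like under fb-mysql's relaxed matching; the hostname is a placeholder, and stock MySQL 5.6 would refuse this mixed gtid_mode state:

    -- my.cnf on each replica before its rolling restart (step 1):
    --   gtid_mode                = ON
    --   enforce_gtid_consistency = ON
    --   (log_bin and log_slave_updates must also be enabled for gtid_mode=ON)
    --
    -- After the master promotion (steps 2-3), repoint via auto-positioning:
    STOP SLAVE;
    CHANGE MASTER TO
        MASTER_HOST = 'new-master.example.com',  -- placeholder hostname
        MASTER_AUTO_POSITION = 1;
    START SLAVE;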

fb-mysql changes

To deploy GTID and make it work in our production environment, we had to make substantial changes both to the MySQL server and to the automation around it. Beyond the deployment changes, during initial testing we encountered a number of serious bugs and performance regressions with GTID.

In some cases, the MySQL server ended up scanning all binary logs whenever a replica connected with the new auto-position protocol, or during crash recovery, in order to initialize the global variables gtid_purged and gtid_executed. Opening the binary logs is required to read the previous_gtid_event present at the beginning of each log. In fb-mysql, we fixed these issues by changing the format of the binary log index file to include the previous_gtid_event corresponding to each binary log file name. fb-mysql reads the previous_gtid_events from the binary log index file instead of opening the binary logs themselves, which greatly improves performance.
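For reference, the two GTID sets in question can be inspected at runtime:

    -- gtid_executed: every GTID this server has applied;
    -- gtid_purged: GTIDs whose binary logs have already been purged.
    SELECT @@global.gtid_executed, @@global.gtid_purged;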

We also found that slaves with GTID are not crash-safe under less durable settings (sync_binlog != 1 and innodb_flush_log_at_trx_commit != 1). Fully durable settings require syncing both the binary log and the InnoDB transaction log to disk after each transaction in single-threaded replication mode, which hurts slave apply performance. At Facebook scale, it is important for any feature to be crash-safe to avoid operational overhead, so in fb-mysql we fixed this issue by adding a new transactional table (mysql.slave_gtid_info) to consistently store the GTID information.
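Concretely, the less durable settings look like this, and the fb-mysql state table named above can be inspected directly (its exact columns are an fb-mysql implementation detail):

    -- Less durable settings under which stock GTID slaves were not crash-safe:
    SET GLOBAL sync_binlog = 0;
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;
    -- fb-mysql persists replication GTID state in an InnoDB table instead, so it
    -- commits atomically with the replicated transaction:
    SELECT * FROM mysql.slave_gtid_info;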

GTID is a powerful feature that simplifies many replication complexities. For example, we identified that the multi-threaded slave does not work with relay_log_recovery=1, yet relay_log_recovery=1 is required for crash safety even under less durable replication settings (sync_relay_log != 1). In fb-mysql we allowed relay_log_recovery=1 together with the multi-threaded slave when GTID is enabled, since gaps in execution created after a multi-threaded slave crash are automatically filled by GTID auto-positioning.
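In configuration terms, the combination fb-mysql permits looks roughly like this (option names are stock MySQL 5.6; the worker count is a placeholder):

    -- my.cnf sketch: a crash-safe multi-threaded slave, allowed by fb-mysql
    -- only when GTID is enabled:
    --   gtid_mode              = ON
    --   relay_log_recovery     = 1   -- required for crash safety
    --   slave_parallel_workers = 8   -- placeholder: enables the multi-threaded slave
    --   sync_relay_log         = 0   -- less durable; gaps healed by auto-positioning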

Preparation

There were several steps we had to take prior to beginning our GTID deployment. One major step involved updating all of our automation to use GTID and auto-positioning. Enabling gtid_mode without use of auto-positioning is detrimental to replica crash-safety, so it is important to roll out both at once. The most substantial change was to our promotion logic, which now had to cover additional permutations for whether GTID was already enabled, or being enabled for the first time.

Another important prerequisite involves prevention of GTID-incompatible statements. MySQL has an option, enforce_gtid_consistency, that causes these statements to generate an error. For safety's sake, this should always be enabled whenever gtid_mode is also enabled. However, before beginning the rollout, it is necessary to audit applications and preemptively fix any uses of these query patterns. To make this possible at our scale, we augmented MySQL to add user stat counters for these statements, as well as an option to write full information on them to the MySQL error log. This allowed us to easily identify around 20 cases of these query patterns being used, among our thousands of special-case workloads.
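Two of the classic offending patterns, which enforce_gtid_consistency turns into hard errors (table names here are placeholders):

    -- CREATE TABLE ... SELECT is disallowed under enforce_gtid_consistency:
    CREATE TABLE new_copy AS SELECT * FROM source_table;

    -- So is CREATE TEMPORARY TABLE inside a transaction:
    BEGIN;
    CREATE TEMPORARY TABLE tmp_scratch (id INT);
    COMMIT;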

Finally, we wrote a script to aid in skipping statements in the rare cases where that is necessary. The age-old sql_slave_skip_counter variable does not work with gtid_mode enabled; instead, a DBA must set the gtid_next variable and insert an empty transaction. This is painful in an emergency, especially while a large DBA team is still ramping up on GTID knowledge, so having a helper script is prudent.
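The underlying procedure our script wraps is the standard empty-transaction trick; the UUID:number below is a placeholder for the transaction being skipped:

    STOP SLAVE;
    -- Claim the offending transaction's GTID, then commit nothing under it:
    SET gtid_next = '3e11fa47-71ca-11e1-9e33-c80aa9429562:23';  -- placeholder GTID
    BEGIN; COMMIT;  -- the empty transaction marks this GTID as executed
    SET gtid_next = 'AUTOMATIC';
    START SLAVE;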

For a more in-depth technical overview of our GTID-related MySQL changes and automation efforts, please see our slides from Percona Live MySQL Conference 2014.

Summary

Facebook's Global Transaction ID deployment was a cross-functional collaboration between our MySQL engineering, database operations, and data performance teams. Deploying GTID to a Facebook-scale environment required substantial effort, including major improvements to the MySQL server, changes to our automation, and a custom rollout script. Early on in the rollout process, we also uncovered, reported, and fixed several bugs and performance issues in MySQL relating to GTID support. We can happily state that it is now extremely stable in our use, with no new problems encountered in recent months. We plan to make these same fixes to the WebScaleSQL branch of MySQL in the next few weeks, so that others can learn and benefit from this.

Despite the effort involved, deploying GTID has proven to be well worth the time commitment. The feature has provided us with immediate benefits, in addition to serving as a base for further automation improvements in the near future.

List of GTID bugs
  • #69059 GTID lack a reasonable deployment strategy
  • #69097 Mysqld scans all binary logs on crash recovery
  • #68386 Master scans all binlogs when slave reconnects with auto positioning
  • #70659 Make crash safe slave work with gtid + less durable settings
  • #69943 Transactions skipped on slave after "stop/start slave" using GTID replication
  • #71575 Master logs two consecutive GTIDs causing slaves to miss the first GTID
  • #72313 Stop sql_thread, start sql_thread causes a trx to log with a different GTID
  • #72314 Stop io_thread, start io_thread with GTID may cause data inconsistencies
  • #72635 Data inconsistencies when master has truncated binary log with GTID after crash
  • #73032 Setting gtid_purged may break auto_position and thus slaves
  • #70711 Mysqlbinlog prints invalid SQL from relay logs when GTID is enabled
  • #73397 Make MTS work with relay_log_recovery=1 when GTID is enabled
