Tuesday, September 20, 2016

Lessons from Deploying MySQL GTID at Scale (at Facebook)

Global Transaction ID (GTID) is one of the most compelling new features of MySQL 5.6. It provides major benefits in failover, point-in-time backup recovery, and hierarchical replication, and it's a prerequisite for crash-safe multi-threaded replication. Over the course of the last few months, we enabled GTID on every production MySQL instance at Facebook. In the process, we learned a great deal about deployment and operational use of the feature. We plan to open source many of our server-side fixes via WebScaleSQL, as we believe others in the scale community can learn from this and benefit from the work we've done.

Background

Traditional MySQL replication is based on relative coordinates — each replica keeps track of its position with respect to its current master's binary log files. GTID enhances this setup by assigning a unique identifier to every transaction, and each MySQL server keeps track of which transactions it has already executed. This permits "auto-positioning," the ability for a replica to be pointed at a master instance without needing to specify a binlog filename or position in the CHANGE MASTER statement.
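For illustration, repointing a replica with auto-positioning looks roughly like this (the hostname and credentials below are placeholders, not values from this deployment):

mysql> CHANGE MASTER TO MASTER_HOST='new-master.example.com', MASTER_USER='repl', MASTER_PASSWORD='...', MASTER_AUTO_POSITION=1;
mysql> START SLAVE;

Note that there is no MASTER_LOG_FILE or MASTER_LOG_POS; the replica negotiates the correct starting point with the master based on the GTID sets it has already executed.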

Auto-positioning makes failover simpler, faster, and less error-prone. It becomes trivial to get replicas in sync after a master failure, without requiring an external tool such as Master High Availability (MHA). Planned master promotions also become easier, as it is no longer necessary to stop all replicas at the same position first. Database administrators need not worry about manually specifying incorrect positions; even in the case of human error, the server is now smart enough to ignore transactions it has already executed.

By permitting replicas to be repointed to masters at different levels of the hierarchy, GTID greatly simplifies complex replication topologies, including hierarchical replication (slaves of slaves). Since a GTID-enabled binlog stream can safely be taken from any member of a replica set, as well as replayed without requiring relative positions, the feature also eases binlog backup and recovery.

Additionally, by combining GTID with semi-synchronous replication, we have implemented automation to safely recover crashed masters as replicas. When a master crashes, we can detect this and promote a replica within 30 seconds without losing data. Later, if the original master was able to be recovered and our automation detects its data is consistent, GTID allows us to repoint it to the new master instead of having to kick off a copy operation to replace it.

Deployment

GTID in MySQL 5.6 is extremely difficult to deploy to an existing large-scale environment. The gtid_mode variable is not dynamic, and also must match between master and replicas. The official deployment plan in the MySQL documentation involves making a master read_only, shutting down all MySQL instances in the replica set simultaneously at the same position, enabling the gtid_mode variable in my.cnf, and then starting all of the instances. This process is fundamentally incompatible with the notion of high availability, making it unviable for production use at scale.

We worked around this problem in fb-mysql by relaxing the constraint that gtid_mode must match between master and replicas. We permit a gtid_mode=ON replica to have a gtid_mode=OFF master, and we suppress assignment of GTIDs to transactions on servers with read_only=ON (i.e., replicas). This permits a high-availability deployment strategy as follows, for each replica set:
  1. On each replica, one at a time, restart MySQL to enable gtid_mode. Afterward, we are in a state where every replica has gtid_mode=ON, but the master still has gtid_mode=OFF.
  2. Perform a master promotion as normal, repointing the replicas and original master to a new master. The original master's replication will intentionally break when started, since it still has gtid_mode=OFF.
  3. Restart the original master to enable gtid_mode. It will now be able to replicate from the new master, and the entire replica set now has gtid_mode=ON.
With sufficient safeguards and validation logic, it is safe to execute this rollout process to a large number of replica sets at a time. During the peak of the deployment process, we were running our rollout script on up to hundreds of replica sets simultaneously.
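For reference, a fully converted instance ends up with my.cnf settings along these lines (a minimal sketch of the standard MySQL 5.6 GTID options, not our exact production configuration; stock 5.6 requires log_bin, log_slave_updates, and enforce_gtid_consistency whenever gtid_mode=ON):

[mysqld]
log_bin                  = binlog
log_slave_updates        = 1
gtid_mode                = ON
enforce_gtid_consistency = 1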

fb-mysql changes

To deploy GTID and make it work in our production environment, we had to make huge changes both in the MySQL server and in the automation around it. Apart from the deployment changes, during initial testing we encountered a number of serious bugs and performance regressions with GTID.

In some cases, the MySQL server ended up scanning all binary logs whenever a replica connected with the new auto-position protocol, or during crash recovery, in order to initialize the global variables GTID_PURGED and GTID_EXECUTED. Opening the binary logs is required to read the previous_gtid_event present at the beginning of each binary log. In fb-mysql, we fixed these issues by changing the format of the binary log index file to include the previous_gtid_event corresponding to each binary log file name. Fb-mysql reads the previous_gtid_events from the binary log index file instead of opening the binary logs directly, which improves performance.

We also found that slaves with GTID are not crash-safe with less durable settings (sync_binlog != 1 and innodb_flush_log_at_trx_commit != 1). Using fully durable settings requires syncing both the binary log and the InnoDB transaction log to disk after each transaction in single-threaded replication mode, which hurts slave apply performance. At Facebook scale, it is important for any feature to be crash-safe to avoid operational overhead, so in fb-mysql we fixed this issue by adding a new transactional table (mysql.slave_gtid_info) to consistently store the GTID information.

GTID also helps work around other replication limitations. For example, we identified that the multi-threaded slave does not work with relay_log_recovery=1, yet relay_log_recovery=1 is required for crash safety even with less durable replication settings (sync_relay_log != 1). In fb-mysql we allow the use of relay_log_recovery=1 with the multi-threaded slave when GTID is enabled, since gaps in execution left after a multi-threaded slave crash are automatically filled in by GTID auto-positioning.

Preparation

There were several steps we had to take prior to beginning our GTID deployment. One major step involved updating all of our automation to use GTID and auto-positioning. Enabling gtid_mode without use of auto-positioning is detrimental to replica crash-safety, so it is important to roll out both at once. The most substantial change was to our promotion logic, which now had to cover additional permutations for whether GTID was already enabled, or being enabled for the first time.

Another important prerequisite involves prevention of GTID-incompatible statements. MySQL has an option, enforce_gtid_consistency, that causes these statements to generate an error. For safety's sake, this should always be enabled whenever gtid_mode is also enabled. However, before beginning the rollout, it is necessary to audit applications and preemptively fix any uses of these query patterns. To make this possible at our scale, we augmented MySQL to add user stat counters for these statements, as well as an option to write full information on them to the MySQL error log. This allowed us to easily identify around 20 cases of these query patterns being used, among our thousands of special-case workloads.
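The statements in question are the usual suspects; a couple of hedged illustrations (table names invented here) of patterns that enforce_gtid_consistency rejects:

mysql> CREATE TABLE report_copy AS SELECT * FROM report;     -- rejected: CREATE TABLE ... SELECT
mysql> BEGIN; CREATE TEMPORARY TABLE tmp_ids (id INT);       -- rejected: temporary table inside a transaction

Transactions that update both transactional and non-transactional tables are rejected as well.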

Finally, we wrote a script to aid in skipping statements, in the rare cases where that is necessary. The age-old variable sql_slave_skip_counter does not work with gtid_mode enabled; instead, a DBA must fiddle with the gtid_next variable and insert an empty transaction. This is painful in an emergency, especially while a large DBA team is still ramping up on GTID knowledge, so having a helper script is prudent.
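The manual procedure the script wraps looks roughly like this (the GTID value shown is a placeholder for the transaction being skipped):

mysql> STOP SLAVE;
mysql> SET gtid_next = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:123';
mysql> BEGIN; COMMIT;                -- commit an empty transaction under that GTID
mysql> SET gtid_next = 'AUTOMATIC';
mysql> START SLAVE;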

For a more in-depth technical overview of our GTID-related MySQL changes and automation efforts, please see our slides from Percona Live MySQL Conference 2014.

Summary

Facebook's Global Transaction ID deployment was a cross-functional collaboration between our MySQL engineering, database operations, and data performance teams. Deploying GTID to a Facebook-scale environment required substantial effort, including major improvements to the MySQL server, changes to our automation, and a custom rollout script. Early on in the rollout process, we also uncovered, reported, and fixed several bugs and performance issues in MySQL relating to GTID support. We can happily state that it is now extremely stable in our use, with no new problems encountered in recent months. We plan to make these same fixes to the WebScaleSQL branch of MySQL in the next few weeks, so that others can learn and benefit from this.

Despite the effort involved, deploying GTID has proven to be well-worth the time commitment. The feature has provided us immediate benefits, in addition to being a base for further automation improvements in the near future.

List of GTID bugs
  • #69059 GTID lack a reasonable deployment strategy
  • #69097 Mysqld scans all binary logs on crash recovery
  • #68386 Master scans all binlogs when slave reconnects with auto positioning
  • #70659 Make crash safe slave work with gtid + less durable settings
  • #69943 Transactions skipped on slave after "stop/start slave" using GTID replication
  • #71575 Master logs two consecutive GTIDs causing slaves to miss the first GTID
  • #72313 Stop sql_thread, start sql_thread causes a trx to log with a different GTID
  • #72314 Stop io_thread, start io_thread with GTID may cause data inconsistencies
  • #72635 Data inconsistencies when master has truncated binary log with GTID after crash
  • #73032 Setting gtid_purged may break auto_position and thus slaves
  • #70711 Mysqlbinlog prints invalid SQL from relay logs when GTID is enabled
  • #73397 Make MTS work with relay_log_recovery=1 when GTID is enabled

MySQL Replication and GTID-based failover

For years, MySQL replication was based on binary log events - all a slave knew was the exact event and the exact position it had just read from the master. Any single transaction from a master may have ended up in different binary logs, and at different positions within those logs, on different hosts. It was a simple solution that came with limitations - more complex topology changes could require an admin to stop replication on the hosts involved, or could cause other issues; e.g., a slave couldn't be moved down the replication chain without a time-consuming rebuild process (we couldn't easily change replication from A -> B -> C to A -> C -> B without stopping replication on both B and C). We've all had to work around these limitations while dreaming about a global transaction identifier.

GTID was introduced along with MySQL 5.6, and brought some major changes in the way MySQL operates. First of all, every transaction has a unique identifier which identifies it in the same way on every server. It no longer matters in which binary log position a transaction was recorded; all you need to know is the GTID: '966073f3-b6a4-11e4-af2c-080027880ca6:4'. A GTID is built from two parts - the unique identifier of the server where the transaction was first executed, and a sequence number. In the above example, we can see that the transaction was executed by the server with server_uuid '966073f3-b6a4-11e4-af2c-080027880ca6', and that it was the 4th transaction executed there. This information is enough to perform complex topology changes - MySQL knows which transactions have been executed and therefore knows which transactions need to be executed next. Forget about binary logs, it's all in the GTID.
So, where can you find GTIDs? You'll find them in two places. On a slave, in 'SHOW SLAVE STATUS;' you'll find two columns: Retrieved_Gtid_Set and Executed_Gtid_Set. The first covers GTIDs that were retrieved from the master via replication; the second lists all transactions that were executed on the given host - whether via replication or locally.

Setting up a Replication Cluster the easy way

We’ll use the Severalnines Configurator to automatically deploy our replication setup. First, you need to point your browser at:
http://www.severalnines.com/replication-configurator/
The first page doesn't give you too many options, so you can simply click the 'Next' button.

The next screen contains some options regarding the operating system, where the infrastructure will be created, and so forth. All of the options are explained in detail; one thing worth commenting on is the 'Number of MySQL Slaves'. By default, the deploy scripts create a master-master pair connected by semi-sync replication. This is the smallest possible block, created when you set the number of slaves to 0. Every slave that you want to create will be connected to this master-master pair.

The third screen is related to the MySQL configuration of the database nodes - you can define how 'large' the nodes will be in terms of CPU and memory, and you can also set the InnoDB buffer pool sizes and the predicted workload pattern.
The last screen lets you fill in the IP addresses of the ClusterControl server and the nodes in the replication setup.
Finally, you need to fill in your email address, to which the file with deployment scripts will be sent.

Deployment
For this blog post, let's assume we want to create the infrastructure on premises. We'll use a couple of Vagrant nodes. Deployment on EC2 may work in a slightly different way than what you see below.
When all nodes are up and running, you need to copy the tarball that you received via email to the ClusterControl node. Next, untar it, go to the install directory and execute the deploy.sh script:
$ tar zxf s9s-mysql-56.tar.gz
$ cd s9s-mysql-56/mysql/scripts/install/
$ ./deploy.sh
At the beginning of the deploy process you’ll be asked the following:
Can you SSH from this host to all other hosts without password?
Choosing 'n' will allow you to setup shared keys. (y/n):
If you have passwordless SSH set up already, you can choose 'y' here. Otherwise, you'll be asked to provide root passwords for all nodes so that SSH keys can be generated and distributed.
The deployment will continue, and you should have your replication setup up and running after 15 to 20 minutes. We can now take a look at how GTID works.

Errant transactions - what is the issue?
As we mentioned at the beginning of this post, GTIDs brought a significant change in the way people should think about MySQL replication. It's all about habits. Let's say, for some reason, that an application performed a write on one of the slaves. It shouldn't have happened, but surprisingly, it happens all the time. As a result, replication stops with a duplicate key error. There are a couple of ways to deal with such a problem. One of them is to delete the offending row and restart replication. The other is to skip the binary log event and then restart replication.
mysql> STOP SLAVE SQL_THREAD; SET GLOBAL sql_slave_skip_counter = 1; START SLAVE SQL_THREAD;
Both ways should bring replication back to a working state, but they may introduce data drift, so it is necessary to remember that slave consistency should be checked after such an event (pt-table-checksum and pt-table-sync work well here).
If a similar problem happens while using GTID, you'll notice some differences. Deleting the offending row may seem to fix the issue, and replication should be able to resume. The other method, using sql_slave_skip_counter, won't work at all - it will return an error. Remember, it's no longer about binlog events; it's all about whether a given GTID has been executed or not.
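For example, on MySQL 5.6 the attempt fails immediately (error text abbreviated and approximate):

mysql> SET GLOBAL sql_slave_skip_counter = 1;
ERROR 1858 (HY000): sql_slave_skip_counter can not be set when the server is running with @@GLOBAL.GTID_MODE = ON. Instead, for each transaction that you want to skip, generate an empty transaction [...]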
Why does deleting the row only 'seem' to fix the issue? One of the most important things to keep in mind regarding GTID is that a slave, when connecting to a master, checks whether it is missing any transactions that were executed on that master, and if it finds such transactions, it retrieves and executes them. Transactions executed locally on a slave but not on the master are called errant transactions - after a failover, the other hosts will see them as missing and replay them. Let's assume we ran the following SQL to clear the offending row:
mysql> DELETE FROM mytable WHERE id=100;
Let’s check show slave status:
                  Master_UUID: 966073f3-b6a4-11e4-af2c-080027880ca6
           Retrieved_Gtid_Set: 966073f3-b6a4-11e4-af2c-080027880ca6:1-29
            Executed_Gtid_Set: 84d15910-b6a4-11e4-af2c-080027880ca6:1,
966073f3-b6a4-11e4-af2c-080027880ca6:1-29,
And see where the 84d15910-b6a4-11e4-af2c-080027880ca6:1 comes from:
mysql> SHOW VARIABLES LIKE 'server_uuid'\G
*************************** 1. row ***************************
Variable_name: server_uuid
        Value: 84d15910-b6a4-11e4-af2c-080027880ca6
1 row in set (0.00 sec)
As you can see, we have 29 transactions that came from the master (UUID 966073f3-b6a4-11e4-af2c-080027880ca6) and one that was executed locally. Let's say that at some point we fail over and the old master (966073f3-b6a4-11e4-af2c-080027880ca6) becomes a slave. It will check its list of executed GTIDs and will not find this one: 84d15910-b6a4-11e4-af2c-080027880ca6:1. As a result, the related SQL will be executed on it:
mysql> DELETE FROM mytable WHERE id=100;
This is not something we expected... And if, in the meantime, the binlog containing this transaction had been purged on the old slave (now the new master), then the new slave will complain after failover:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'

How to detect errant transactions?
MySQL provides two functions which come in very handy when you want to compare GTID sets on different hosts.
GTID_SUBSET() takes two GTID sets and checks if the first set is a subset of the second one.
Let's say we have the following state.
Master:
mysql> show master status\G
*************************** 1. row ***************************
             File: binlog.000002
         Position: 160205927
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: 8a6962d2-b907-11e4-bebc-080027880ca6:1-153,
9b09b44a-b907-11e4-bebd-080027880ca6:1,
ab8f5793-b907-11e4-bebd-080027880ca6:1-2
1 row in set (0.00 sec)
Slave:
mysql> show slave status\G
[...]
           Retrieved_Gtid_Set: 8a6962d2-b907-11e4-bebc-080027880ca6:1-153,
9b09b44a-b907-11e4-bebd-080027880ca6:1
            Executed_Gtid_Set: 8a6962d2-b907-11e4-bebc-080027880ca6:1-153,
9b09b44a-b907-11e4-bebd-080027880ca6:1,
ab8f5793-b907-11e4-bebd-080027880ca6:1-4
We can check if the slave has any errant transactions by executing the following SQL:
mysql> SELECT GTID_SUBSET('8a6962d2-b907-11e4-bebc-080027880ca6:1-153,ab8f5793-b907-11e4-bebd-080027880ca6:1-4', '8a6962d2-b907-11e4-bebc-080027880ca6:1-153, 9b09b44a-b907-11e4-bebd-080027880ca6:1, ab8f5793-b907-11e4-bebd-080027880ca6:1-2') as is_subset\G
*************************** 1. row ***************************
is_subset: 0
1 row in set (0.00 sec)
Looks like there are errant transactions. How do we identify them? We can use another function, GTID_SUBTRACT().
mysql> SELECT GTID_SUBTRACT('8a6962d2-b907-11e4-bebc-080027880ca6:1-153,ab8f5793-b907-11e4-bebd-080027880ca6:1-4', '8a6962d2-b907-11e4-bebc-080027880ca6:1-153, 9b09b44a-b907-11e4-bebd-080027880ca6:1, ab8f5793-b907-11e4-bebd-080027880ca6:1-2') as missing\G
*************************** 1. row ***************************
missing: ab8f5793-b907-11e4-bebd-080027880ca6:3-4
1 row in set (0.01 sec)
Our missing GTIDs are ab8f5793-b907-11e4-bebd-080027880ca6:3-4 - those transactions were executed on the slave but not on the master.

How to solve issues caused by errant transactions?
There are two ways - inject empty transactions or exclude transactions from GTID history.
To inject empty transactions we can use the following SQL:
mysql> SET gtid_next='ab8f5793-b907-11e4-bebd-080027880ca6:3';
Query OK, 0 rows affected (0.01 sec)
 
mysql> begin ; commit;
Query OK, 0 rows affected (0.00 sec)
   
Query OK, 0 rows affected (0.01 sec)
 
mysql> SET gtid_next='ab8f5793-b907-11e4-bebd-080027880ca6:4';
Query OK, 0 rows affected (0.00 sec)
 
mysql> begin ; commit;
Query OK, 0 rows affected (0.00 sec)
 
Query OK, 0 rows affected (0.01 sec)
 
mysql> SET gtid_next=automatic;
Query OK, 0 rows affected (0.00 sec)
This has to be executed on every host in the replication topology that has not executed those GTIDs. If the master is available, you can inject those transactions there and let them replicate down the chain. If the master is not available (for example, it crashed), those empty transactions have to be executed on every slave. Oracle developed a tool called mysqlslavetrx which is designed to automate this process.
Another approach is to remove the GTIDs from history:
Stop slave:
mysql> STOP SLAVE;
Print Executed_Gtid_Set on the slave:
mysql> SHOW MASTER STATUS\G
Reset GTID info:
mysql> RESET MASTER;
Set GTID_PURGED to the correct GTID set, based on the data from SHOW MASTER STATUS, excluding the errant transactions from the set:
mysql> SET GLOBAL GTID_PURGED='8a6962d2-b907-11e4-bebc-080027880ca6:1-153, 9b09b44a-b907-11e4-bebd-080027880ca6:1, ab8f5793-b907-11e4-bebd-080027880ca6:1-2';
Start slave:
mysql> START SLAVE\G
In every case, you should verify the consistency of your slaves using pt-table-checksum and pt-table-sync (if needed) - errant transactions may result in data drift.

TokuMX Fractal Tree(R) indexes, what are they?

With the recent release of TokuMX 1.0, we've made some bold claims about how fast TokuMX can run MongoDB workloads. In this post, I want to dig into one of the big areas of improvement: write performance and reduced I/O.
One of the innovations of TokuMX is that it eliminates a long-held rule of databases: to get good write performance, the working set of your indexes should fit in memory. The standard reasoning goes along the lines of: if your indexes’ working set does not fit in memory, then your writes will induce I/O, you will become I/O bound, and performance will suffer. So, either make sure your indexes fit in memory, or make sure your indexes have an insertion pattern that keeps the working set small, like right-most insertions.
With TokuMX, THIS SIMPLY ISN’T TRUE. The innovation of Fractal Tree indexes is that as your working set grows larger than main memory, write performance stays consistent. This innovation is why Fractal Tree indexes perform so well on write-heavy benchmarks (for both MongoDB and MySQL).
So how does TokuMX achieve this write performance where many other databases struggle? By replacing B-trees, the predominant storage data structure in many databases (MongoDB, MySQL, BerkeleyDB, etc.), with Fractal Tree indexes, a write-optimized data structure.
What do we mean by a write-optimized data structure?
To understand what we mean, we first need to understand why a B-Tree struggles when indexes no longer fit in memory. Below is a picture of a B-tree.

A B-tree is a simple (and elegant) data structure. The internal nodes store many pivots and pointers, and the leaf nodes store all the data. To insert into a B-tree, one must traverse to the leaf node where the data belongs, and place the data into the leaf node. If all of the data fits in memory, this is fast. But if most of the data does not fit in memory (as in the picture above, where only the internal nodes and very few leaf nodes fit), then retrieving that leaf node will require an I/O. In fact, nearly all insertions will incur an I/O. This is where the I/O bottleneck comes from. This is where the struggling write performance comes from. If your hard disk can do on the order of a few hundred I/O’s per second, then your B-tree can handle at most a few hundred insertions per second. This is why MongoDB and MySQL struggle with iiBench, and users are justifiably told to “keep the working set of indexes in memory”.
So why are Fractal Tree indexes so much better? In short, they drastically reduce the I/O. Here is how.
The key difference between Fractal Tree indexes and B-Trees that explains the difference in write performance can be found in the internal nodes:
  • with B-trees, internal nodes store just pivots and pointers for each child
  • with Fractal Tree indexes, internal nodes store pivots, pointers, and buffers for each child

Note in the picture above that in the internal node, for each child, there is a grey buffer.
The buffers batch up write operations, so that a write works as follows:
  • in the root node, find out which child the write SHOULD traverse down
  • serialize the pending operation into the buffer
  • if the buffer associated with that child has space, return. If the node’s buffer has no space, flush the pending operations in the buffer down a level, thereby making space for future writes.
The flush of a buffer in the root node may cause a cascading flush. That is, the flush in the root node may flood the child with enough data such that now the child’s buffers are full, and the child needs to flush. This keeps happening until data eventually flushes all the way down to leaves.
So why does this algorithm result in such better performance? The short answer is reduced I/O (really, it’s ALL about the I/O). With I/O’s being so expensive, if we must do an I/O we want the benefit we receive to be worth it. With B-trees, on a write, we do an I/O to insert one measly document or row, or key/value pair. With Fractal Tree indexes, by assuming the root node is always in memory, we know that when we perform an I/O on a write, we do it to flush a buffer’s worth of data. This may contain many many documents (or rows, etc…). With each I/O servicing many more writes, Fractal Tree indexes reduce the amount of I/O done by a LARGE factor, thereby eliminating the I/O bottleneck that B-Trees have.
Because of this I/O reduction, Fractal Tree indexes don’t require indexes to fit in memory, and TokuMX is able to achieve such high sustained write performance on data that does not fit in memory.
Another interesting thing to note about these algorithmic properties is that if the data resides in memory, then Fractal Tree indexes are not providing any algorithmic advantage over B-Trees for write performance. If everything fits in memory, then algorithmically, both data structures are fast.

Monday, September 19, 2016

Is Synchronous Replication right for your app?

I talk with a lot of people who are really interested in Percona XtraDB Cluster (PXC), and mostly they are interested in PXC as a high-availability solution.  But what they tend not to think too much about is whether moving from async to synchronous replication is right for their application or not.

Facts about Galera replication

There are a lot of different facts about Galera that come into play here, and it isn't always obvious how they will affect your database workload.  For example:
  • Transaction commit takes approximately the worst packet round trip time (RTT) between any two nodes in your cluster.
  • Transaction apply on slave nodes is still asynchronous from client commit (except on the original node where the transaction is committed)
  • Galera prevents writes that conflict with these pending transactions while they are in flight, by returning deadlock errors.  (This is actually a form of Eventual Consistency where the client is forced to correct the problem before it can commit.  It is NOT the typical form of Eventual Consistency, known as asynchronous repair, that most people think of).

Callaghan’s Law

But what does that all actually mean?  Well, at the Percona Live conference a few weeks ago I heard a great maxim that really helps encapsulate a lot of this information and puts it into context with your application workload:
[In a Galera cluster] a given row can’t be modified more than once per RTT
This was attributed to Mark Callaghan from Facebook by Alexey Yurchenko from Codership at his conference talk.  Henceforth this will be known as “Callaghan’s law” in Galera circles forever, though Mark didn’t immediately recall saying it.

Applied to a standalone Innodb instance

Let's break it down a bit.  Our unit of locking in Innodb is a single row (well, the PRIMARY KEY index entry for that row).  This means that typically, on a single Innodb node, we can have all sorts of modifications floating around as long as they don't touch the same row.  Row locks are held for modifications until the transaction commits, and that takes an fsync to the redo log by default, so applying Callaghan's law to single-server Innodb, we'd get:
[On a single node Innodb server] a given row can't be modified more than once per fsync
You can obviously relax that by simply not fsyncing every transaction (innodb_flush_log_at_trx_commit != 1), or work around it by fsyncing to memory (a battery- or capacitor-backed write cache), etc., but the principle is basically the same.  If we want this transaction to persist after a crash, it has to get to disk.
This has no effect on standard MySQL replication from this instance, since MySQL replication is asynchronous.

What about semi-sync MySQL replication?

It’s actually much worse than Galera.  As I illustrated in a blog post last year, semi-sync must serialize all transactions and wait for them one at a time.  So, Callaghan’s law applied to semi-sync is:
[On a semi-sync replication master] you can’t commit (at all) more than once per RTT. 

Applied to a Galera cluster

In the cluster we’re protecting the data as well, though not by ensuring it goes to disk (though you can do that).  We protect the data by ensuring it gets to every node in the cluster.
But why every node and not just a quorum?  Well, it turns out transaction ordering really, really matters (really!).  By enforcing replication to all nodes, we can (simultaneously) establish global ordering for the transaction, so by the time the original node gets acknowledgement of the transaction back from all the other nodes, a GTID will also (by design) be established.  We’ll never end up with non-deterministic ordering of transactions as a result.
So this brings us back to Callaghan’s law for Galera.  We must have group communication to replicate and establish global ordering for every transaction, and the expense of doing that for Galera is approximately one RTT between the two nodes in the cluster that are furthest apart (regardless of where the commit comes from!).  The least amount of data we can change in Innodb at a time is a single row, so the most any single row can be modified cluster-wide is once per RTT.

What about WAN clusters?

Callaghan's law applies to WAN clusters as well.  LANs usually have sub-millisecond RTTs.  WANs usually have anywhere from a few ms up to several hundred.  This opens a large window where a given row can't be updated more than a few times per second at best.

Some things the rule does not mean on Galera

  • It does NOT mean you can’t modify different rows simultaneously.  You can.
  • It does NOT mean you can’t modify data on multiple cluster nodes simultaneously.  You can.
  • It does NOT set a lower bound on performance, only an upper bound.  The best performance you can expect is modifying a given row once per RTT; it could get slower if apply times start to lag.

So what about my application?

Think about your workload.  How frequently do you update any given row?  We call rows that are updated heavily "hotspots".

Examples of hotspots

Example 1: Your application is an online game and you keep track of global achievement statistics in a single table with a row for each stat; there are just a few hundred rows.  When a player makes an achievement, your application updates this table with a statement like this:
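A hypothetical version of such a statement (the table and column names are invented for this illustration) could be:

mysql> UPDATE achievements SET times_awarded = times_awarded + 1 WHERE achievement_id = 42;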
How many players might accomplish this achievement at the same time?
Example 2: You have users and groups in your application.  These are maintained in separate tables and there also exists a users_groups table to define the relationship between them.  When someone joins a group, you run a transaction that adds the relationship row to users_groups, but also updates groups with some metadata:
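Hypothetically again (schema invented for illustration, keeping the 'joined' metadata on the parent groups row), the transaction might look like:

mysql> BEGIN;
mysql> INSERT INTO users_groups (user_id, group_id) VALUES (123, 45);
mysql> UPDATE groups SET member_count = member_count + 1, joined = NOW() WHERE id = 45;
mysql> COMMIT;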
How often might multiple users join the same group?

Results

In both of the above examples you can imagine plenty of concurrent clients attempting to modify the same record at once.  But what will actually happen to the clients who try to update the same row within the same RTT?  This depends on which node in the cluster the writes are coming from:
From the same node: This will behave just like standard Innodb.  The first transaction will acquire the necessary row locks while it commits (which will take the 1 RTT).  The other transactions will lock wait until the lock(s) they need are available.  The application just waits in those cases.
From other nodes: First to commit wins.  The others that try to commit AFTER the first and while the first is still in the local apply queue on their nodes will get a deadlock error.
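What the losing clients actually see is the standard InnoDB deadlock error, since Galera reports certification failures this way:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction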
So, the best case (which may not be best for your application database throughput) will be more write latency into the cluster.  The worst case is that your transactions won’t even commit and you have to take some action you normally wouldn’t have had to do.

Workarounds

If your hotspots were really bad in standalone Innodb, you might consider relaxing the fsync: set innodb_flush_log_at_trx_commit to something besides 1 and suddenly you can update much faster.  I see this tuning very frequently for “performance” reasons when data durability isn't as crucial.  This is fine as long as you weigh both options carefully.
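On a standalone instance that tuning is a one-liner, assuming you accept the durability trade-off:

mysql> SET GLOBAL innodb_flush_log_at_trx_commit = 2;   -- write the redo log at each commit, fsync roughly once per second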
But in Galera you cannot relax synchronous replication.  You can't change the law, you can only adapt around it. But how might you do that?

Write to one node

If your issue is really the deadlock errors and not so much the waiting, you could simply send all your writes to one node.  This should prevent the deadlock errors, but will not change the lock waiting that your application will need to do for hotspots.

wsrep_retry_autocommit

If your hotspots are all updates with autocommits, you can rely on wsrep_retry_autocommit to auto-retry the transactions for you.  However, each autocommit is retried only the number of times specified by this variable (default is 1 retry).  This means more waiting, and after the limit is exceeded you will still get the deadlock error.
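The variable is dynamic, so bumping the retry count is trivial (the value 4 below is arbitrary):

mysql> SET GLOBAL wsrep_retry_autocommit = 4;   -- retry a conflicting autocommit statement up to 4 times before surfacing the deadlock error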
This is not implemented for full BEGIN … COMMIT multi-statement transactions, since it cannot be assumed that they are free of application logic between the statements, and such logic may not be safe to retry after the database state changes.

retry deadlocks

Now we start to get into (*gasp*) territory where your application needs to be modified.  Generally, if you use Innodb, you should be able to handle deadlock errors in your application.  Raise your hands if your application has that logic (I usually get fewer than 5 people out of 100 who do).
But, what to do?  Retrying automatically, or giving your end user a chance to retry manually are typical answers.  However, this means more latency waiting for a write to go through, and possibly some poor user experience.

batch writes

Instead of updating global counters one at a time (from Example 1, above), how about maintaining the counter in memcache or redis and only flushing to the database periodically?

change your schema

In Example 2, above, how about moving the ‘joined’ column to the users_groups table so we don’t need to update the parent group row so often?

Conclusion

Choosing a system to replicate your data across a distributed system requires tradeoffs.  Most of us are used to the tradeoffs we accept when deploying conventional stand-alone MySQL Innodb with asynchronous slaves.  We may not think about those tradeoffs, but we're making them (anyone obsessively checking slave position to ensure it's caught up with the master?).
Synchronous replication with PXC and Galera is no different in that there are trade-offs; they just aren't what we commonly expect.

If Callaghan's law is going to cause you trouble and you are not prepared to adapt to work with it, then PXC/Galera synchronous replication is probably not right for you.