Monday, August 15, 2016

Multi-Master Replication with MySQL Cluster



There are a few asynchronous replication capabilities that are unique to MySQL Cluster. One is that changes are replicated even if they are made directly to the data nodes using the NDB API; another is that replication can be performed in both directions, i.e. multi-master. One advantage of this is that you can share both read and write operations across both sites so that no capacity is wasted; it also gives you confidence that either site is sane and ready to take over for the other at any point. You should aim to minimise how often the same rows are modified at the same time on both sites: conflict detection/resolution has been implemented, but it can only roll back the row changes that conflict with other row changes, rather than the full transaction.
It’s important to note that in this article, I’m not talking about the synchronous replication that takes place between data nodes within a single MySQL Cluster site (that happens in parallel and is orthogonal to the asynchronous replication to a remote site).

When performing multi-master asynchronous replication between 2 Clusters, 1 or more MySQL Servers in each Cluster are nominated as masters and 1 or more as slaves (the same server can take on both roles, and you can have multiple channels set up in case a channel or MySQL Server is lost). Changes made to the data at either site through any of its MySQL Servers (or directly to the data nodes using the NDB API) will be replicated to the other site.
I will focus on setting up the replication; you can refer to Deploying MySQL Cluster over multiple hosts for the steps to configure and run each MySQL Cluster site.

Most of the steps are very similar to those in the first example. The main differences come when introducing asynchronous replication to a MySQL Cluster instance that already contains data and is up and running (processing updates), which will be covered in a subsequent article.

Setting up MySQL Asynchronous Replication for High Availability


Asynchronous Replication for High Availability
MySQL replication is often positioned as a way to provide extra throughput for your database (especially when dealing with high numbers of read operations). What tends to be overlooked is how it can be used to provide high availability (HA): no matter how much redundancy you have at a local level, your system remains at risk from a single catastrophic failure such as loss of power, an earthquake or a terrorist attack. By using MySQL asynchronous replication to maintain a geographically remote copy of the data, service can be maintained through any single site failure.
As this replication is asynchronous, there are no low latency demands on the network and so the two sites can be thousands of miles apart while utilising low-cost connections.
This article provides a simple example of how to set up asynchronous replication between 2 MySQL databases in order to provide a highly available solution: first between 2 databases whose tables use the MyISAM storage engine, and then between 2 MySQL Cluster databases, where I also configure multi-master replication so that changes can be made at either site.
Subsequent articles will build on this to show:
  • Collision detection and resolution when using MySQL Cluster multi-master asynchronous replication
  • Introducing asynchronous replication to a running MySQL Cluster database
Throughout this article, two machines are used: ws1 (192.168.0.3) and ws2 (192.168.0.4). ws1 will be set up as the master and ws2 as the slave (in the multi-master configuration, both act as both master and slave).

Setting up replication for non-Cluster databases

Replication is performed from one MySQL Server to another; the master makes the changes available and then one or more other Servers pick up those changes and apply them to their own databases. In this example, both databases will store the table data using the same storage engine (MyISAM) but it’s possible to mix and match (for example, take a look at  MySQL Cluster – flexibility of replication). As this is intended as a simple introduction to replication, I’m keeping life simple by assuming that this is all being set up before the database goes into production – if that isn’t the case for you and you need to cope with existing data then check out the MySQL documentation or subsequent articles on this site.
The my.cnf files can be set up as normal, but the one for the MySQL Server that will act as the master needs to have binary logging enabled. Also, each of the servers needs a unique server-id. Here are the my.cnf files used for this example:
my.cnf (Master)
[mysqld]
datadir=/home/billy/mysql/myisam/data
basedir=/usr/local/mysql
port=3306
server-id=1   # must be unique for every server in the replication topology
log-bin       # enable the binary log (required on the master)
my.cnf (Slave)
[mysqld]
datadir=/home/billy/mysql/myisam/data
basedir=/usr/local/mysql
port=3306
server-id=2
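
On its own this configuration doesn't start replication; the slave still has to be pointed at the master. As a minimal sketch of the usual remaining steps (the account name, password and binlog coordinates below are placeholders: substitute the File and Position values reported by SHOW MASTER STATUS):

-- On the master (ws1): create an account for the slave to connect as
CREATE USER 'repl'@'192.168.0.4' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.0.4';
SHOW MASTER STATUS;   -- note the File and Position values

-- On the slave (ws2): point it at the master and start the slave threads
CHANGE MASTER TO
  MASTER_HOST='192.168.0.3',
  MASTER_PORT=3306,
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='ws1-bin.000001',   -- placeholder: File from SHOW MASTER STATUS
  MASTER_LOG_POS=120;                 -- placeholder: Position from SHOW MASTER STATUS
START SLAVE;
SHOW SLAVE STATUS\G

Check that Slave_IO_Running and Slave_SQL_Running both report Yes. For the multi-master configuration, repeat the CHANGE MASTER TO/START SLAVE steps in the opposite direction so that each server acts as both master and slave.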

General Overview of Multimaster Replication

Multimaster replication is a utility that allows data in multiple databases to be automatically kept in sync. For example, in a multimaster replication system, if a row gets inserted into one of the databases in the system, that row will be automatically propagated to all of the other databases in that system. Updates and deletes to the data in any of the databases will be propagated in the same way.
A multimaster replication environment is set up by configuring databases to be part of a “replication group”. One of the databases in the group is defined as the “master definition site,” and all of the other databases in the group are classified as “master sites.” The main difference between the two types of sites is that most of the replication administration commands must be invoked from the master definition site.

There are two basic ways that transactions get propagated to remote databases: “synchronously” and “asynchronously”. Synchronous replication applies each transaction to all the master sites in a group immediately, using Oracle’s two-phase commit functionality to ensure that every database in the group can apply the transaction. If any of the sites in the group cannot accept the transaction (for example because that site’s database has crashed, or the network connection to it is down), then none of the master sites in the replication group can accept it: the transaction cannot take place.
The way asynchronous replication works is that all the transactions that occur on a site are temporarily placed in a buffer, called the “deferred transaction queue,” or deftran queue. Periodically, such as once per minute, all of the transactions in a site’s deftran queue get sent to all of the other sites, by “push” jobs. These jobs get created by calling the “schedule_push” procedure. Finally, the transactions in a deftran queue that have already been sent to other sites must be periodically purged, to prevent the deftran queue from growing too large.
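As a rough illustration (the site name and intervals here are hypothetical; see the DBMS_DEFER_SYS package documentation for the full parameter lists), scheduling a push job and a purge job looks something like this:

BEGIN
  -- Push queued transactions to a remote master site about once per minute
  DBMS_DEFER_SYS.SCHEDULE_PUSH(
    destination => 'nyc.example.com',
    interval    => 'SYSDATE + 1/(24*60)',
    next_date   => SYSDATE);
  -- Purge already-propagated transactions from the deftran queue every hour
  DBMS_DEFER_SYS.SCHEDULE_PURGE(
    next_date => SYSDATE,
    interval  => 'SYSDATE + 1/24');
END;
/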

The vast majority of customer sites that use multimaster replication use asynchronous replication rather than synchronous. One of the reasons for this is that asynchronous replication has been available for a much longer time; the initial versions of multimaster replication only allowed for asynchronous propagation. The main reason that asynchronous is used, though, is because it has many advantages over synchronous.

First of all, asynchronous replication uses much less network bandwidth and provides higher performance than synchronous replication. The primary reason for this is that it is more efficient to store multiple transactions and then propagate them all as a group, rather than to propagate each transaction separately.

This is particularly important when the sites in question are very far apart geographically (such as having one site in San Francisco and another in New York). Another reason for these bandwidth and performance improvements is that there is much more overhead associated with synchronous replication because each and every transaction requires that separate connections be established to all of the other sites in the replication group. With asynchronous replication, fewer connections need to be established, since transactions are propagated as a group.

The biggest advantage of asynchronous replication, though, is that it provides for high availability of the replication group. With asynchronous replication, if one of the sites in the replication group crashes, all of the other sites will still be able to accept updates—the transactions that are made on the remaining sites will just “stack up” in those sites’ deftran queues until the down site becomes available.

On the other hand, with synchronous replication, if any one of the sites becomes unavailable (such as because of a database crash or a network failure) then none of the sites will be updatable. This is because with synchronous replication, each and every transaction must be able to be immediately applied to all of the sites in the replication group, and of course if a site is unreachable no transactions will be able to be applied to it. This means that not only does synchronous replication not provide any higher database availability, it can actually provide lower availability than using a single database!

Multi-Master Replication

If You *Must* Deploy Multi-Master Replication, Read This First

An increasing number of organizations run applications that depend on MySQL multi-master replication between remote sites.   I have worked on several such implementations recently.  This article summarizes the lessons from those experiences that seem most useful when deploying multi-master on existing as well as new applications.

Let's start by defining terms.  Multi-master replication means that applications update the same tables on different masters, and the changes replicate automatically between those masters.  Remote sites mean that the masters are separated by a wide area network (WAN), which implies high average network latency of 100ms or more.  WAN network latency is also characterized by a long tail, ranging from seconds due to congestion to hours or even days if a ship runs over the wrong undersea cable.

With the definitions in mind we can proceed to the lessons.  The list is not exhaustive but includes a few insights that may not be obvious if you are new to multi-master topologies.  Also, I have omitted issues like monitoring replication, using InnoDB to make slaves crash-safe, or provisioning new nodes.  If you use master/slave replication, you are likely familiar with these topics already.

1. Use the Right Replication Technology and Configure It Properly

The best overall tool for MySQL multi-master replication between sites is Tungsten.  The main reason for this assertion is that Tungsten uses a flexible, asynchronous, point-to-point, master/slave replication model that handles a wide variety of topologies such as star replication or all-to-all.  Even so, you have to configure Tungsten properly.  The following topology is currently my favorite:
  • All-to-all topology.  Each master replicates directly to every other master.  This handles prolonged network outages or replication failures well, because one or more masters can drop out without breaking  replication between the remaining masters or requiring reconfiguration.  When the broken master(s) return, replication just resumes on all sides.  All-to-all does not work well if you have a large number of masters.  
  • Updates are not logged on slaves.  This keeps master binlogs simple, which is helpful for debugging, and eliminates the possibility of loops.  It also requires some extra configuration if the masters have their own slaves, as would be the case in a Tungsten Enterprise cluster.
There are many ways to set up multi-master replication, and the right choice varies according to the number of masters, whether you have local clustering, and other considerations.  Giuseppe Maxia has described many topologies, and the Tungsten Cookbook has even more details.

One approach to treat with special caution is MySQL circular replication.  In topologies of three or more nodes, circular replication results in broken systems if one of the masters fails.  Also, you should be wary of any kind of synchronous multi-master replication across sites that are separated by more than 50 kilometers (i.e. 1-2ms latency).  Synchronous replication makes a siren-like promise of consistency, but the price you pay is slow performance under normal conditions and broken replication when WAN links go down.

2. Use Row-Based Replication to Avoid Data Drift

Replication depends on deterministic updates--a transaction that changes 10 rows on the original master should change exactly the same rows when it executes against a replica.  Unfortunately many SQL statements that are deterministic in master/slave replication are non-deterministic in multi-master topologies.  Consider the following example, which gives a 10% raise to employees in department #35.

   UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35;

If all masters add employees, then the number of employees who actually get the raise will vary depending on whether such additions have replicated to all masters.  Your servers will very likely become inconsistent with statement replication.  The fix is to enable row-based replication using binlog-format=row in my.cnf.  Row replication transfers the exact row updates from each master to the others and eliminates ambiguity.
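For reference, a minimal my.cnf fragment for each master (building on the server examples earlier in this post) would be:

[mysqld]
server-id=1          # unique per server
log-bin              # binary logging must be on for a master
binlog-format=ROW    # replicate row images instead of SQL text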

3. Prevent Key Collisions on INSERTs

For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not  collide between masters using the auto-increment-increment and auto-increment-offset parameters in my.cnf.  The following example ensures that auto-increment keys start at 1 and increment by 4 to give values like 1, 5, 9, etc. on this server.

server-id=1
auto-increment-offset = 1
auto-increment-increment = 4
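A second master keeps the same increment but uses a different offset, so its keys interleave with the first master's instead of colliding:

server-id=2
auto-increment-offset = 2
auto-increment-increment = 4

This server generates 2, 6, 10, etc. while the first generates 1, 5, 9, etc.; an increment of 4 leaves room for up to four masters (offsets 1 through 4).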
This works so long as your applications use auto-increment keys faithfully.  However, any table that either does not have a primary key or where the key is not an auto-increment field is suspect.  You need to hunt them down and ensure the application generates a proper key that does not collide across masters, for example using UUIDs or by putting the server ID into the key.   Here is a query on the MySQL information schema to help locate tables that do not have an auto-increment primary key. 
SELECT t.table_schema, t.table_name 
  FROM information_schema.tables t 
    WHERE NOT EXISTS 
      (SELECT * FROM information_schema.columns c
       WHERE t.table_schema = c.table_schema  
         AND t.table_name = c.table_name
         AND c.column_key = 'PRI'
         AND c.extra = 'auto_increment');

4. Beware of Semantic Conflicts in Applications

Neither Tungsten nor MySQL native replication can resolve conflicts, though we are starting to design this capability for Tungsten.  You need to avoid them in your applications.  Here are a few tips as you go about this.
First, avoid obvious conflicts.  These include inserting data with the same keys on different masters (described above), updating rows in two places at once, or deleting rows that are updated elsewhere.  Any of these can cause errors that will break replication or cause your masters to become out of sync.  The good news is that many of these problems are not hard to detect and eliminate using properly formatted transactions.  The bad news is that these are the easy conflicts.  There are others that are much harder to address.  
For example, accounting systems need to generate unbroken sequences of numbers for invoices.  A common approach is to use a table that holds the next invoice number and increment it in the same transaction that creates a new invoice.  Another accounting example is reports that need to read the value of accounts consistently, for example at monthly close.  Neither example works off-the-shelf in a multi-master system with asynchronous replication, as they both require some form of synchronization to ensure global consistency across masters.  These and other such cases may force substantial application changes.  Some applications simply do not work with multi-master topologies for this reason. 
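To make the invoice example concrete, here is a sketch of the usual single-database pattern (the table and column names are hypothetical). It is correct on one master, but the SELECT ... FOR UPDATE lock is purely local, so two masters running it concurrently can hand out the same invoice number:

START TRANSACTION;
-- Lock the counter row; this lock exists only on the local master
SELECT next_number INTO @invoice_no FROM invoice_counter FOR UPDATE;
UPDATE invoice_counter SET next_number = next_number + 1;
INSERT INTO invoice (invoice_number, customer_id, total)
  VALUES (@invoice_no, 42, 100.00);
COMMIT;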

5. Remove Triggers or Make Them Harmless
Triggers are a bane of replication.  They conflict with row replication if they run by accident on the slave.  They can also create strange conflicts due to weird behavior/bugs (like this) or other problems like needing definer accounts present.  MySQL native replication turns triggers off on slaves when using row replication, which is a very nice feature that prevents a lot of problems.  
Tungsten on the other hand cannot suppress slave-side triggers.  You must instead alter each trigger to add an IF statement that prevents the trigger from running on the slave.  The technique is described in the Tungsten Cookbook.  It is actually quite flexible and has some advantages for cleaning up data because you can also suppress trigger execution on the master.  
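The guard looks roughly like this (a hedged sketch: the session variable name and the updated_at column are made up for illustration, and the Tungsten Cookbook describes the exact convention the replicator expects):

DELIMITER //
CREATE TRIGGER emp_before_update BEFORE UPDATE ON emp
FOR EACH ROW
BEGIN
  -- Skip the body when the session is a replication apply thread
  IF @disable_triggers IS NULL THEN
    SET NEW.updated_at = NOW();
  END IF;
END//
DELIMITER ;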
You should regard all triggers with suspicion when moving to multi-master.  If you cannot eliminate triggers, at least find them, look at them carefully to ensure they do not generate conflicts, and test them very thoroughly before deployment.  Here's a query to help you hunt them down: 
SELECT trigger_schema, trigger_name 
  FROM information_schema.triggers;
6. Have a Plan for Sorting Out Mixed Up Data

Master/slave replication has its discontents, but at least sorting out messed up replicas is simple: re-provision from another slave or the master.  Not so with multi-master topologies--you can easily get into a situation where all masters have transactions you need to preserve, and the only way to sort things out is to track down differences and update masters directly.   Here are some thoughts on how to do this.
  1. Ensure you have tools to detect inconsistencies.  Tungsten has built-in consistency checking with the 'trepctl check' command.  You can also use the Percona Toolkit pt-table-checksum to find differences.  Be forewarned that neither of these works especially well on large tables and may give false results if more than one master is active when you run them.  
  2. Consider relaxing foreign key constraints.  I love foreign keys because they keep data in sync.  However, they can also create problems for fixing messed up data, because the constraints may break replication or make it difficult to go table-by-table when synchronizing across masters.  There is an argument for being a little more relaxed in multi-master settings. 
  3. Switch masters off if possible.  Fixing problems is a lot easier if you can quiesce applications on all but one master.  
  4. Know how to fix data.  Being handy with SQL is very helpful for fixing up problems.  I find SELECT INTO OUTFILE and LOAD DATA INFILE quite handy for moving changes between masters.  Don't forget SET SESSION SQL_LOG_BIN=0 to keep changes from being logged and breaking replication elsewhere (see the sketch after this list).  There are also various synchronization tools like pt-table-sync, but I do not know enough about them to make recommendations.  
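As a rough sketch of that last workflow (the file path and WHERE clause are placeholders):

-- On the master that already has the correct rows
SELECT * INTO OUTFILE '/tmp/emp_fix.csv'
  FROM emp WHERE dep_id = 35;

-- Copy the file to the inconsistent master, then load it there
SET SESSION SQL_LOG_BIN = 0;   -- keep the repair out of the binlog so it does not replicate
LOAD DATA INFILE '/tmp/emp_fix.csv' REPLACE INTO TABLE emp;
SET SESSION SQL_LOG_BIN = 1;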
At this point it's probably worth mentioning commercial support.  Unless you are a replication guru, it is very comforting to have somebody to call when you are dealing with messed up masters.  Even better, expert advice early on can help you avoid problems in the first place.

(Disclaimer:  My company sells support for Tungsten so I'm not unbiased.  That said, commercial outfits really earn their keep on problems like this.)

7. Test Everything

Cutting corners on testing for multi-master can really hurt.  This article has described a lot of things to look for, so put together a test plan and check for them.  Here are a few tips on procedure:
  1. Set up a realistic pre-prod test with production data snapshots.  
  2. Have a way to reset your test environment quickly from a single master, so you can get back to a consistent state to restart testing. 
  3. Run tests on all masters, not just one.  You never know if things are properly configured everywhere until you try. 
  4. Check data consistency after tests.  Quiesce your applications and run a consistency check to compare tables across masters. 
It is tempting to take shortcuts or slack off, so you'll need to find ways to improve your motivation.  If it helps, picture yourself explaining to the people you work for why your DBMS servers have conflicting data with broken replication, and the problem is getting worse because you cannot take applications offline to fix things.  It is a lot easier to ask for more time to test.  An even better approach is to hire great QA people and give them time to do the job right.

Summary

Before moving to a multi-master replication topology you should ask yourself whether the trouble is justified.  You can get many of the benefits of multi-master with system-of-record architectures with a lot less heartburn.  That said, an increasing number of applications do require full multi-master across multiple sites.  If you operate one of them, I hope this article is helpful in getting you deployed or improving what you already have.

Tungsten does a pretty good job of multi-master replication already, but I am optimistic we can make it much better.  There is a wealth of obvious features around conflict resolution, data repair, and up-front detection of problems that will make life better for Tungsten users and reduce our support load.  Plus I believe we can make it easier for developers to write applications that run on multi-master DBMS topologies.  You will see more about how we do this in future articles on this blog.

Sunday, August 7, 2016

MyRocks vs InnoDB

The headline results, from a 7-day Linkbench run (details below):
  • InnoDB writes between 8X and 14X more data to SSD per transaction than RocksDB
  • RocksDB sustains about 1.5X more QPS
  • Compressed/uncompressed InnoDB uses 2X/3X more SSD space than RocksDB
I encourage others to use long running benchmark tests and present IO efficiency metrics in addition to performance results.


Configuration


I used the same configuration as described in the previous post with one difference: for this test I ran 168 iterations of the query step and each step ran for 1 hour, so the test ran for 7 days while the previous test ran for 1 day. What I describe as QPS below is TPS (transactions/second), and when I write per query below I mean per transaction. The IO efficiency metrics are measured by iostat; each per-query metric is the iostat rate for the interval divided by the average QPS for that interval (for example, wKB/q = wKB/s ÷ QPS). I report the database size in GB at the end of each day - hours 1, 24, 48, 72, 96, 120, 144 and 168. For each one hour interval I collect:
  • average QPS
  • iostat reads per query (r/q)
  • iostat KB read per query (rKB/q)
  • iostat KB written per query (wKB/q)
  • iostat reads per second (r/s)
  • iostat KB read per second (rKB/s)
  • iostat KB written per second (wKB/s)
Tests were run for several binaries:
  • myrocks.zlib - Facebook MySQL 5.6, RocksDB with zlib compression
  • innodb56.none - upstream MySQL 5.6.26, InnoDB without compression
  • innodb57.none - upstream MySQL 5.7.10, InnoDB without compression
  • innodb56.zlib - upstream MySQL 5.6.26, InnoDB with zlib compression
  • innodb57.zlib - upstream MySQL 5.7.10, InnoDB with zlib compression

Better Compression


Compressed InnoDB uses about 2X more SSD space than MyRocks, and uncompressed InnoDB uses about 3.1X more. The graph shows the database size every 24 hours. Note that the database accumulates data as a function of the QPS rate, so MyRocks has more data than InnoDB after 168 hours: myrocks.zlib has 6.4% more rows than innodb57.none and 7.2% more rows than innodb57.zlib after 7 days.

Better Performance


I'd be happy if MyRocks matched the QPS from InnoDB and only beat it on IO efficiency, but it wins on both QPS and IO efficiency. The data below is QPS over time. MyRocks gets at least 1.5X more QPS than compressed InnoDB. It also does a lot better than uncompressed InnoDB, but who wants to use 3X more SSD space? The QPS growth at test start for InnoDB with zlib happens because there are stalls until the compressed b-tree pages fragment; I think this is a problem with mutexes in the pessimistic insert/update code for compressed InnoDB.
The graph below shows the average QPS from each 1-hour interval.

Better IO Efficiency


I present IO efficiency metrics here using data from iostat normalized by the QPS rate to show the amount of IO done per transaction.
This result is remarkable. InnoDB writes between 8X and 14X more to storage per transaction than MyRocks. This means that workloads can use TLC SSD with MyRocks when InnoDB requires MLC and that workloads can use SSD with MyRocks when SSD doesn't have sufficient endurance for InnoDB. Running a busy database on SSD is so much easier than using disk.
This result doesn't include the additional write-amplification from flash GC that occurs with InnoDB compared to MyRocks, because the MyRocks write pattern generates less work for flash GC. I previously described how that increased the flash write-rate by about 1.5X for InnoDB compared to RocksDB on the device I use. Taking that into account, the real difference on SSD might be that InnoDB writes 12X to 21X more to storage than MyRocks.

Using iostat metrics from hour 168, InnoDB writes 8.7, 10.5, 8.9 and 10.6 times more to storage per transaction than RocksDB. Using iostat data from hour 167, the difference is 11.5, 13.9, 11.7 and 14.0 times.

The graph below shows the number of KB written to SSD per transaction for each 1-hour interval.
The next graph shows the number of SSD reads per transaction. MyRocks has the smallest rate. While an LSM can have an IO penalty for range queries, and range queries are the most frequent operation in Linkbench, that isn't a problem for this workload.
The final graph shows the number of KB read from SSD per transaction. The rate for MyRocks is in between the rates for compressed and uncompressed InnoDB. The r/q rate is more important than this rate as long as the difference here isn't extreme. The rate for MyRocks includes reads done for user queries and reads done in the background for compaction.

Absolute iostat results


These graphs show absolute rates from iostat. The data is the average rate per 1-hour interval.

The first graph is the rate for iostat r/s. The rate is larger for MyRocks because it sustains the most QPS.
The next graph shows the rate for iostat wKB/s. Note that InnoDB sustains between 100 and 200 MB/s of writes; the SSD device is much busier doing writes for InnoDB than for MyRocks. IO capacity used for writes isn't available for reads even when endurance isn't an issue, and more writes mean more erases from flash GC, which are a source of IO stalls when a read gets stuck behind an erase on the same channel.
The last graph shows the rate for iostat rKB/s. The rates are larger for uncompressed InnoDB and MyRocks than for compressed InnoDB. The SSD is very busy if you combine the iostat rates for rKB/s and wKB/s.