TokuMX Fractal Tree(R) indexes, what are they?
With recent release of TokuMX 1.0, we’ve made some bold claims
about how fast TokuMX can run MongoDB workloads. In this post, I want
to dig into one of the big areas of improvement, write performance and
reduced I/O.
One of the innovations of TokuMX is that it eliminates a long-held rule of databases: to get good write performance, the working set of your indexes should fit in memory.
The standard reasoning goes along the lines of: if your indexes’
working set does not fit in memory, then your writes will induce I/O,
you will become I/O bound, and performance will suffer. So, either make
sure your indexes fit in memory, or make sure your indexes have an
insertion pattern that keeps the working set small, like right-most
insertions.
With TokuMX, THIS SIMPLY ISN’T TRUE. The
innovation of Fractal Tree indexes is that as your working set grows
larger than main memory, write performance stays consistent. This
innovation is why Fractal Tree indexes perform so well on write-heavy
benchmarks (for both MongoDB and MySQL).
So how does TokuMX achieve this write performance where
many other databases struggle? By replacing B-Trees, the predominant
storage data structure in many databases (MongoDB, MySQL, BerkeleyDB,
etc…) with Fractal Tree indexes, a write-optimized data structure.
What do we mean by a write-optimized data structure?
To understand what we mean, we first need to understand why
a B-Tree struggles when indexes no longer fit in memory. Below is a
picture of a B-tree.
A B-tree is a simple (and elegant) data structure. The
internal nodes store many pivots and pointers, and the leaf nodes store
all the data. To insert into a B-tree, one must traverse to the leaf
node where the data belongs, and place the data into the leaf node. If
all of the data fits in memory, this is fast. But if most of the data
does not fit in memory (as in the picture above, where only the internal
nodes and very few leaf nodes fit), then retrieving that leaf node will
require an I/O. In fact, nearly all insertions will incur an I/O. This is where the I/O bottleneck comes from. This is where the struggling write performance comes from.
If your hard disk can do on the order of a few hundred I/O’s per
second, then your B-tree can handle at most a few hundred insertions per
second. This is why MongoDB and MySQL struggle with iiBench, and users
are justifiably told to “keep the working set of indexes in memory”.
So why are Fractal Tree indexes so much better? In short, they drastically reduce the I/O. Here is how.
The key difference between Fractal Tree indexes and B-Trees
that explains the difference in write performance can be found in the
internal nodes:
- with B-trees, internal nodes store just pivots and pointers for each child
- with Fractal Tree indexes, internal nodes store pivots, pointers, and buffers for each child
Note in the picture above that in the internal node, for each child, there is a grey buffer.
The buffers batch up write operations, so that a write works as follows:
- in the root node, find out which child the write SHOULD traverse down
- serialize the pending operation into the buffer
- if the buffer associated with that child has space, return. If the node’s buffer has no space, flush the pending operations in the buffer down a level, thereby making space for future writes.
The flush of a buffer in the root node may cause a
cascading flush. That is, the flush in the root node may flood the child
with enough data such that now the child’s buffers are full, and the
child needs to flush. This keeps happening until data eventually flushes
all the way down to leaves.
So why does this algorithm result in such better
performance? The short answer is reduced I/O (really, it’s ALL about the
I/O). With I/O’s being so expensive, if we must do an I/O we want the
benefit we receive to be worth it. With B-trees, on a write, we do an
I/O to insert one measly document or row, or key/value pair. With
Fractal Tree indexes, by assuming the root node is always in memory, we
know that when we perform an I/O on a write, we do it to flush a
buffer’s worth of data. This may contain many many documents (or rows,
etc…). With each I/O servicing many more writes, Fractal Tree indexes
reduce the amount of I/O done by a LARGE factor, thereby eliminating the
I/O bottleneck that B-Trees have.
Because of this I/O reduction, Fractal Tree indexes don’t
require indexes to fit in memory, and TokuMX is able to achieve such
high sustained write performance on data that does not fit in memory.
Another interesting thing to note about these algorithmic
properties is that if the data resides in memory, then Fractal Tree
indexes are not providing any algorithmic advantage over B-Trees for
write performance. If everything fits in memory, then algorithmically,
both data structures are fast.
No comments:
Post a Comment