Tuesday, January 2, 2018

Why use MySQL as the data store instead of NoSQLs such as Cassandra, MongoDB, or CouchDB?

  1. If you partition your data at the application level, MySQL scalability isn't an issue. Facebook reported [1] running 1800 MySQL servers with just two DBAs. You can't do joins across partitions, but the NoSQL databases don't allow this anyway. Facebook hasn't confirmed using Cassandra as the primary source for any data, and it seems like inbox search might be their only use of it. [2]
  2. These distributed databases like Cassandra, MongoDB, and CouchDB[3] aren't actually very scalable or stable. Twitter apparently has been trying to move from MySQL to Cassandra for over a year. When someone reports using one of these systems as their primary data store for over 1000 machines for over a year, I'll reconsider my opinion on this.

    << Update : Foursquare reported an 11-hour downtime because of MongoDB. [4] Separately, a friend's startup that was going through explosive growth tried to switch to MongoDB and gave up after a month due to instability. Twitter gave up on the Cassandra migration. [5] Facebook is moving away from Cassandra. [6] HBase is getting better but is still risky if you don't have people around with a deep understanding of it. [7] >>
  3. The primary online data store for an application is the worst place to take a risk with new technology. If you lose your database or there's corruption, it's a disaster that could be impossible to recover from. If you're not the developer of one of these new databases, and you're one of a very small number of companies using them at scale in production, you're at the mercy of the developer to fix bugs and handle scalability issues as they come up.
  4. You can actually get pretty far on a single MySQL database and not even have to worry about partitioning at the application level. You can "scale up" to a machine with lots of cores and tons of ram, plus a replica. If you have a layer of memcached servers in front of the databases (which are easy to scale out) then the database basically only has to worry about writes. You can also use S3 or some other distributed hash table to take the largest objects out of rows in the database. There's no need to burden yourself with making a system scale more than 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.
  5. Many of the problems created by manually partitioning the data over a large number of MySQL machines can be mitigated by creating a layer below the application and above MySQL that automatically distributes data. FriendFeed described a good example implementation of this [8].
  6. Personally, I believe the relational data model is the "right" way to structure most of the data for an application like Quora (and for most user-generated content sites). Schemas allow the data to persist in a typed manner across lots of new versions of the application as it's developed, they serve as documentation, and prevent a lot of bugs. And SQL lets you move the computation to the data as necessary rather than having to fetch a ton of data and post-process it in the application everywhere. I think the "NoSQL" fad will end when someone finally implements a distributed relational database with relaxed semantics.

--------
[1] http://www.datacenterknowledge.c...
[2] What portions of Facebook use Cassandra today? 
[3] How scalable is CouchDB in practice, not just in theory? 
[4] http://blog.foursquare.com/2010/...
[5] http://engineering.twitter.com/2...
[6] http://www.facebook.com/note.php...
[7] How stable is HBase? What are the hardware requirements for setting HBase?
[8] http://bret.appspot.com/entry/ho...

No comments:

Post a Comment