MongoDB vs Cassandra

Over the 2 years we’ve been using MongoDB in production with our server monitoring tool, Server Density, we’ve built up significant experience and knowledge about how it works. Back in 2009 when I was looking at a replacement for MySQL I looked at Cassandra but dismissed it because MongoDB had several advantages, and Cassandra was still extremely early stage (even more so than MongoDB at the time). Having been invited to give a comparison at the Cassandra London Meetup, I thought I’d revisit it to see how it compares today.

Disclaimer: It’s important to note that much of what I know about MongoDB has been learnt through using it in production. We don’t use Cassandra so any comparisons are going to be fairly superficial, but they will still be relevant because that’s the stage most people will be in when they are considering which database to pick. As a result, I will try to avoid making technical comparisons about specific features because these would be biased by my extensive understanding of MongoDB vs a limited understanding of Cassandra.

As such, this comparison is split into 2 types of difference – usage and operations.

  • Usage: The actual usage as a developer implementing the application with the database.
  • Operations: Points which are not directly about the core database but its suitability for production and management on an operational level.

That said, I will start with several technical comparisons because these are important to understand.

Usage – Structure

MongoDB acts much like a relational database. Its data model consists of a database at the top level, then collections, which are like tables in MySQL (for example), and then documents contained within each collection, like rows in MySQL. Each document is made up of fields and values, similar to columns and values in MySQL. Fields can be simple key / value e.g. { 'name': 'David Mytton' } but they can also contain other documents e.g. { 'name': { 'first' : 'David', 'last' : 'Mytton' } }.

In Cassandra documents are known as “columns” which are really just a single key and value. e.g. { 'key': 'name', 'value': 'David Mytton' }. There’s also a timestamp field which is for internal replication and consistency. The value can be a single value but can also contain another “column”. These columns then exist within column families which order data based on a specific value in the columns, referenced by a key. At the top level there is a keyspace, which is similar to the MongoDB database.
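To make the two layouts easier to compare, here is a rough sketch using plain Python dictionaries rather than either database's actual API. The names `mongo_database`, `cassandra_keyspace` and `get_column` are invented for illustration, and the Cassandra side is a simplification that assumes rows are looked up by a row key and hold named columns, each carrying a value and a timestamp.

```python
# Illustrative only: plain dicts standing in for each data model.

# MongoDB style: database -> collection -> documents with (possibly nested) fields.
mongo_database = {
    "users": [  # a collection
        {"_id": 1, "name": {"first": "David", "last": "Mytton"}},
    ]
}

# Cassandra style: keyspace -> column family -> row key -> columns.
# Each column is a name mapped to a (value, timestamp) pair.
cassandra_keyspace = {
    "users": {  # a column family
        "dmytton": {  # hypothetical row key
            "first_name": ("David", 1300000000),
            "last_name": ("Mytton", 1300000000),
        }
    }
}


def get_column(keyspace, column_family, row_key, column_name):
    """Return just the value of one column, dropping the timestamp."""
    value, _timestamp = keyspace[column_family][row_key][column_name]
    return value


first_from_mongo = mongo_database["users"][0]["name"]["first"]
first_from_cassandra = get_column(cassandra_keyspace, "users", "dmytton", "first_name")
```

Both lookups reach the same piece of data; the difference is in how many levels of structure sit between the top-level container and the value.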

A good set of data model diagrams for Cassandra can be found here.

Usage – Indexes

MongoDB indexes work very much like those in relational databases. You create single or compound indexes on the collection level and every document inserted into that collection has those fields indexed. Querying by index is extremely fast so long as you have all your indexes in memory.

Prior to version 0.7, Cassandra was essentially a key/value store, so if you wanted to query by the contents of a key (i.e. the value) you needed to create a separate column which referenced the other columns, i.e. you created your own indexes. This changed in Cassandra 0.7, which allowed secondary indexes on column values, but only through the column families mechanism.

Cassandra requires a lot more metadata for indexes and requires secondary indexes if you want to do range queries. For example, if we define a new column family with 1 index:

$ bin/cassandra-cli --host localhost
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] create keyspace demo;
[default@unknown] use demo;
[default@demo] create column family users with comparator=UTF8Type
...	and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
...	{column_name: birth_date, validation_class: LongType, index_type: KEYS}];

then a range query fails, because no indexed column is compared with the equality operator:

[default@demo] get users where state = 'UT' and birth_date > 1970;
No indexed columns present in index clause with operator EQ

We must create a secondary index:

update column family users with comparator=UTF8Type
...	and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
...	{column_name: birth_date, validation_class: LongType, index_type: KEYS},
...	{column_name: state, validation_class: UTF8Type, index_type: KEYS}];

Then Cassandra can use the state index for the primary lookup and filter the results based on the birth_date:

get users where state = 'UT' and birth_date > 1970;

(Code samples taken from this blog post).
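The lookup pattern described above, using an equality index on one column to narrow the candidates and then filtering on a second column, can be sketched in plain Python. This is only a toy model of the idea, not how Cassandra implements its indexes; the sample rows and the helper names `build_index` and `query` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows keyed by row key; columns mirror the CLI example.
users = {
    "alice": {"full_name": "Alice Example", "state": "UT", "birth_date": 1975},
    "bob": {"full_name": "Bob Example", "state": "WI", "birth_date": 1973},
    "carol": {"full_name": "Carol Example", "state": "UT", "birth_date": 1968},
}


def build_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        index[columns[column]].add(row_key)
    return index


def query(rows, index, equals_value, range_column, greater_than):
    """Use the equality index to narrow candidates, then filter by range."""
    candidates = index.get(equals_value, set())
    return sorted(
        key for key in candidates if rows[key][range_column] > greater_than
    )


state_index = build_index(users, "state")
matches = query(users, state_index, "UT", "birth_date", 1970)
```

The equality match does the heavy lifting: only rows with state 'UT' are examined, and the range condition is applied to that smaller set.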

Usage – Deployment

MongoDB is written in C++ and provided in binary form for Linux, OS X, Windows and several other platforms. It’s extremely easy to “install” – download, extract and run mongod.

Cassandra is written in Java and has the overhead that brings, but also the easy ability to integrate into existing Java projects. It takes a little longer to get started but there is a demonstration of setting up a 4 node cluster in less than 2 minutes, which you’d struggle to beat with MongoDB.

I know plenty of people running MongoDB on Windows but would be interested to hear if that’s the same with Cassandra (I suspect it’s more Linux).

Operations/Usage – Consistency/Replication

In MongoDB replication is achieved through replica sets. This is an enhanced master/slave model in which one node in the set is the master. Data is replicated to all nodes so that if the master fails, another member will take over. There are configuration options to determine which nodes have priority and you can set options like sync delay to have nodes lag behind (for disaster recovery, for example).
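As a rough illustration of how priority can influence which member takes over, here is a deliberately simplified sketch. It is not MongoDB's actual election protocol; the function `pick_new_master` and the member records are hypothetical, and the only assumption borrowed from MongoDB is that a member with priority 0 is never chosen as master.

```python
def pick_new_master(members):
    """Return the name of the healthy member with the highest priority, or None."""
    eligible = [m for m in members if m["healthy"] and m["priority"] > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda m: m["priority"])["name"]


members = [
    {"name": "db1", "priority": 2, "healthy": False},  # the failed master
    {"name": "db2", "priority": 1, "healthy": True},
    {"name": "db3", "priority": 0, "healthy": True},   # e.g. a delayed node kept for recovery
]
new_master = pick_new_master(members)
```

In this toy model the delayed node keeps a copy of the data but is never promoted, which matches the disaster recovery use described above.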

Writes in MongoDB are “unsafe” by default; data isn’t written to disk right away, so it’s possible that a write operation could return success but be lost if the server fails before the data is flushed to disk. This is how Mongo attains high performance. If you need increased durability then you can specify a safe write, which will guarantee the data is written to disk before returning. Further, you can require that the data also be successfully written to n replication slaves.
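The trade-off between the default and safe write modes can be shown with a toy model. This is a sketch under stated assumptions, not driver code: `ToyServer` and its `safe` parameter are invented for this example and simply model a server that acknowledges writes held in memory and only makes them durable on flush.

```python
class ToyServer:
    """Toy model: writes land in memory first and reach 'disk' only on flush."""

    def __init__(self):
        self.memory = []  # acknowledged but not yet durable
        self.disk = []    # durable

    def write(self, doc, safe=False):
        self.memory.append(doc)
        if safe:
            self.flush()  # a safe write waits for the flush before returning
        return True       # either way the caller sees success

    def flush(self):
        self.disk.extend(self.memory)
        self.memory.clear()

    def crash(self):
        self.memory.clear()  # anything not flushed is lost


server = ToyServer()
server.write({"metric": "load", "value": 0.5}, safe=True)
server.write({"metric": "load", "value": 0.7})  # default mode: not flushed yet
server.crash()
surviving = server.disk
```

Both calls reported success, but only the safe write survives the crash. The cost in a real system is that every safe write waits on the disk.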

MongoDB drivers also support the ability to read from slaves. This can be done on a connection, database, collection or even query level and the drivers handle sending the right queries to the right slaves, but there is no guarantee of consistency (unless you are using the option to write to all slaves before returning). In contrast Cassandra queries go to every node and the most up to date column is returned (based on the timestamp value).
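The timestamp comparison mentioned above can be sketched as a "last write wins" rule. This is a minimal illustration rather than Cassandra's read path: the function `latest_column` and the sample responses are hypothetical, and the column layout follows the key/value/timestamp shape described earlier in the post.

```python
def latest_column(replica_responses):
    """Pick the response with the highest timestamp (last write wins)."""
    return max(replica_responses, key=lambda column: column["timestamp"])


# Hypothetical answers from three replicas for the same column.
responses = [
    {"key": "name", "value": "David", "timestamp": 100},
    {"key": "name", "value": "David Mytton", "timestamp": 250},
    {"key": "name", "value": "D. Mytton", "timestamp": 180},
]
winner = latest_column(responses)
```

Whichever replica holds the newest timestamp supplies the answer, so a stale replica does not affect the result as long as at least one queried replica has the latest write.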

Cassandra has much more advanced support for replication by being aware of the network topology. The server can be set to use a specific consistency level to ensure that queries are replicated locally, or to remote data centres. This means you can let Cassandra handle redundancy across nodes where it is aware of which rack and data centre those nodes are on. Cassandra can also monitor nodes and route queries away from “slow” responding nodes.

The only disadvantage with Cassandra is that these settings are done on a node level with configuration files, whereas MongoDB allows very granular ad-hoc control down to the query level through driver options which can be called in code at run time.

Operations – Who’s behind it?

Both Cassandra (Apache 2.0 license) and MongoDB (AGPL) are open source. You can freely download the code, write patches and submit them upstream. However, Cassandra is purely an open source project whereas MongoDB is “owned” by a commercial company, 10gen. The original authors of MongoDB are core contributors to the code and work for 10gen (indeed, 10gen was founded specifically to support MongoDB and the CEO and CTO are the original creators).

In contrast, Cassandra was created by 2 engineers from Facebook and is now developed under the Apache Software Foundation. This is not a disadvantage (indeed, the Apache Web server used by the majority of websites has similar roots and is part of the Apache Foundation) but is important to understand when it comes to support, ongoing development and the community (below).

Operations – Support

Although there are independent consultants for MongoDB, the best place to get support is from 10gen themselves because they wrote the database so they know it best. They’re able to provide support contracts with phone and e-mail SLAs.

In contrast, Cassandra has several companies offering commercial support and whilst they do have committers to the core Cassandra code, I’d argue it’s not the same as having access to the entire engineering team and original authors from a single contact point, as is the case with MongoDB.

Operations – Ongoing development

Interacting directly with the company that controls the main project, especially for support purposes, means you can have bug fixes and changes implemented to the code base. We’ve had numerous fixes committed as a result of problems discovered in our production usage of MongoDB. We pay 10gen for support now but even before we did they were very responsive to bugs. We also get votes for features and improvements.

In theory this is the same in Cassandra – you’d want bugs to be fixed and features implemented, but that doesn’t have to happen because of the nature of open source projects run by volunteers (this becomes more complex when companies are paying developers to work on the project e.g. Eric Evans from Rackspace working on Cassandra full time).

Of course there is a risk that the company behind the project disappears and all the engineers move on somewhere else but the project is still open source and this is the same with any piece of software you might use.

You could also argue there is more direction and focus from a commercial company working solely on the product (and more engineers dedicated to it) but I don’t want to go any further with this point as this post isn’t about open source vs commercial. This is just one point to be aware of.

Operations – Documentation

The official Cassandra documentation is poor. While researching this post I had to visit several websites and watch videos even to get explanations for key concepts like indexes. There is better documentation from Datastax but that is still lacking in explaining concepts in any depth.

The MongoDB documentation was good when I first looked at it but is even better nowadays. It’s actually kept up to date and covers all the features, with examples. Nobody likes writing documentation and it shows with many open source projects; another advantage of having a company behind the project, forcing developers to write the docs! Incidentally, one of the biggest advantages of the PHP language is the extensive documentation, examples and user submitted notes.

When you’re using a completely new data store then documentation is important, and is one of the reasons why I chose MongoDB back in 2009.

Operations – Community

MongoDB has to be a case study in how to build a community around a product. There have been almost 40 MongoDB conferences in the last year, and there is a very active mailing list along with user groups around the world. You know you’re well known when a phrase like “web scale” is associated with your product (as a parody). Again, this is because there is a company behind the product actively promoting it and encouraging and managing these events.

Cassandra has had 1 conference in that time, and whilst there are user groups (I presented this talk at the London one) it’s certainly not on the same scale as MongoDB.

Does that matter? None of that existed when we chose MongoDB so we learnt everything ourselves. But for new users today, there’s a huge forum of people who are using MongoDB and sharing their knowledge freely and in an easily accessible way.

Operations/Usage – Drivers

The other main reason I chose MongoDB was the driver support. All the key drivers for MongoDB were available and, most importantly, maintained by 10gen themselves. MongoDB has official drivers for C, C#, C++, Erlang, JavaScript, Java, Perl, PHP, Python, Ruby and Scala. All fully supported.

The Python and PHP drivers were most important to us but we also use the C# driver in our Windows monitoring agent. Having these well maintained, just like the core server, makes a massive difference.

Cassandra only has official Java and Python drivers with a few others written by 3rd parties. I’ve found that Python is usually well catered for when it comes to libraries that work well. PHP is another story and we’ve had issues with RabbitMQ and ZeroMQ in the past (specifically not working well under heavy load; they all work fine for playing around). Good PHP libraries are hard to come by.

Conclusion

There is no conclusion. This post isn’t about which is best, it’s about comparing the two. Both have advantages and disadvantages and to truly compare you need to run them both in production under significant load for a long period of time. MongoDB has worked well for us and has proven itself at scale, with the flexibility to do things like building a queueing system as well as being the main data store for our server monitoring service.

For me, the operational considerations play a major part in making a decision because these types of databases are so new. I would suspect they’re also important to companies looking to adopt this technology. We don’t need a support contract for Apache, for example, because it’s so well proven. Our support contract with 10gen has been well worth the money!

Other references

  • http://www.datastax.com Jonathan Ellis

    Thanks for taking a stab at this, David!

    I can help clear some things up on the Cassandra side.

    - The closest thing to a relational row (or a MongoDB document) in Cassandra is the row, not the column. In other words, Table:row ~ Collection:document ~ ColumnFamily:row.
    - I’m not sure what “Cassandra … allowed indexes only through the column families mechanism” is supposed to mean. But other than that, the indexing information is correct; I think your source knew what he was talking about. :)
    - There’s no need to perpetuate the “Java = bloat” fallacy. Yes, the JVM allocates a lot of memory on startup; that’s just an artifact of how modern garbage collection works.
    - Speaking of memory, it’s worth pointing out that Cassandra deals with larger-than-memory datasets (or indexes) much better than MongoDB. (http://www.slideshare.net/emiltamas/scaling-with-mongo-db-with-notes/10)
    - As Jake’s video demonstrates, Cassandra’s fully-distributed design means it’s vastly simpler to set up and operate than MongoDB in a clustered setup. In a distributed system, simplicity is directly related to reliability: with four types of nodes (config servers, mongos, mongod masters and mongod slaves) there are a lot more failure scenarios to worry about with MongoDB, some of which can take out your cluster entirely.
    - Cassandra has nothing corresponding to the global write lock that makes real-world MongoDB performance much worse than suggested by simple benchmarks.
    - I’m surprised that it didn’t make your list of operational considerations that Cassandra doesn’t need to take a node offline (!) to clean up fragmentation.
    - You’re right, most Cassandra deployments are on Linux. I suspect this is because Linux is simply easier to manage at the scale of dozens or hundreds of machines; however, Cassandra does run fine on Windows and is in production in that environment as well.
    - I’m not sure what you’re thinking of when you say that Cassandra queries are configured at the node level. Perhaps you’re referring to node location information? It stands to reason that a node can only exist in one datacenter at a time. ConsistencyLevel, however, covered by Ben in the excellent presentation you linked, IS per-query.
    - I’ll have to disagree with your implication that DataStax is not an adequate source of support for Cassandra simply because Cassandra’s membership at Apache means that others can contribute to it on an equal footing with us. Of course, I may be biased. But I think Red Hat and others might disagree with that line of reasoning, too.
    - The 40:1 conferences calculation is highly misleading; the linked list of “mongodb conferences” includes database-agnostic conferences like OSCON and DrupalCon that also included talks on Cassandra. If you compare apples to apples in slideshare decks, mailing list activity, or job listings, it’s hard to avoid the conclusion that both databases have active, vibrant communities.
    - Cassandra’s next-gen CQL drivers were first released just a month ago (http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language); Ruby, PHP, and node.js drivers are already far along. In the meantime, as you pointed out there are plenty of options for the old Thrift-based API, which will continue to be supported.

    • http://www.serverdensity.com David Mytton

      I’m not sure what “Cassandra … allowed indexes only through the column families mechanism” is supposed to mean.

      This means unlike in MongoDB where you can create an index on the collection level and all documents with that field will be indexed (so you can just do an ad-hoc query against them), Cassandra requires you use the column families to define your indexes. i.e. it’s more involved and requires more thought about the structure.

      with four types of nodes (config servers, mongos, mongod masters and mongod slaves) there are a lot more failure scenarios to worry about with MongoDB, some of which can take out your cluster entirely.

      If you follow the documentation failure of any of these will not take out your cluster. Failure of config servers causes admin activity to pause i.e. rebalancing data, which isn’t a major problem. You are supposed to have mongos on each application node and should have several of these anyway for redundancy, and the same with mongod which should be set up in replica sets. mongod master and slave are exactly the same in terms of setup, there’s no difference except one will be master at a given time. Indeed there are actually only 2 binaries to run – mongod for everything (including the config server) and mongos as the router process.

      I’m surprised that it didn’t make your list of operational considerations that Cassandra doesn’t need to take a node offline (!) to clean up fragmentation.

      Yes, this is annoying but Mongo handles deletes/removes much better than Cassandra[1] so this isn’t as big of a problem. It used to be in older versions but has improved.

      I’ll have to disagree with your implication that DataStax is not an adequate source of support for Cassandra simply because Cassandra’s membership at Apache means that others can contribute to it on an equal footing with us. Of course, I may be biased. But I think Red Hat and others might disagree with that line of reasoning, too.

      I wasn’t implying that it’s not an adequate source; rather, it’s better to have a source with full control of the project so you don’t get caught up with project politics / governance issues (specific to open source) but also have a full engineering team dedicated to working on the project, as is the case with 10gen.

      You can’t compare Cassandra/MongoDB to Red Hat. The organisation sizes are totally different, as are the projects – a database product vs an entire OS.

      The 40:1 conferences calculation is highly misleading; the linked list of “mongodb conferences” includes database-agnostic conferences like OSCON and DrupalCon that also included talks on Cassandra.

      It’s not misleading. Whilst there are more general conferences such as OSCON and DrupalCon, if you look at the full list the majority are Mongo specific. As of now there are 31 Mongo conferences on that page.

      [1] https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

  • Ion

    May I know how many records you had in MySQL before you moved to Mongo?