Posted to user@cassandra.apache.org by Adam Fisk <a...@littleshoot.org> on 2009/11/22 00:50:22 UTC

cassandra over hbase

I'm trying to navigate the rapidly shifting tides in NoSQL land, and
I'm particularly struggling with using Cassandra versus HBase. They
functionally seem quite similar to me even if the implementations are
quite different.

What would people on the list say are the primary reasons to use
Cassandra over HBase? HA and speed are very important for my
application. HBase's tighter integration with Hadoop and therefore
easier reporting and analytics using M/R appeals to me, but I
intuitively prefer the Cassandra community and generally like the
architectural approach. HBase's Hadoop foundations also strike me as
both an advantage and a disadvantage, as it seems to tie their hands a
bit.

Thanks for any advice you can give!

-Adam

-- 
Adam Fisk
http://www.littleshoot.org | http://adamfisk.wordpress.com |
http://twitter.com/adamfisk

Re: cassandra over hbase

Posted by Adam Fisk <a...@littleshoot.org>.

Thanks for all the helpful responses, everyone. I've honestly been
going back and forth a lot with this decision, but it's surprising how
much of a difference the usability of Cassandra from an install and
interface perspective really makes, even for techies like us. The
HBase command line throws all sorts of scary exceptions even when it's
really working.

Cassandra's quick setup makes a real difference for a company on a
tight deadline. That's not to imply Cassandra can't go toe to toe with
HBase on the merits of its internals; it's more to say that the extra
effort put into usability is well worth it in terms of building the
Cassandra community.

Nice work, and thanks again!

-Adam


On Tue, Nov 24, 2009 at 10:56 AM, Stu Hood <st...@rackspace.com> wrote:
>> JR> After chatting with some Facebook guys, we realized that one potential
>> JR> benefit from using HDFS is that the recovery from losing partial data in a
>> JR> node is more efficient. Suppose that one lost a single disk at a node. HDFS
>> JR> can quickly rebuild the blocks on the failed disk in parallel.
>
> HDFS replicates eagerly, which means that having a node down for longer than a timeout period will also mean that you do more work than you needed. Cassandra replicates (very) lazily, and I prefer laziness for the sake of efficiency.
>
>> JR> So, when this happens, the whole node probably has to be taken out
>> JR> and bootstrapped. The same problem exists when a single sstable file
>> JR> is corrupted.
>> I think recovering a single sstable is a useful thing, and it seems like
>> a better problem to solve.
>
> This is why we need to get #193 in. Going to the filesystem and deleting/fuzzing an SSTable on a node and then running a repair will cause a new SSTable to be created that overlays and repairs the first based on the data from the other nodes.
>
> Thanks,
> Stu
>
> -----Original Message-----
> From: "Ted Zlatanov" <tz...@lifelogs.com>
> Sent: Tuesday, November 24, 2009 8:40am
> To: cassandra-user@incubator.apache.org
> Subject: Re: cassandra over hbase
>
> On Mon, 23 Nov 2009 11:58:08 -0800 Jun Rao <ju...@almaden.ibm.com> wrote:
>
> JR> After chatting with some Facebook guys, we realized that one potential
> JR> benefit from using HDFS is that the recovery from losing partial data in a
> JR> node is more efficient. Suppose that one lost a single disk at a node. HDFS
> JR> can quickly rebuild the blocks on the failed disk in parallel. This is a
> JR> bit hard to do in cassandra, since we can't easily find the data on the
> JR> failed disk from another node.
>
> This is an architectural issue, right?  IIUC Cassandra simply doesn't
> care about disks.  I think that's a plus, actually, because it
> simplifies the code and filesystems in my experience are better left up
> to the OS.  For instance, we're evaluating Lustre and for many specific
> reasons it's significantly better for our needs than HDFS, so HDFS would
> be a tough sell.
>
> JR> So, when this happens, the whole node probably has to be taken out
> JR> and bootstrapped. The same problem exists when a single sstable file
> JR> is corrupted.
>
> I think recovering a single sstable is a useful thing, and it seems like
> a better problem to solve.
>
> Ted



-- 
Adam Fisk
http://www.littleshoot.org | http://adamfisk.wordpress.com |
http://twitter.com/adamfisk

Re: cassandra over hbase

Posted by Stu Hood <st...@rackspace.com>.

> JR> After chatting with some Facebook guys, we realized that one potential
> JR> benefit from using HDFS is that the recovery from losing partial data in a
> JR> node is more efficient. Suppose that one lost a single disk at a node. HDFS
> JR> can quickly rebuild the blocks on the failed disk in parallel.

HDFS replicates eagerly, which means that having a node down for longer than a timeout period will also mean that you do more work than you needed. Cassandra replicates (very) lazily, and I prefer laziness for the sake of efficiency.

> JR> So, when this happens, the whole node probably has to be taken out
> JR> and bootstrapped. The same problem exists when a single sstable file
> JR> is corrupted.
> I think recovering a single sstable is a useful thing, and it seems like
> a better problem to solve.

This is why we need to get #193 in. Going to the filesystem and deleting/fuzzing an SSTable on a node and then running a repair will cause a new SSTable to be created that overlays and repairs the first based on the data from the other nodes.
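
To make the "overlay" idea concrete, here is a minimal Python sketch. It is not the #193 implementation. It assumes only that each value carries a write timestamp and that a read reconciles several tables by keeping the newest version of each key. The function name, the table layout, and the sample data are all hypothetical.

```python
def merge_sstables(*tables):
    """Merge tables shaped like {key: (timestamp, value)}.

    For each key the entry with the highest timestamp wins. On a tie,
    the entry from the earlier table is kept (an arbitrary choice made
    for this sketch).
    """
    merged = {}
    for table in tables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged


# Hypothetical local table after damage: "b" is stale and "c" is missing.
local = {"a": (1, "a-v1"), "b": (2, "b-stale")}
# Hypothetical table built by a repair from the other replicas.
repaired = {"b": (3, "b-v2"), "c": (1, "c-v1")}

view = merge_sstables(local, repaired)
for key in sorted(view):
    print(key, view[key])
```

The damaged table is never rewritten in place here; the repaired table simply sits on top of it and wins wherever it holds newer data.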

Thanks,
Stu

-----Original Message-----
From: "Ted Zlatanov" <tz...@lifelogs.com>
Sent: Tuesday, November 24, 2009 8:40am
To: cassandra-user@incubator.apache.org
Subject: Re: cassandra over hbase

On Mon, 23 Nov 2009 11:58:08 -0800 Jun Rao <ju...@almaden.ibm.com> wrote: 

JR> After chatting with some Facebook guys, we realized that one potential
JR> benefit from using HDFS is that the recovery from losing partial data in a
JR> node is more efficient. Suppose that one lost a single disk at a node. HDFS
JR> can quickly rebuild the blocks on the failed disk in parallel. This is a
JR> bit hard to do in cassandra, since we can't easily find the data on the
JR> failed disk from another node. 

This is an architectural issue, right?  IIUC Cassandra simply doesn't
care about disks.  I think that's a plus, actually, because it
simplifies the code and filesystems in my experience are better left up
to the OS.  For instance, we're evaluating Lustre and for many specific
reasons it's significantly better for our needs than HDFS, so HDFS would
be a tough sell.

JR> So, when this happens, the whole node probably has to be taken out
JR> and bootstrapped. The same problem exists when a single sstable file
JR> is corrupted.

I think recovering a single sstable is a useful thing, and it seems like
a better problem to solve.

Ted

Re: cassandra over hbase

Posted by Ted Zlatanov <tz...@lifelogs.com>.

On Mon, 23 Nov 2009 11:58:08 -0800 Jun Rao <ju...@almaden.ibm.com> wrote: 

JR> After chatting with some Facebook guys, we realized that one potential
JR> benefit from using HDFS is that the recovery from losing partial data in a
JR> node is more efficient. Suppose that one lost a single disk at a node. HDFS
JR> can quickly rebuild the blocks on the failed disk in parallel. This is a
JR> bit hard to do in cassandra, since we can't easily find the data on the
JR> failed disk from another node. 

This is an architectural issue, right?  IIUC Cassandra simply doesn't
care about disks.  I think that's a plus, actually, because it
simplifies the code and filesystems in my experience are better left up
to the OS.  For instance, we're evaluating Lustre and for many specific
reasons it's significantly better for our needs than HDFS, so HDFS would
be a tough sell.

JR> So, when this happens, the whole node probably has to be taken out
JR> and bootstrapped. The same problem exists when a single sstable file
JR> is corrupted.

I think recovering a single sstable is a useful thing, and it seems like
a better problem to solve.

Ted

Re: cassandra over hbase

Posted by Jun Rao <ju...@almaden.ibm.com>.

After chatting with some Facebook guys, we realized that one potential
benefit from using HDFS is that the recovery from losing partial data in a
node is more efficient. Suppose that one lost a single disk at a node. HDFS
can quickly rebuild the blocks on the failed disk in parallel. This is a
bit hard to do in cassandra, since we can't easily find the data on the
failed disk from another node. So, when this happens, the whole node
probably has to be taken out and bootstrapped. The same problem exists when
a single sstable file is corrupted.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com


adamfisk@gmail.com wrote on 11/21/2009 03:50:22 PM:

> cassandra over hbase
>
> Adam Fisk
>
> to:
>
> cassandra-user
>
> 11/21/2009 03:51 PM
>
> Sent by:
>
> adamfisk@gmail.com
>
> Please respond to cassandra-user
>
>
> I'm trying to navigate the rapidly shifting tides in NoSQL land, and
> I'm particularly struggling with using Cassandra versus HBase. They
> functionally seem quite similar to me even if the implementations are
> quite different.
>
> What would people on the list say are the primary reasons to use
> Cassandra over HBase? HA and speed are very important for my
> application. HBase's tighter integration with Hadoop and therefore
> easier reporting and analytics using M/R appeals to me, but I
> intuitively prefer the Cassandra community and generally like the
> architectural approach. HBase's Hadoop foundations also strike me as
> both an advantage and a disadvantage, as it seems to tie their hands a
> bit.
>
> Thanks for any advice you can give!
>
> -Adam
>
> --
> Adam Fisk
> http://www.littleshoot.org | http://adamfisk.wordpress.com |
> http://twitter.com/adamfisk

Re: cassandra over hbase

Posted by Jonathan Ellis <jb...@gmail.com>.

On Mon, Nov 23, 2009 at 10:02 AM, Eric Evans <ee...@rackspace.com> wrote:
>> What would people on the list say are the primary reasons to use
>> Cassandra over HBase? HA and speed are very important for my
>> application. HBase's tighter integration with Hadoop and therefore
>> easier reporting and analytics using M/R appeals to me, but I
>> intuitively prefer the Cassandra community and generally like the
>> architectural approach. HBase's Hadoop foundations also strike me as
>> both an advantage and a disadvantage, as it seems to tie their hands a
>> bit.
>
> For myself it would be:
>
> * The flexibility to choose between consistency and availability.
> * No single point of failure (every node is identical).
> * Linear scalability (i.e., 20 nodes give you 2x what 10 do, etc.).

I would add that "every node is identical" is a huge win in monitoring
and troubleshooting as well.

Other reasons to prefer Cassandra include clusters spanning multiple
data centers, and at the API level, Cassandra provides row slicing and
customizable CompareWith.
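
As a rough illustration of why the comparator matters for slicing, the Python sketch below models a row as a plain dict of column names to values. It is not the Cassandra API: `slice_row` and `compare_key` are hypothetical names, and `compare_key` only stands in for the role that CompareWith plays in choosing how column names are ordered.

```python
from bisect import bisect_left, bisect_right


def slice_row(columns, start, finish, compare_key=lambda name: name):
    """Return (name, value) pairs with start <= name <= finish.

    Ordering and range bounds are both evaluated under compare_key,
    so changing the comparator changes which columns a slice returns.
    """
    ordered = sorted(columns.items(), key=lambda kv: compare_key(kv[0]))
    keys = [compare_key(name) for name, _ in ordered]
    lo = bisect_left(keys, compare_key(start))
    hi = bisect_right(keys, compare_key(finish))
    return ordered[lo:hi]


row = {"10": "j", "9": "i", "2": "b", "100": "x"}

# Names compared as text: "10" and "100" sort before "2".
as_text = slice_row(row, "1", "2")
# Names compared as integers: 2 < 9 < 10 < 100.
as_long = slice_row(row, "2", "10", compare_key=int)

print([name for name, _ in as_text])
print([name for name, _ in as_long])
```

The same stored columns give different slices depending on the comparator, which is why the ordering has to be chosen to match the range queries you expect to run.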

-Jonathan

Re: cassandra over hbase

Posted by Eric Evans <ee...@rackspace.com>.

On Sat, 2009-11-21 at 15:50 -0800, Adam Fisk wrote:
> I'm trying to navigate the rapidly shifting tides in NoSQL land, and
> I'm particularly struggling with using Cassandra versus HBase. They
> functionally seem quite similar to me even if the implementations are
> quite different.
> 
> What would people on the list say are the primary reasons to use
> Cassandra over HBase? HA and speed are very important for my
> application. HBase's tighter integration with Hadoop and therefore
> easier reporting and analytics using M/R appeals to me, but I
> intuitively prefer the Cassandra community and generally like the
> architectural approach. HBase's Hadoop foundations also strike me as
> both an advantage and a disadvantage, as it seems to tie their hands a
> bit.

For myself it would be:

* The flexibility to choose between consistency and availability.
* No single point of failure (every node is identical).
* Linear scalability (i.e., 20 nodes give you 2x what 10 do, etc.).
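
The first and third points can be sketched with a little arithmetic. The Python below is an illustration only, not Cassandra code. It assumes the usual quorum model, in which data is stored on N replicas, a read waits for R of them and a write waits for W. All function names and the per-node throughput figure are hypothetical.

```python
def read_sees_latest_write(n, r, w):
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


def replicas_allowed_down(n, required):
    """How many of n replicas can be unavailable while an operation
    that needs `required` responses still succeeds."""
    return n - required


def ideal_throughput(nodes, per_node_ops):
    """Idealized linear scaling: total capacity grows with node count."""
    return nodes * per_node_ops


N = 3
for label, r, w in [("R=1, W=1", 1, 1), ("R=2, W=2", 2, 2), ("R=1, W=3", 1, 3)]:
    print(label,
          "consistent reads:", read_sees_latest_write(N, r, w),
          "reads survive", replicas_allowed_down(N, r), "down,",
          "writes survive", replicas_allowed_down(N, w), "down")

# The per-node figure is made up; only the ratio matters.
print(ideal_throughput(20, 1000) / ideal_throughput(10, 1000))
```

Lower R and W keep requests succeeding with more replicas down, at the cost of reads that may miss the latest write. Higher values trade that availability for consistency.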

There are some comparisons out there, some more reasonable than others,
I recommend this one:

http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/

-- 
Eric Evans
eevans@rackspace.com