Posted to user@cassandra.apache.org by Russ Brown <pi...@gmail.com> on 2010/07/29 01:11:24 UTC

Evaluating Cassandra for our use case

Hi,

I'm looking at NoSQL solutions to replace a bespoke system
that we currently have in place. So far I think the best fit is
Cassandra, but I would like to get some feedback from those who know
it better before spending more time on it.

Our current system is geared to allowing our web servers to operate
very quickly and completely independently (for most pages) of other
servers. This is accomplished by keeping chunks of data about "things"
on each machine's disk with a file per entity. The key in this is
effectively the filename, with the value being the file's content. A
central server handles the initial generation (and subsequent updates)
of these files, and distribution to the web servers is carried out by
a combination of network share mounting and shell scripts.
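
A minimal sketch of the file-per-entity scheme described above, for readers who want something concrete. The class name, method names and key restriction are hypothetical; the real system's layout and update mechanism are not shown in this thread.

```python
import os
import tempfile
from pathlib import Path


class FilePerEntityStore:
    """Toy key-value store: one file per entity, the filename is the key."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Assumed restriction: keys must be plain filenames, never paths.
        if not key or key != os.path.basename(key) or key in (".", ".."):
            raise ValueError(f"invalid key: {key!r}")
        return self.root / key

    def put(self, key, value):
        # Write to a temporary file first, then rename it into place,
        # so a concurrent reader never sees a half-written file.
        path = self._path(key)
        fd, tmp = tempfile.mkstemp(dir=str(self.root))
        with os.fdopen(fd, "wb") as f:
            f.write(value)
        os.replace(tmp, path)

    def get(self, key):
        # Return the file's content, or None if the entity does not exist.
        try:
            return self._path(key).read_bytes()
        except FileNotFoundError:
            return None
```

Reads touch only the local disk, which is the property the rest of this message is trying to preserve.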

The system *does* work: the servers are very fast and they do work
fine when the servers behind them disappear. However, the storage and
transport mechanisms are cumbersome, and we would like to see if there
are suitable alternatives available.

The idea is to replace the disk-based storage on each server with a
NoSQL solution using replication to handle the transport automatically
for us. What we need is:

 * One "master", though being able to have a backup for it that we
could quickly bring into play would be advantageous
 * Each "slave" must have a full copy of the data
 * It does not matter if the slaves do not get updates immediately or
at exactly the same time, as long as they get there quickly
 * Reads must be fast (though understandably it will probably be
slower than reading a system-cached file direct from disk)
 * It would be a bonus if the slaves could be written to too, with the
writes making their way to the other nodes. This is probably a given,
but I thought I'd mention it anyway.

Now, I have read a few things about Cassandra's read performance which
is what has got me a bit worried. However, I have also read quite a
bit about its flexibility in terms of topology, and that the read
performance is very much dependent on how things are set up. For
example, a lot of what I've read describes how when querying a node it
will ask other nodes for information, which it then collates and
returns. Is it possible to configure Cassandra in such a way that a
node only ever asks itself for the data, and if so what sort of
effect will that have on read performance? Our current solution is
designed to avoid having to hit the network, so doing the same here
would be advantageous.

I have also read that Cassandra will distribute data between different
nodes, while we want all to have a full copy of all data. Is it
possible to configure Cassandra in this way?

If this will work, it will be a heck of a lot cleaner and easier to
maintain than the current solution, so we're quite hopeful. :)

Feedback appreciated,

-- 

Russ

RE: Evaluating Cassandra for our use case

Posted by Daniel Kluesing <dk...@bluekai.com>.
>Is it possible to configure Cassandra in such a way that a
>node only ever asks itself for the data, and if so what sort of
>effect will that have on read performance?

Check out the RingCache class, which lets you make your clients smart enough to ask the right server. (Also, if all nodes have all the data, as you mention below, and you set your read consistency level to ONE, a read shouldn't need to ask the other nodes over the network.)

>I have also read that Cassandra will distribute data between different
>nodes, while we want all to have a full copy of all data. Is it
>possible to configure Cassandra in this way?

If you set the replication factor to the number of nodes, then every node will have a full copy. (That might get sticky if you add new servers, since I don't think you can change the replication factor once set.)
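
To illustrate why a replication factor equal to the node count gives every node a full copy, here is a simplified model of ring placement. It is a sketch under stated assumptions, not Cassandra's actual code: the MD5-based `token` function, the `replicas_for` helper and the node names are all made up for illustration, and placement is modelled as "first node clockwise from the key's token, then the next nodes around the ring".

```python
import hashlib
from bisect import bisect_right


def token(value):
    # Toy token function (an assumption): an integer derived from MD5.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor):
    """Return the nodes that hold `key` in this simplified ring model."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    # Order the nodes around the ring by their tokens.
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    # First node clockwise from the key's token, wrapping around the ring.
    start = bisect_right(tokens, token(key)) % len(ring)
    # Walk clockwise to collect the remaining replicas.
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]
```

With four hypothetical nodes and `replication_factor=4`, the walk around the ring visits every node for every key, so each node ends up with all the data. With a smaller factor, each key lands on only a subset of the nodes.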

Re: Evaluating Cassandra for our use case

Posted by Aaron Morton <aa...@thelastpickle.com>.
> Thanks for this, Aaron. It does actually look like Redis may be better
> suited to our needs. I had originally discounted Redis because I had
> the impression that it had volatile storage only, but now I see that
> not to be the case.
>
> Thanks again!
 
Yup, you've got append-only file, foreground snapshot and background snapshot persistence in there.

I strongly recommend following the Redis creator on Twitter if you start playing with it; it's a pretty fast-moving project at times: http://twitter.com/antirez

Aaron



Re: Evaluating Cassandra for our use case

Posted by Russ Brown <pi...@gmail.com>.
On Wed, Jul 28, 2010 at 9:13 PM, Aaron Morton <aa...@thelastpickle.com> wrote:
> Have you considered Redis (http://code.google.com/p/redis/)?
>
> It may be more suited to the master-slave configuration you are after.
>
> - You can have a master to write to, replicate it to an intermediate
> "slave master", and have each web head run a local Redis that slaves
> from the slave master.
> - Back up at the master or the slave master.
> - Writes to the write master would make their way to the web head slaves.
> - Web heads only read from their local slave.
> - Reads will be all in memory and faster than disk.
> - Redis can store a lot of data in memory and also use disk
> (http://blog.zawodny.com/2010/07/24/200000000-keys-in-redis-2-0-0-rc3/).
> - Web heads would have to write to the master, not locally.
>
> It sounds like you're thinking of running a Cassandra node on each web head
> with full replication and only reading locally. I'm not sure if this is the
> best use case, and would like to know what others think. I would imagine you
> would get better overall uptime and performance from running Cassandra as
> a cluster separate from the web heads, with less than full replication.
>

Thanks for this, Aaron. It does actually look like Redis may be better
suited to our needs. I had originally discounted Redis because I had
the impression that it had volatile storage only, but now I see that
not to be the case.

Thanks again!

-- 

Russ

Re: Evaluating Cassandra for our use case

Posted by Aaron Morton <aa...@thelastpickle.com>.
Have you considered Redis (http://code.google.com/p/redis/)?

It may be more suited to the master-slave configuration you are after.

- You can have a master to write to, replicate it to an intermediate "slave master", and have each web head run a local Redis that slaves from the slave master.
- Back up at the master or the slave master.
- Writes to the write master would make their way to the web head slaves.
- Web heads only read from their local slave.
- Reads will be all in memory and faster than disk.
- Redis can store a lot of data in memory and also use disk (http://blog.zawodny.com/2010/07/24/200000000-keys-in-redis-2-0-0-rc3/).
- Web heads would have to write to the master, not locally.

It sounds like you're thinking of running a Cassandra node on each web head with full replication and only reading locally. I'm not sure if this is the best use case, and would like to know what others think. I would imagine you would get better overall uptime and performance from running Cassandra as a cluster separate from the web heads, with less than full replication.

Aaron
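
The read-local, write-to-master layout suggested above can be sketched on the application side roughly as follows. This is an illustration under assumptions, not a recommended client: `LocalReadMasterWriteClient` and `FakeRedis` are hypothetical names, the fake is an in-memory stand-in so the example runs without a server, and its synchronous push to replicas only imitates the end result of Redis replication, which in practice is asynchronous. Any client objects exposing `get` and `set` methods could be passed in instead of the fakes.

```python
class LocalReadMasterWriteClient:
    """Send writes to the master and serve reads from the local replica."""

    def __init__(self, master, local):
        self.master = master  # client for the central write master
        self.local = local    # client for the Redis running on this web head

    def get(self, key):
        # Reads never leave the machine.
        return self.local.get(key)

    def set(self, key, value):
        # Writes go to the master; replication carries them back down.
        return self.master.set(key, value)


class FakeRedis:
    """In-memory stand-in for a Redis node (illustration only)."""

    def __init__(self):
        self.data = {}
        self.replicas = []

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        # Imitate replication by pushing the write to each downstream node.
        for replica in self.replicas:
            replica.set(key, value)
        return True


# Wire up the chain: master -> slave master -> one local node per web head.
master, slave_master = FakeRedis(), FakeRedis()
web_head_1, web_head_2 = FakeRedis(), FakeRedis()
master.replicas = [slave_master]
slave_master.replicas = [web_head_1, web_head_2]

client = LocalReadMasterWriteClient(master=master, local=web_head_1)
client.set("thing-42", "hello")
```

After the write, a `client.get("thing-42")` is answered from the web head's own node, and the second web head sees the same value through the chain.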



