Posted to user@cassandra.apache.org by Jason Baker <ja...@apture.com> on 2011/07/07 03:29:47 UTC

Running hadoop jobs against data in remote data center

I'm just setting up a Cassandra cluster for my company.  For a variety of
reasons, we have the servers that run our Hadoop jobs in our local office
and our production machines in a colocated data center.  We don't want to
run Hadoop jobs against Cassandra servers on the other side of the US from
us, not to mention that we don't want them impacting performance in
production.  What's the best way to handle this?

My first instinct is to add some local servers to the cluster and use
NetworkTopologyStrategy.  This way, the local servers automatically get
updated with the latest changes, and we get a bit of extra redundancy for
our production machines.  Of course, the glaring weakness of this strategy
is that our stats servers aren't in a datacenter with any kind of
production guarantees.  The network connection is relatively slow and
unreliable, the servers may go down at any time, and I generally don't
want to tie our production performance or reliability to these servers.

Is this as dumb an idea as I suspect it is, or can this be made to work?  :-)

Are there any better ways to accomplish what I'm trying to accomplish?
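[For readers following along: the NetworkTopologyStrategy idea above is configured per keyspace, with an independent replica count for each data center. A minimal sketch of building the corresponding CREATE KEYSPACE statement in newer CQL syntax; the data center names "prod" and "office" and the replica counts are made up for illustration:]

```python
# Sketch: build a CQL CREATE KEYSPACE statement for NetworkTopologyStrategy.
# Data center names ("prod", "office") and replica counts are hypothetical.

def create_keyspace_cql(keyspace, replication_per_dc):
    """Return a CREATE KEYSPACE statement with a per-DC replica count."""
    opts = ", ".join(
        "'%s': %d" % (dc, rf) for dc, rf in sorted(replication_per_dc.items())
    )
    return (
        "CREATE KEYSPACE %s WITH replication = "
        "{'class': 'NetworkTopologyStrategy', %s};" % (keyspace, opts)
    )

# Three replicas in the production DC, one in the office DC, so the
# Hadoop jobs have a local copy without production depending on it.
stmt = create_keyspace_cql("stats", {"prod": 3, "office": 1})
print(stmt)
```

[With a layout like this, writes still propagate to the office replica, but as long as requests don't wait on it (see the reply below on consistency levels), the unreliable link only affects how far behind the office copy lags.]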

Re: Running hadoop jobs against data in remote data center

Posted by Aaron Morton <aa...@thelastpickle.com>.
See http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers and http://www.datastax.com/docs/0.8/brisk/about_brisk#about-the-brisk-architecture

It's possible to run multi-DC and use the LOCAL_QUORUM consistency level in your production centre, so the prod code can get on with its life without worrying about the other DC.
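[To make the LOCAL_QUORUM point concrete: a quorum is floor(N/2) + 1, where N is the replication factor counted only within the coordinator's own data center, so replicas in the remote DC never block a request. A small sketch of the arithmetic; the per-DC replica counts are hypothetical:]

```python
# Sketch: quorum arithmetic for LOCAL_QUORUM.
# A quorum over N replicas is floor(N/2) + 1; for LOCAL_QUORUM, N is the
# replication factor of the coordinator's own data center only.

def local_quorum(replicas_in_local_dc):
    return replicas_in_local_dc // 2 + 1

# Hypothetical layout: 3 replicas in the production DC, 1 in the office DC.
replication = {"prod": 3, "office": 1}

# A LOCAL_QUORUM request coordinated in "prod" waits for 2 of its 3 local
# replicas; the office replica is updated asynchronously, so its slow,
# unreliable link never delays the production request.
print(local_quorum(replication["prod"]))    # quorum of the prod replicas
print(local_quorum(replication["office"]))  # quorum of the office replica
```

[This is why the office DC going down, or lagging, doesn't hurt production reads and writes issued at LOCAL_QUORUM, which is the property the original poster was worried about.]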

Hope that helps.


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
