
Posted to user@cassandra.apache.org by Colin MacDonald <co...@sas.com> on 2013/12/18 10:20:47 UTC

Setting up Cassandra to store on a specific node and not replicate

Ahoy the list.  I am evaluating Cassandra in the context of using it as a storage back end for the Titan graph database.

We’ll have several nodes in the cluster.  However, one of our requirements is that data has to be loaded into and stored on a specific node and only on that node.  Also, it cannot be replicated around the system, at least not stored persistently on disk – we will of course make copies in memory and on the wire as we access remote nodes.  These requirements are non-negotiable.

We understand that this is essentially the opposite of what Cassandra is designed for, and that we’re missing all the scalability and robustness, but is it technically possible?

First, I would need to create a custom partitioner – is there any tutorial on that?  I see a few “you don’t need to” threads, but I do.

Second, how easy is it to have Cassandra not replicate data between nodes in a cluster?  I’m not seeing an obvious configuration option for that, presumably because it obviates much of the point of using Cassandra, but again, we’re working within some rather unfortunate constraints.

Any hints or suggestions would be most gratefully received.

Kind regards,

-Colin MacDonald-

RE: Setting up Cassandra to store on a specific node and not replicate

Posted by Colin MacDonald <co...@sas.com>.

> -----Original Message-----
> From: Sylvain Lebresne [mailto:sylvain@datastax.com]
> Sent: 18 December 2013 12:46
> Look up NetworkTopologyStrategy. This is what you want to use, and it's
> configured when you create the keyspace, not in cassandra.yaml.
> 
> Basically, you define your topology in cassandra-topology.yaml (where you
> manually set which node is in which DC, which you can really just see as
> assigning nodes to named groups) and then you define the replication
> factor for each DC (so if RF=1 on the one-node group and 0 on the "other
> nodes" group, C* will gladly honor it and store no data on the nodes of
> the "other nodes" group).
> 
> --
> Sylvain

Thank you so much, that's clear and helpful.  I appreciate you taking the time to explain it.

-Colin-

Re: Setting up Cassandra to store on a specific node and not replicate

Posted by Sylvain Lebresne <sy...@datastax.com>.

Look up NetworkTopologyStrategy. This is what you want to use, and it's
configured when you create the keyspace, not in cassandra.yaml.

Basically, you define your topology in cassandra-topology.yaml (where you
manually set which node is in which DC, which you can really just see as
assigning nodes to named groups) and then you define the replication
factor for each DC (so if RF=1 on the one-node group and 0 on the "other
nodes" group, C* will gladly honor it and store no data on the nodes of
the "other nodes" group).

--
Sylvain
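
A minimal sketch of the keyspace definition described above, written as a small Python helper that only builds the CQL text (it does not connect to a cluster). The keyspace name "titan" and the datacenter names "storage" and "other" are hypothetical placeholders; the real datacenter names must match whatever the snitch configuration reports.

```python
def create_keyspace_cql(keyspace, rf_by_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy.

    rf_by_dc maps a datacenter name to its replication factor, e.g.
    {"storage": 1, "other": 0}. All names here are illustrative.
    """
    if not rf_by_dc:
        raise ValueError("at least one datacenter is required")
    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in rf_by_dc.items():
        if rf < 0:
            raise ValueError(f"replication factor for {dc!r} must be >= 0")
        options.append(f"'{dc}': {rf}")
    body = ", ".join(options)
    return f"CREATE KEYSPACE {keyspace} WITH replication = {{{body}}};"


# One copy in the single-node "storage" group, none anywhere else.
statement = create_keyspace_cql("titan", {"storage": 1, "other": 0})
print(statement)
```

The resulting statement would be run once through cqlsh or a driver when the keyspace is created; nothing in cassandra.yaml changes.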


On Wed, Dec 18, 2013 at 1:32 PM, Colin MacDonald <co...@sas.com> wrote:

> > -----Original Message-----
> > From: Sylvain Lebresne [mailto:sylvain@datastax.com]
> > Sent: 18 December 2013 10:45
> >
> > You seem to be well aware that you're not looking at using Cassandra for
> > what it is designed for (which obviously implies you'll need to expect
> > under-optimal behavior), so I'm not going to insist on it.
>
> Very kind of you. ;)
>
> I suspect that this requirement is viscerally horrifying, but as I
> said, it's idiosyncratic, specified by an... idiosyncrat.
>
> It's a pragmatic solution that I'm looking for, just to get a proof of
> concept going, it doesn't have to be elegant at this stage.
>
> > As to how you could achieve that, a relatively simple solution (that
> > does not require writing your own partitioner) would consist in using 2
> > datacenters (that obviously don't have to be real physical datacenters),
> > to put the one that should have it all in one datacenter with RF=1 and
> > to put all other nodes in the other datacenter with RF=0.
> >
> > As Janne said, you could still have hints being written by other nodes
> > if the one storage node is dead, but you can set the system property
> > cassandra.maxHintTTL to 0 to disable hints.
>
> Thanks Sylvain, I'll look into that.  I'm coming to Cassandra cold, I
> hadn't even spotted that the replication factor was configurable - I don't
> see an option for it in the cassandra.yaml that came with 2.0.2.  I should be
> able to figure it out though, and that's great news, it looks like it takes
> care of one issue.
>
> However, I'm not immediately seeing how to control which node will get the
> single copy of the data.  Won't the partitioner still allocate data around
> the cluster?
>
> Ah, is a "datacentre" a logical group *within* an overall cluster?  So I
> can create a separate "datacentre" for each node, and if I write to that
> node the data will be forced to stay in that datacentre, i.e. that node?
>
> I do apologise for the noobish questions, my attention is currently split
> between investigating several possible solutions.  I rather favour
> Cassandra though, if I can hobble it appropriately.
>
> Kind regards,
>
> -Colin MacDonald-
>
>

RE: Setting up Cassandra to store on a specific node and not replicate

Posted by Colin MacDonald <co...@sas.com>.

> -----Original Message-----
> From: Sylvain Lebresne [mailto:sylvain@datastax.com]
> Sent: 18 December 2013 10:45
> 
> You seem to be well aware that you're not looking at using Cassandra for
> what it is designed for (which obviously implies you'll need to expect
> under-optimal behavior), so I'm not going to insist on it.

Very kind of you. ;)

I suspect that this requirement is viscerally horrifying, but as I said, it's idiosyncratic, specified by an... idiosyncrat.

It's a pragmatic solution that I'm looking for, just to get a proof of concept going, it doesn't have to be elegant at this stage.

> As to how you could achieve that, a relatively simple solution (that does
> not require writing your own partitioner) would consist in using 2
> datacenters (that obviously don't have to be real physical datacenters), to
> put the one that should have it all in one datacenter with RF=1 and to put
> all other nodes in the other datacenter with RF=0.
> 
> As Janne said, you could still have hints being written by other nodes if
> the one storage node is dead, but you can set the system property
> cassandra.maxHintTTL to 0 to disable hints.

Thanks Sylvain, I'll look into that.  I'm coming to Cassandra cold, I hadn't even spotted that the replication factor was configurable - I don't see an option for it in the cassandra.yaml that came with 2.0.2.  I should be able to figure it out though, and that's great news, it looks like it takes care of one issue.

However, I'm not immediately seeing how to control which node will get the single copy of the data.  Won't the partitioner still allocate data around the cluster?

Ah, is a "datacentre" a logical group *within* an overall cluster?  So I can create a separate "datacentre" for each node, and if I write to that node the data will be forced to stay in that datacentre, i.e. that node?

I do apologise for the noobish questions, my attention is currently split between investigating several possible solutions.  I rather favour Cassandra though, if I can hobble it appropriately.

Kind regards,

-Colin MacDonald-

Re: Setting up Cassandra to store on a specific node and not replicate

Posted by Janne Jalkanen <ja...@ecyrd.com>.

Probably yes, if you also disabled any sort of failovers from the token-aware client…

(Talking about this makes you realize how many failsafes Cassandra has. And still you can lose data… :-P)

/Janne

On 18 Dec 2013, at 20:31, Robert Coli <rc...@eventbrite.com> wrote:

> On Wed, Dec 18, 2013 at 2:44 AM, Sylvain Lebresne <sy...@datastax.com> wrote:
> As Janne said, you could still have hints being written by other nodes if the one storage node is dead, but you can set the system property cassandra.maxHintTTL to 0 to disable hints.
> 
> If one uses a Token Aware client with RF=1, that would seem to preclude hinting even without disabling HH for the entire system; if the coordinator is always the single replica, why would it send a copy anywhere else?
> 
> =Rob

Re: Setting up Cassandra to store on a specific node and not replicate

Posted by Sylvain Lebresne <sy...@datastax.com>.

On Wed, Dec 18, 2013 at 7:31 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Wed, Dec 18, 2013 at 2:44 AM, Sylvain Lebresne <sy...@datastax.com> wrote:
>
>> As Janne said, you could still have hints being written by other nodes if
>> the one storage node is dead, but you can set the system
>> property cassandra.maxHintTTL to 0 to disable hints.
>>
>
> If one uses a Token Aware client with RF=1, that would seem to preclude
> hinting even without disabling HH for the entire system; if the coordinator
> is always the single replica, why would it send a copy anywhere else?
>

Colin explicitly said that he would have several nodes and I said I wasn't
going to judge, so I implicitly assumed there was a reason for having
multiple nodes.

If you're only ever going to hit one node, then using a token aware
client is over-complicating it. Just use a one node cluster and you'll have
nothing to worry about or to configure.

That being said, Colin, do be aware that as far as I can tell there is
indeed relatively little benefit to having a multi-node cluster on which
all data is on one node (in particular, there is no cache at the
coordinator level, so that even if your client hits other nodes, everything
will still be forwarded to the one node that stores it all; the other nodes
won't really store anything, not even in memory).

--
Sylvain
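
To make the forwarding point concrete, here is a toy model in Python. It is not Cassandra's placement algorithm (real replica selection depends on tokens and the snitch); it only assumes that each datacenter contributes its first RF nodes as replicas, which is enough to show that with RF=1 on a one-node group and RF=0 elsewhere, every request ends up on the same node regardless of the coordinator. The node and datacenter names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    dc: str
    rows: dict = field(default_factory=dict)  # data this node actually stores
    forwards: int = 0                          # requests it had to pass on


def replicas(nodes, rf_by_dc):
    """Toy placement: the first `rf` nodes of each datacenter hold the data."""
    chosen = []
    for dc, rf in rf_by_dc.items():
        chosen.extend([n for n in nodes if n.dc == dc][:rf])
    return chosen


def write(coordinator, nodes, rf_by_dc, key, value):
    """The coordinator sends the write to every replica and keeps no copy."""
    for replica in replicas(nodes, rf_by_dc):
        if replica is not coordinator:
            coordinator.forwards += 1
        replica.rows[key] = value


def read(coordinator, nodes, rf_by_dc, key):
    """The coordinator asks a replica; it has no cache of its own."""
    replica = replicas(nodes, rf_by_dc)[0]
    if replica is not coordinator:
        coordinator.forwards += 1
    return replica.rows.get(key)


nodes = [Node("n1", "storage"), Node("n2", "other"), Node("n3", "other")]
rf = {"storage": 1, "other": 0}

write(nodes[1], nodes, rf, "k", "v")     # client talks to n2
value = read(nodes[2], nodes, rf, "k")   # client talks to n3
```

In this model n2 and n3 each forward one request and hold no rows, while n1 holds the only copy, which is the behaviour described in the message above.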

Re: Setting up Cassandra to store on a specific node and not replicate

Posted by Robert Coli <rc...@eventbrite.com>.

On Wed, Dec 18, 2013 at 2:44 AM, Sylvain Lebresne <sy...@datastax.com> wrote:

> As Janne said, you could still have hints being written by other nodes if
> the one storage node is dead, but you can set the system
> property cassandra.maxHintTTL to 0 to disable hints.
>

If one uses a Token Aware client with RF=1, that would seem to preclude
hinting even without disabling HH for the entire system; if the coordinator
is always the single replica, why would it send a copy anywhere else?

=Rob

Re: Setting up Cassandra to store on a specific node and not replicate

Posted by Sylvain Lebresne <sy...@datastax.com>.

You seem to be well aware that you're not looking at using Cassandra for
what it is designed for (which obviously implies you'll need to expect
under-optimal behavior), so I'm not going to insist on it.

As to how you could achieve that, a relatively simple solution (that does
not require writing your own partitioner) would consist in using 2
datacenters (that obviously don't have to be real physical datacenters), to
put the one that should have it all in one datacenter with RF=1 and to put
all other nodes in the other datacenter with RF=0.

As Janne said, you could still have hints being written by other nodes if
the one storage node is dead, but you can set the system
property cassandra.maxHintTTL to 0 to disable hints.

--
Sylvain
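
A rough sketch of how the two logical datacenters might be written down, assuming PropertyFileSnitch and its cassandra-topology.properties file (one `ip=DC:RACK` entry per node plus a `default` entry). That choice of snitch is an assumption; the thread itself mentions cassandra-topology.yaml, which belongs to a different snitch. The IP addresses, the datacenter names "storage" and "other", and the rack name are hypothetical, and only IPv4 addresses are handled.

```python
def topology_properties(storage_ip, other_ips,
                        storage_dc="storage", other_dc="other", rack="rack1"):
    """Return the text of a topology properties file with two logical DCs.

    The storage node sits alone in its own datacenter; every other node,
    and any node not listed, falls into the second datacenter.
    """
    lines = [f"{storage_ip}={storage_dc}:{rack}"]
    lines += [f"{ip}={other_dc}:{rack}" for ip in other_ips]
    lines.append(f"default={other_dc}:{rack}")
    return "\n".join(lines) + "\n"


text = topology_properties("10.0.0.1", ["10.0.0.2", "10.0.0.3"])
print(text)
```

The same file would need to be present on every node, and the datacenter names in it must match the ones used in the keyspace's replication settings.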

RE: Setting up Cassandra to store on a specific node and not replicate

Posted by Colin MacDonald <co...@sas.com>.

> -----Original Message-----
> From: Janne Jalkanen [mailto:janne.jalkanen@ecyrd.com]
> 
> Essentially you want to turn off all the features which make Cassandra a
> robust product ;-).

Oh, I don't want to, but sadly those are the requirements that I have to work with.

Again, the context is using it as the storage back end for a graph database.  I'm currently looking at the Titan graph DBMS, which supports the use of Cassandra or HBase for a distributed graph, both of which will need to be hobbled to prevent them working the way they're designed.

So it really is a question of: *can* I cripple Cassandra in this way, and if so how?

Thanks for the response.

-Colin MacDonald-

Re: Setting up Cassandra to store on a specific node and not replicate

Posted by Janne Jalkanen <ja...@ecyrd.com>.

This may be hard because the coordinator could store hinted handoff (HH) data on disk. You could turn HH off and have RF=1 to keep data on a single instance, but you would be likely to lose data if you had any problems with your instances… Also you would need to tweak the memtable flushing so that it goes to disk more often than the ten seconds which is the default. Or lose data. You will also have an "interesting" time scaling your cluster and would have to plan for that in your custom database.

Essentially you want to turn off all the features which make Cassandra a robust product ;-). Without knowing your requirements more precisely, I'd be inclined to recommend manually sharding on MariaDB or Postgres instances instead, or use their underlying storage engines directly (e.g. InnoDB), if you're just looking for a key-value store.

/Janne
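
For the "turn HH off" step, a small sketch that rewrites the text of a cassandra.yaml so that `hinted_handoff_enabled` is false. It assumes the setting, if present, sits on its own top-level line; the helper name and sample input are illustrative only. The JVM flag is simply the `cassandra.maxHintTTL` system property mentioned elsewhere in this thread, written in `-D` form.

```python
import re

# System property from the thread, expressed as a JVM option (assumed form).
MAX_HINT_TTL_FLAG = "-Dcassandra.maxHintTTL=0"


def disable_hinted_handoff(yaml_text):
    """Return yaml_text with hinted_handoff_enabled forced to false.

    Replaces an existing top-level setting, or appends one if it is missing.
    """
    setting = "hinted_handoff_enabled: false"
    pattern = re.compile(r"^hinted_handoff_enabled:.*$", re.MULTILINE)
    if pattern.search(yaml_text):
        return pattern.sub(setting, yaml_text)
    separator = "" if (not yaml_text or yaml_text.endswith("\n")) else "\n"
    return yaml_text + separator + setting + "\n"


sample = "cluster_name: 'Test Cluster'\nhinted_handoff_enabled: true\n"
patched = disable_hinted_handoff(sample)
print(patched)
```

This only covers hints; it does nothing about the durability trade-offs raised in the message above.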
