Posted to user@cassandra.apache.org by mcasandra <mo...@gmail.com> on 2011/02/15 01:45:19 UTC

Data distribution

Couple of questions:

1) If I insert a key and want to verify which node it went to, how do I do
that?
2) How can I verify that replication is working? That is, how do I check
that a CF row got inserted on 2 nodes if the replication factor is set to 2?
3) What happens if I just update the keyspace and change the replication
factor, say from 2 to 3? Would Cassandra automatically replicate the old
data to the 3rd node?
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-distribution-tp6025869p6025869.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Data distribution

Posted by mcasandra <mo...@gmail.com>.
HH is one aspect; the other is that when a new node joins, some rebalancing
needs to occur, and this may take time as well.

But I also understand it would add a lot of complexity to the code.

Is there any place where I can read other things of concern that one should
be aware of?

Re: Data distribution

Posted by Robert Coli <rc...@digg.com>.
On Tue, Feb 15, 2011 at 3:05 PM, mcasandra <mo...@gmail.com> wrote:
>
> Is there a way to let the new node join the cluster in the background and
> make it live to clients only after it has finished node repair, syncing
> data, etc., and at the end sync whatever keys or trees are needed before
> it comes to life? I know it can be tricky since it needs to be live as
> soon as it takes over the keys.

In general, no. This sort of thing has been proposed a few times, in
different contexts, and has not been implemented.

https://issues.apache.org/jira/browse/CASSANDRA-768

=Rob

Re: Data distribution

Posted by mcasandra <mo...@gmail.com>.
Thanks! Would Hector take care of not load balancing to the new node until
it's ready?

Also, when repair is occurring in the background, is there a status I can
look at to see that repair is occurring for key ABC?

Re: Data distribution

Posted by Matthew Dennis <md...@datastax.com>.
Assuming you aren't changing the RF, the normal bootstrap process takes care
of all the problems like that, making sure things work correctly.

Most importantly, if something fails (either the new node or any of the
existing nodes) you can recover from it.

Just don't connect clients directly to that new node until it's fully in the
ring.

On Tue, Feb 15, 2011 at 5:05 PM, mcasandra <mo...@gmail.com> wrote:

>
> Is there a way to let the new node join the cluster in the background and
> make it live to clients only after it has finished node repair, syncing
> data, etc., and at the end sync whatever keys or trees are needed before
> it comes to life? I know it can be tricky since it needs to be live as
> soon as it takes over the keys.
>
> This way we know we are adding nodes only when we think it's all ready.

Re: Data distribution

Posted by mcasandra <mo...@gmail.com>.
Is there a way to let the new node join the cluster in the background and
make it live to clients only after it has finished node repair, syncing
data, etc., and at the end sync whatever keys or trees are needed before it
comes to life? I know it can be tricky since it needs to be live as soon as
it takes over the keys.

This way we know we are adding nodes only when we think it's all ready.

Re: Data distribution

Posted by Matthew Dennis <md...@datastax.com>.
Regardless of whether you increase the RF, RR (read repair) happens based on
the read_repair_chance setting.  RR happens after the request has been
replied to, though, so if you increase the RF and then read, that read might
get stale or missing data.  RR would then put the correct value on all the
correct nodes so future reads see the correct data, but the initial read
might not.

If you are already reading at CL.ONE, then after increasing the RF you need
to read at CL.QUORUM to maintain the same consistency.  If you're reading at
CL.QUORUM or CL.ALL, then after increasing the RF you need to read at CL.ALL
to maintain the same consistency.  You have to do this until all the nodes
are consistent again; if you depend on RR alone, that window is unbounded.
If you run repair, then once the repair is complete you can go back to your
original CL.

tl;dr run nodetool repair after increasing the RF
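The arithmetic behind the advice above can be sketched in a few lines. The
quorum formula floor(RF/2) + 1 is how Cassandra sizes QUORUM; the rest is
just counting the replicas that may still be empty until repair finishes
(a plain illustration, not Cassandra code):

```python
def quorum(rf):
    """Cassandra's QUORUM size: a strict majority of replicas."""
    return rf // 2 + 1

def min_read_count(old_rf, new_rf):
    """Smallest number of replicas a read must touch to be guaranteed
    to hit at least one replica that existed before the RF increase.
    (new_rf - old_rf replicas may be empty until repair completes.)"""
    possibly_empty = new_rf - old_rf
    return possibly_empty + 1

print(quorum(3))            # 2
print(min_read_count(2, 3)) # 2
```

With RF raised from 2 to 3, one of the three replicas may be empty, so a
read must touch at least two nodes to be sure of reaching one that has the
data, which is exactly QUORUM of 3.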

On Mon, Feb 14, 2011 at 7:52 PM, mcasandra <mo...@gmail.com> wrote:

>
> When I increase the replication factor does the repair happen automatically
> in background when client first tries to access data from the node where
> data does not exist.
>
> Or the nodetool repair need to run after increasing the replication factor.

RE: Data distribution

Posted by mcasandra <mo...@gmail.com>.
When I increase the replication factor, does repair happen automatically in
the background when a client first tries to access data from a node where
the data does not exist?

Or does nodetool repair need to be run after increasing the replication
factor?

Re: Data distribution

Posted by Matthew Dennis <md...@datastax.com>.
On Mon, Feb 14, 2011 at 6:58 PM, Dan Hendry <da...@gmail.com> wrote:

> > 1) If I insert a key and want to verify which node it went to then how
> > do I do that?
>
> I don't think you can and there should be no reason to care. Cassandra
> abstracts where data is being stored, think in terms of consistency levels
> not actual nodes.
>

If you actually mean which nodes a particular write went to, you'd have to
ask each node it's supposed to live on, or watch the logs with debug logging
enabled.  If you just want to know which nodes it's *supposed* to go to, JMX
exposes getNaturalEndpoints and Thrift exposes describe_ring.  describe_ring
gives you the token ranges for each node; you can then fit your key's token
inside those ranges to get the list of nodes the key belongs on.
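The "fit your key inside those ranges" step can be sketched as follows,
assuming RandomPartitioner. The MD5-based token computation is an
approximation of what the partitioner does, and the ring tokens and node
names are made up for illustration; real replica sets also include the next
RF-1 nodes on the ring, not just the primary:

```python
import hashlib

RING_SIZE = 2 ** 127  # RandomPartitioner token space

def key_token(key):
    """MD5 of the key as a positive integer, roughly what
    RandomPartitioner does (a sketch, not the exact implementation)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

def primary_owner(token, ring):
    """ring: list of (node_token, node_name) sorted by token.
    A key belongs to the first node whose token is >= the key's
    token, wrapping around to the lowest-token node."""
    for node_token, name in ring:
        if token <= node_token:
            return name
    return ring[0][1]  # wrapped past the highest token

# Hypothetical three-node ring with evenly spaced tokens.
ring = [(RING_SIZE // 3, "node1"),
        (2 * RING_SIZE // 3, "node2"),
        (RING_SIZE - 1, "node3")]

print(primary_owner(key_token("mykey"), ring))
```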


>
> > 2) How can I verify if the replication is working. That is how do I check
> > that CF row got inserted in 2 nodes if replication factor is set to 2.
>
> Perform a successful write at consistency level ALL via a thrift client.
>

Performing a successful write at CL.ALL will work.  You can also write at
whatever CL you want, then look at describe_ring to see where the data
should have ended up, then ask each of those nodes individually at CL.ONE to
see if it has the data.  If you do this, make sure Read Repair (RR) is off;
otherwise the first read you do might repair the data onto another node, and
then a later read only tells you RR is working, not that the initial
replication worked.
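The per-node check can be sketched like this. `read_at_one` is a
hypothetical stand-in for a CL.ONE read issued against one specific node
(e.g. via a Thrift client pinned to that host); here it is backed by a
dict of mocked per-node data so the logic is runnable:

```python
# Mocked per-node contents: node -> {row key: value}. In practice each
# entry would come from a CL.ONE read pinned to that node.
node_data = {
    "node1": {"row1": "v"},
    "node2": {"row1": "v"},
    "node3": {},  # should not hold row1 with RF=2
}

def read_at_one(node, key):
    """Hypothetical stand-in for a CL.ONE read against a single node."""
    return node_data[node].get(key)

def verify_replication(key, expected_replicas):
    """Return the expected replicas that are missing the key."""
    return [n for n in expected_replicas if read_at_one(n, key) is None]

print(verify_replication("row1", ["node1", "node2"]))  # []
```

An empty result means every replica that describe_ring said should hold the
key actually returned it.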

RE: Data distribution

Posted by Dan Hendry <da...@gmail.com>.
> 1) If I insert a key and want to verify which node it went to then how do
> I do that?

I don't think you can and there should be no reason to care. Cassandra
abstracts where data is being stored, think in terms of consistency levels
not actual nodes. 

> 2) How can I verify if the replication is working. That is how do I check
> that CF row got inserted in 2 nodes if replication factor is set to 2.

Perform a successful write at consistency level ALL via a thrift client.

> 3) What happens if I just update the keyspace and change the replication
> factor say 2 to 3. Would cassandra automatically replicate the old data to
> the 3rd node?

http://wiki.apache.org/cassandra/Operations#Replication

"but increasing it may be done if you (a) read at ConsistencyLevel.QUORUM or
ALL (depending on your existing replication factor) to make sure that a
replica that actually has the data is consulted, (b) are willing to accept
downtime while anti-entropy repair runs (see below), or (c) are willing to
live with some clients potentially being told no data exists if they read
from the new replica location(s) until repair is done."

Dan

-----Original Message-----
From: mcasandra [mailto:mohitanchlia@gmail.com] 
Sent: February-14-11 19:45
To: cassandra-user@incubator.apache.org
Subject: Data distribution


Couple of questions:

1) If I insert a key and want to verify which node it went to then how do I
do that?
2) How can I verify if the replication is working. That is how do I check
that CF row got inserted in 2 nodes if replication factor is set to 2.
3) What happens if I just update the keyspace and change the replication
factor say 2 to 3. Would cassandra automatically replicate the old data to
the 3rd node?