Posted to user@cassandra.apache.org by Scott McCarty <sc...@metajinx.com> on 2011/01/19 21:17:55 UTC

Basic question on distributed delete

I've been searching on wikis and FAQs for a definitive answer to this and
haven't found it yet so I thought I'd ask people here.

We have a 5-node cluster set up with a replication factor of 3.  We're doing
write operations (using batch mutates that include deletions) with a QUORUM
consistency level.  For the sake of discussion, let's say that the nodes
are always up and fully operational and responsive.
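
For concreteness, a minimal sketch of this kind of batched delete at
QUORUM, written against the DataStax Python driver purely for
illustration; the contact points, keyspace, table, and column names
below are placeholders:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, SimpleStatement

    # Placeholder contact points for the 5-node cluster and a hypothetical
    # keyspace/table; RF=3 is configured on the keyspace itself.
    cluster = Cluster(['10.0.0.1', '10.0.0.2'])
    session = cluster.connect('my_keyspace')

    # A write batch mixing an update with a column deletion, sent at
    # QUORUM: with RF=3 the call returns once 2 of 3 replicas acknowledge.
    batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
    batch.add(SimpleStatement("UPDATE users SET email = %s WHERE id = %s"),
              ('new@example.com', 'user-42'))
    batch.add(SimpleStatement("DELETE old_email FROM users WHERE id = %s"),
              ('user-42',))
    session.execute(batch)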

When we do a delete on a column in the above configuration, the call to the
server won't return until 2 of the 3 replicas are written to unless there's
an error.  That part is well-documented and understood.  The question I have
is whether or not the last of the 3 replica nodes gets the delete request
(which would result in a tombstone being added to it).  Does the code finish
up writing to the rest of the replicas in a background thread?

What I've read implies that the 3 replicas will get the delete request but
it's not explicit.

If there's no guarantee that the 3rd replica will get it (assuming that
there aren't node problems that would preclude it from receiving the delete
request), then that implies to me that I should follow each delete with a
read so a ReadRepair will pick up the discrepancy and fix it (or do a
nodetool repair).  Or am I missing something?

An underlying question is if anything special needs to be done when nothing
"out of the ordinary" happens to the cluster (e.g., node goes down, network
problems, etc.) in order to have eventual consistency.

Thanks,
  Scott

Re: Basic question on distributed delete

Posted by Jonathan Ellis <jb...@gmail.com>.
Even without machines being down, the guarantee Cassandra gives you is
exactly the ConsistencyLevel.  Everything else is best-effort.  In
particular if you hammer Cassandra with more requests than it can
fulfill, it will drop some instead of, say, running out of memory
trying to queue them all up.
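
To make that concrete from the client side: an operation that can't be
fulfilled at the requested ConsistencyLevel surfaces as a timeout or
unavailable error rather than being silently queued.  A rough sketch
with the DataStax Python driver (addresses and names are placeholders):

    from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['10.0.0.1'])           # placeholder contact point
    session = cluster.connect('my_keyspace')  # hypothetical keyspace

    delete = SimpleStatement("DELETE FROM users WHERE id = %s",
                             consistency_level=ConsistencyLevel.QUORUM)
    try:
        session.execute(delete, ('user-42',))
    except WriteTimeout:
        # Fewer than quorum (2 of 3) replicas acknowledged in time.  The
        # delete may still have reached some replicas; deletes are
        # idempotent, so retrying is usually safe.
        raise
    except Unavailable:
        # Not enough replicas were considered up to even attempt QUORUM.
        raise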

On Wed, Jan 19, 2011 at 3:26 PM, Scott McCarty <sc...@pana.ma> wrote:
> Okay, this helps.  Cassandra works as I expected in the theoretically "pure"
> case (writing to the rest of the replicas in a background thread).
> I asked the question because we've been struggling to understand why we're
> seeing inconsistencies when we haven't had nodes go down, etc.  (However,
> even though they've not gone down, I definitely can't say that there haven't
> been communication hiccups that might account for some of the
> inconsistencies.)
> We're definitely now aware that we need to run nodetool repair periodically,
> but we're seeing some apparently permanent inconsistencies when we haven't
> had any known node failures, so I wanted to check on my basic understanding
> of the distributed delete operation.
> Thanks,
>   Scott
> On Wed, Jan 19, 2011 at 2:05 PM, Peter Schuller
> <pe...@infidyne.com> wrote:
>>
>> > When we do a delete on a column in the above configuration, the call to the
>> > server won't return until 2 of the 3 replicas are written to unless there's
>> > an error.  That part is well-documented and understood.  The question I have
>> > is whether or not the last of the 3 replica nodes gets the delete request
>> > (which would result in a tombstone being added to it).  Does the code finish
>> > up writing to the rest of the replicas in a background thread?
>>
>> Yes, it does. But this is only strictly true in the theoretical,
>> unrealistic case where absolutely all nodes are always up and working
>> and there are no glitches, etc.
>>
>> > What I've read implies that the 3 replicas will get the delete request but
>> > it's not explicit.
>> > If there's no guarantee that the 3rd replica will get it (assuming that
>> > there aren't node problems that would preclude it from receiving the delete
>> > request), then that implies to me that I should follow each delete with a
>> > read so a ReadRepair will pick up the discrepancy and fix it (or do a
>> > nodetool repair).  Or am I missing something?
>>
>> No need for that. The QUORUM consistency level is about just that -
>> consistency, in terms of what you are or are not guaranteed to see on
>> a subsequent read that is also on QUORUM. It is not meant to affect
>> the number of nodes that receive the data. That is always determined
>> by the replication strategy, but for each individual request the set
>> of recipients is effectively filtered by real-life concerns (e.g.,
>> RPC messages timing out, nodes being marked as down, etc.).
>>
>> > An underlying question is if anything special needs to be done when nothing
>> > "out of the ordinary" happens to the cluster (e.g., node goes down, network
>> > problems, etc.) in order to have eventual consistency.
>>
>> Unless your situation is special and you really know what you're
>> doing, the critical point is that you should run 'nodetool repair'
>> often enough relative to GCGraceSeconds:
>>
>>   http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
>>
>> --
>> / Peter Schuller
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Basic question on distributed delete

Posted by Scott McCarty <sc...@pana.ma>.
Okay, this helps.  Cassandra works as I expected in the theoretically "pure"
case (writing to the rest of the replicas in a background thread).

I asked the question because we've been struggling to understand why we're
seeing inconsistencies when we haven't had nodes go down, etc.  (However,
even though they've not gone down, I definitely can't say that there haven't
been communication hiccups that might account for some of the
inconsistencies.)

We're definitely now aware that we need to run nodetool repair periodically,
but we're seeing some apparently permanent inconsistencies when we haven't
had any known node failures, so I wanted to check on my basic understanding
of the distributed delete operation.

Thanks,
  Scott

On Wed, Jan 19, 2011 at 2:05 PM, Peter Schuller <peter.schuller@infidyne.com> wrote:

> > When we do a delete on a column in the above configuration, the call to the
> > server won't return until 2 of the 3 replicas are written to unless there's
> > an error.  That part is well-documented and understood.  The question I have
> > is whether or not the last of the 3 replica nodes gets the delete request
> > (which would result in a tombstone being added to it).  Does the code finish
> > up writing to the rest of the replicas in a background thread?
>
> Yes, it does. But this is only strictly true in the theoretical,
> unrealistic case where absolutely all nodes are always up and working
> and there are no glitches, etc.
>
> > What I've read implies that the 3 replicas will get the delete request but
> > it's not explicit.
> > If there's no guarantee that the 3rd replica will get it (assuming that
> > there aren't node problems that would preclude it from receiving the delete
> > request), then that implies to me that I should follow each delete with a
> > read so a ReadRepair will pick up the discrepancy and fix it (or do a
> > nodetool repair).  Or am I missing something?
>
> No need for that. The QUORUM consistency level is about just that -
> consistency, in terms of what you are or are not guaranteed to see on
> a subsequent read that is also on QUORUM. It is not meant to affect
> the number of nodes that receive the data. That is always determined
> by the replication strategy, but for each individual request the set
> of recipients is effectively filtered by real-life concerns (e.g.,
> RPC messages timing out, nodes being marked as down, etc.).
>
> > An underlying question is if anything special needs to be done when nothing
> > "out of the ordinary" happens to the cluster (e.g., node goes down, network
> > problems, etc.) in order to have eventual consistency.
>
> Unless your situation is special and you really know what you're
> doing, the critical point is that you should run 'nodetool repair'
> often enough relative to GCGraceSeconds:
>
>   http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
>
> --
> / Peter Schuller
>

Re: Basic question on distributed delete

Posted by Peter Schuller <pe...@infidyne.com>.
> When we do a delete on a column in the above configuration, the call to the
> server won't return until 2 of the 3 replicas are written to unless there's
> an error.  That part is well-documented and understood.  The question I have
> is whether or not the last of the 3 replica nodes gets the delete request
> (which would result in a tombstone being added to it).  Does the code finish
> up writing to the rest of the replicas in a background thread?

Yes, it does. But this is only strictly true in the theoretical,
unrealistic case where absolutely all nodes are always up and working
and there are no glitches, etc.

> What I've read implies that the 3 replicas will get the delete request but
> it's not explicit.
> If there's no guarantee that the 3rd replica will get it (assuming that
> there aren't node problems that would preclude it from receiving the delete
> request), then that implies to me that I should follow each delete with a
> read so a ReadRepair will pick up the discrepancy and fix it (or do a
> nodetool repair).  Or am I missing something?

No need for that. The QUORUM consistency level is about just that -
consistency, in terms of what you are or are not guaranteed to see on
a subsequent read that is also on QUORUM. It is not meant to affect
the number of nodes that receive the data. That is always determined
by the replication strategy, but for each individual request the set
of recipients is effectively filtered by real-life concerns (e.g.,
RPC messages timing out, nodes being marked as down, etc.).
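
To spell out the arithmetic: with RF=3 a quorum is floor(3/2) + 1 = 2,
so a QUORUM write and a later QUORUM read each touch at least 2
replicas, and since 2 + 2 > 3 the two sets must overlap in at least one
replica that has the tombstone.  A trivial sketch of that condition:

    # Quorum overlap for RF=3: the classic R + W > N condition is what
    # gives read-your-writes at QUORUM, independent of the third replica.
    rf = 3
    quorum = rf // 2 + 1           # 2
    write_acks = quorum            # replicas that acknowledged the delete
    read_replicas = quorum         # replicas consulted by a QUORUM read
    assert write_acks + read_replicas > rf   # 2 + 2 > 3, so they overlap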

> An underlying question is if anything special needs to be done when nothing
> "out of the ordinary" happens to the cluster (e.g., node goes down, network
> problems, etc.) in order to have eventual consistency.

Unless your situation is special and you really know what you're
doing, the critical point is that you should run 'nodetool repair'
often enough relative to GCGraceSeconds:

   http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
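
As a rough illustration of the constraint (the numbers are only
examples; 864000 seconds, i.e. 10 days, is the default GCGraceSeconds):
every node needs a full repair within each GCGraceSeconds window,
otherwise tombstones can be collected on some replicas before the
others have seen the delete, and the deleted data can come back.

    # Rough sanity check of a repair schedule against GCGraceSeconds.
    # 864000 (10 days) is the default; the weekly interval is an example.
    gc_grace_seconds = 864000
    repair_interval = 7 * 86400      # e.g. run 'nodetool repair' weekly
    repair_duration = 86400          # allow up to a day for repair to run

    assert repair_interval + repair_duration <= gc_grace_seconds, \
        "repair must complete within GCGraceSeconds or deletes may resurrect"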

-- 
/ Peter Schuller