Posted to dev@cassandra.apache.org by Jun Rao <ju...@almaden.ibm.com> on 2009/04/01 19:41:08 UTC

Re: handling deletes

I am wondering if this is part of the bigger issue of data consistency.

Following your example: a row x is replicated to nodes A, B, and C. C goes
down. A and B delete x. When C comes back, C should contact other nodes
that hold hinted handoff data intended for C. So, in theory, the missing
deletion of x will be propagated to C at some point and not lost. However,
the problem is that those hinted handoff nodes can die before the handoff
completes. Then C needs some other way to sync itself up. Nodes A and B are
the only possible sources. Unfortunately, data in A and B is accumulated
independently from C, and therefore syncing them up is a bit challenging.

In the short run, I am not sure if I really like the solution you suggested
here. However, I don't have a better solution either.

In the long run, maybe we should look into peer-to-peer replication
techniques, instead of relying on hinted handoff. In P2P replication, an
update can be directed to any replica, which will try to push it to its
peers. The push will be almost real time if the peers are up. If a peer is
down, changes for it will be accumulated and re-pushed when it's up again.
Because an update is always initiated from one replica, it's easier to sync
up the replicas through log shipping. The benefit of this approach is that
(1) easier reasoning about data synchronization/consistency (still have to
be careful about deletes though); (2) potentially less overhead since we
don't have to do read repairs all the time (anybody know how much overhead
it introduces?); (3) almost the same availability on writes as Cassandra
today.
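
To make the idea a bit more concrete, here is a rough sketch of what
per-replica push with a pending-change queue could look like; all of the
class and method names below are made up for illustration and are not
Cassandra code:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Rough sketch only: the replica that receives a write logs it locally,
// pushes it to its peers, and queues changes for any peer that is down so
// they can be re-pushed (log-shipped) once that peer comes back.
class PushReplicatorSketch {
    private final Map<String, Queue<byte[]>> pendingByPeer = new HashMap<>();

    void onWrite(byte[] mutation, Iterable<String> peers) {
        appendToLocalLog(mutation);                   // durable local write first
        for (String peer : peers) {
            if (isUp(peer)) {
                push(peer, mutation);                 // near-real-time push
            } else {
                pendingByPeer
                    .computeIfAbsent(peer, p -> new ArrayDeque<>())
                    .add(mutation);                   // accumulate for later
            }
        }
    }

    // Called when a peer is detected as up again: ship the accumulated changes.
    void onPeerUp(String peer) {
        Queue<byte[]> pending = pendingByPeer.remove(peer);
        if (pending == null) return;
        for (byte[] mutation : pending) {
            push(peer, mutation);
        }
    }

    private void appendToLocalLog(byte[] mutation) { /* write-ahead log append */ }
    private boolean isUp(String peer) { return true; /* failure detector stub */ }
    private void push(String peer, byte[] mutation) { /* network send */ }
}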

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com



                                                                           
Jonathan Ellis <jbellis@gmail.com> wrote on 03/30/2009 03:19 PM:
To: cassandra-dev@incubator.apache.org
Subject: handling deletes
Please respond to: cassandra-dev@incubator.apache.org





Avinash pointed out two bugs in my remove code.  One is easy to fix,
the other is tougher.

The easy one is that my code removes tombstones (deletion markers) at
the ColumnFamilyStore level, so when CassandraServer does read repair
it will not know about the tombstones and they will not be replicated
correctly.  This can be fixed by simply moving the removeDeleted call
up to just before CassandraServer's final return-to-client.
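
In rough Java terms, the intended ordering is something like the following;
the class and method names other than removeDeleted are only illustrative,
not the actual CassandraServer code:

// Illustrative ordering only, not the real read path: read repair must see
// tombstones, so deleted columns are stripped only at the very end, just
// before the result is returned to the client.
class ReadPathSketch {
    ColumnFamily getColumnFamily(String key, String columnFamilyName) {
        ColumnFamily resolved = readAndResolve(key, columnFamilyName); // tombstones intact
        scheduleReadRepair(key, resolved);      // repair messages carry the tombstones
        return removeDeleted(resolved);         // the client never sees tombstones
    }

    private ColumnFamily readAndResolve(String key, String cfName) { return new ColumnFamily(); }
    private void scheduleReadRepair(String key, ColumnFamily cf) { }
    private ColumnFamily removeDeleted(ColumnFamily cf) { return cf; }

    static class ColumnFamily { }
}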

The hard one is that tombstones are problematic on GC (that is, major
compaction of SSTables, to use the Bigtable paper terminology).

One failure scenario: Node A, B, and C replicate some data.  C goes
down.  The data is deleted.  A and B delete it and later GC it.  C
comes back up.  C now has the only copy of the data so on read repair
the stale data will be sent to A and B.

A solution: pick a number N such that we are confident that no node
will be down (and catch up on hinted handoffs) for longer than N days.
 (Default value: 10?)  Then, no node may GC tombstones before N days
have elapsed.  Also, after N days, tombstones will no longer be read
repaired.  (This prevents a node which has not yet GC'd from sending a
new tombstone copy to a node that has already GC'd.)
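
In rough Java terms, the rule would look something like this (names and
constants are illustrative only):

import java.util.concurrent.TimeUnit;

// Sketch of the proposed rule: a tombstone may be purged at major
// compaction only after the grace period N has elapsed, and it is only
// read repaired while it is still younger than N.
class TombstoneGcSketch {
    static final int GRACE_DAYS = 10;                        // the proposed default N
    static final long GRACE_SECONDS = TimeUnit.DAYS.toSeconds(GRACE_DAYS);

    static boolean canPurgeAtCompaction(int tombstoneTimeSeconds) {
        return nowSeconds() - tombstoneTimeSeconds > GRACE_SECONDS;
    }

    static boolean shouldReadRepair(int tombstoneTimeSeconds) {
        // after N days some replica may already have purged its copy, so we
        // stop propagating the tombstone rather than resurrect it there
        return nowSeconds() - tombstoneTimeSeconds <= GRACE_SECONDS;
    }

    private static long nowSeconds() {
        return System.currentTimeMillis() / 1000;            // system clock, not client timestamp
    }
}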

Implementation detail: we'll need to add a 32-bit "time of tombstone"
to ColumnFamily and SuperColumn.  (For Column we can stick it in the
byte[] value, since we already have an unambiguous way to know if the
Column is in a deleted state.)  We only need 32 bits since the time
frame here is sufficiently granular that we don't need ms.  Also, we
will use the system clock for these values, not the client timestamp,
since we don't know what the source of the client timestamps is.
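
Concretely, the new field would amount to something like this (a purely
illustrative class, not the real ColumnFamily or SuperColumn):

// Purely illustrative, not the real ColumnFamily/SuperColumn classes: a
// 32-bit, seconds-resolution tombstone time taken from the server's clock
// at the moment the delete is applied.
class TombstoneTimeSketch {
    final int localDeletionTime;

    TombstoneTimeSketch() {
        // seconds fit comfortably in 32 bits; millisecond precision is not
        // needed for a grace period measured in days
        this.localDeletionTime = (int) (System.currentTimeMillis() / 1000);
    }
}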

Admittedly this is suboptimal compared to being able to GC immediately
but it has the virtue of being (a) easily implemented, (b) with no
extra components such as a coordination protocol, and (c) better than
not GCing tombstones at all (the other easy way to ensure
correctness).

Thoughts?

-Jonathan

Re: handling deletes

Posted by Zhu Han <sc...@gmail.com>.
Yes, most services have an SLA. If the possibility of such an extreme case
can be covered by the SLA, it's OK if we don't provide any mechanism for it.

best regards,
hanzhu


On Thu, Apr 2, 2009 at 3:37 AM, Avinash Lakshman <avinash.lakshman@gmail.com
> wrote:

> In reality from 2 years of production experience in Dynamo and here with
> Cassandra it is not as extreme as it seems :). Options are either strong
> consistency which is hard to get right in a distributed setting. If you do
> get it right then there is availability problem. All tools like read-repair
> etc help in achieving eventual consistency. So I guess it boils down to
> what
> you want from your app C or A.
>
> Avinash
>
> On Wed, Apr 1, 2009 at 11:46 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
>
> >
> > My reply is inlined below.
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
> >
> >
> > Jonathan Ellis <jb...@gmail.com> wrote on 04/01/2009 10:50:37 AM:
> >
> > >
> > > On Wed, Apr 1, 2009 at 11:41 AM, Jun Rao <ju...@almaden.ibm.com>
> wrote:
> > > > I am wondering if this is part of the bigger issue on data
> consistency.
> > > >
> > > > Following your example: a row x is replicated to node A, B, and C. C
> > goes
> > > > down. A and B delete x. When C comes back, C should contact other
> nodes
> > that
> > > > hold hinted handoff data intended for C. So, in theory, the missing
> > deletion
> > > > of x will be propagated to C at some point and not lost. However, the
> > > > problem is that those hinted handoff nodes can die before the handoff
> > > > completes. Then C need some other way to sync itself up. Node A and B
> > are
> > > > the only possible sources. Unfortunately, data in A and B are
> > accumulated
> > > > independently from C, and therefore syncing them up is a bit
> > challenging.
> > >
> > > Right.  Or you could have a network partition when C comes back up
> > > preventing the handoff.  There's lots of things that can go wrong.
> > > Hence the "eventual" part of "eventually consistent." :)
> > >
> > > > In the short run, I am not sure if I really like the solution you
> > suggested
> > > > here. However, I don't have a better solution either.
> > >
> > > Like I said; it's not perfect, but it's better than the alternatives
> > > I've seen.  I'd much rather have an imperfect solution than none at
> > > all.
> > >
> > > > In the long run, maybe we should look into peer-to-peer replication
> > > > techniques, instead of relying on hinted handoff. In P2P replication,
> > an
> > > > update can be directed to any replica, which will try to push it to
> its
> > > > peers. The push will be almost real time if the peers are up. If a
> peer
> > is
> > > > down, changes for it will be accumulated and re-pushed when it's up
> > again.
> > > > Because an update is always initiated from one replica, it's easier
> to
> > sync
> > > > up the replicas through log shipping.
> > >
> > > There's a huge amount of complexity you're glossing over, though: what
> > > if the replica responsible for the initiation goes down?  Then you
> > > have to elect a new one.  This is (a) very complicated and (b) causes
> > > loss of availability.  I prefer the existing system.  (If you want
> > > consistency over availability then hbase or hypertable is a better
> > > choice since that is what they design for.)
> >
> > P2P replication definitely adds complexity and it is just one of the
> > alternatives. However, there is also complexity in hinted handoff + read
> > repair + merkle tree (when it's added). Not sure which one is more
> > complicated. In P2P replication, since you can initiate a write on any
> > replica, you just need to pick a live replica for writes. As for
> > availability, a lot have to do with how quickly a failed node is
> detected.
> > Today, if you write to a node that's actually failed, but not yet
> detected
> > by Cassandra, the write will also fail.
> >
> > Overall, I think eventual consistency is fine. However, eventual
> > consistency probably shouldn't be equated to updates taking forever to
> show
> > up. Some sort of guarantee on how outdated a piece of data is will likely
> > be useful to many applications.
> >
> > >
> > > -Jonathan
>

Re: handling deletes

Posted by Avinash Lakshman <av...@gmail.com>.
In reality, from 2 years of production experience with Dynamo and here with
Cassandra, it is not as extreme as it seems :). One option is strong
consistency, which is hard to get right in a distributed setting; and if
you do get it right, then there is an availability problem. Tools like
read repair etc. help in achieving eventual consistency. So I guess it
boils down to what you want from your app: C or A.

Avinash

On Wed, Apr 1, 2009 at 11:46 AM, Jun Rao <ju...@almaden.ibm.com> wrote:

>
> My reply is inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
>
>
> Jonathan Ellis <jb...@gmail.com> wrote on 04/01/2009 10:50:37 AM:
>
> >
> > On Wed, Apr 1, 2009 at 11:41 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
> > > I am wondering if this is part of the bigger issue on data consistency.
> > >
> > > Following your example: a row x is replicated to node A, B, and C. C
> goes
> > > down. A and B delete x. When C comes back, C should contact other nodes
> that
> > > hold hinted handoff data intended for C. So, in theory, the missing
> deletion
> > > of x will be propagated to C at some point and not lost. However, the
> > > problem is that those hinted handoff nodes can die before the handoff
> > > completes. Then C need some other way to sync itself up. Node A and B
> are
> > > the only possible sources. Unfortunately, data in A and B are
> accumulated
> > > independently from C, and therefore syncing them up is a bit
> challenging.
> >
> > Right.  Or you could have a network partition when C comes back up
> > preventing the handoff.  There's lots of things that can go wrong.
> > Hence the "eventual" part of "eventually consistent." :)
> >
> > > In the short run, I am not sure if I really like the solution you
> suggested
> > > here. However, I don't have a better solution either.
> >
> > Like I said; it's not perfect, but it's better than the alternatives
> > I've seen.  I'd much rather have an imperfect solution than none at
> > all.
> >
> > > In the long run, maybe we should look into peer-to-peer replication
> > > techniques, instead of relying on hinted handoff. In P2P replication,
> an
> > > update can be directed to any replica, which will try to push it to its
> > > peers. The push will be almost real time if the peers are up. If a peer
> is
> > > down, changes for it will be accumulated and re-pushed when it's up
> again.
> > > Because an update is always initiated from one replica, it's easier to
> sync
> > > up the replicas through log shipping.
> >
> > There's a huge amount of complexity you're glossing over, though: what
> > if the replica responsible for the initiation goes down?  Then you
> > have to elect a new one.  This is (a) very complicated and (b) causes
> > loss of availability.  I prefer the existing system.  (If you want
> > consistency over availability then hbase or hypertable is a better
> > choice since that is what they design for.)
>
> P2P replication definitely adds complexity and it is just one of the
> alternatives. However, there is also complexity in hinted handoff + read
> repair + merkle tree (when it's added). Not sure which one is more
> complicated. In P2P replication, since you can initiate a write on any
> replica, you just need to pick a live replica for writes. As for
> availability, a lot have to do with how quickly a failed node is detected.
> Today, if you write to a node that's actually failed, but not yet detected
> by Cassandra, the write will also fail.
>
> Overall, I think eventual consistency is fine. However, eventual
> consistency probably shouldn't be equated to updates taking forever to show
> up. Some sort of guarantee on how outdated a piece of data is will likely
> be useful to many applications.
>
> >
> > -Jonathan

Re: handling deletes

Posted by Jun Rao <ju...@almaden.ibm.com>.
My reply is inlined below.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com


Jonathan Ellis <jb...@gmail.com> wrote on 04/01/2009 10:50:37 AM:

>
> On Wed, Apr 1, 2009 at 11:41 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
> > I am wondering if this is part of the bigger issue on data consistency.
> >
> > Following your example: a row x is replicated to node A, B, and C. C
goes
> > down. A and B delete x. When C comes back, C should contact other nodes
that
> > hold hinted handoff data intended for C. So, in theory, the missing
deletion
> > of x will be propagated to C at some point and not lost. However, the
> > problem is that those hinted handoff nodes can die before the handoff
> > completes. Then C need some other way to sync itself up. Node A and B
are
> > the only possible sources. Unfortunately, data in A and B are
accumulated
> > independently from C, and therefore syncing them up is a bit
challenging.
>
> Right.  Or you could have a network partition when C comes back up
> preventing the handoff.  There's lots of things that can go wrong.
> Hence the "eventual" part of "eventually consistent." :)
>
> > In the short run, I am not sure if I really like the solution you
suggested
> > here. However, I don't have a better solution either.
>
> Like I said; it's not perfect, but it's better than the alternatives
> I've seen.  I'd much rather have an imperfect solution than none at
> all.
>
> > In the long run, maybe we should look into peer-to-peer replication
> > techniques, instead of relying on hinted handoff. In P2P replication,
an
> > update can be directed to any replica, which will try to push it to its
> > peers. The push will be almost real time if the peers are up. If a peer
is
> > down, changes for it will be accumulated and re-pushed when it's up
again.
> > Because an update is always initiated from one replica, it's easier to
sync
> > up the replicas through log shipping.
>
> There's a huge amount of complexity you're glossing over, though: what
> if the replica responsible for the initiation goes down?  Then you
> have to elect a new one.  This is (a) very complicated and (b) causes
> loss of availability.  I prefer the existing system.  (If you want
> consistency over availability then hbase or hypertable is a better
> choice since that is what they design for.)

P2P replication definitely adds complexity and it is just one of the
alternatives. However, there is also complexity in hinted handoff + read
repair + Merkle trees (when they are added). Not sure which one is more
complicated. In P2P replication, since you can initiate a write on any
replica, you just need to pick a live replica for writes. As for
availability, a lot has to do with how quickly a failed node is detected.
Today, if you write to a node that has actually failed, but has not yet
been detected by Cassandra, the write will also fail.

Overall, I think eventual consistency is fine. However, eventual
consistency probably shouldn't be equated with updates taking forever to
show up. Some sort of guarantee on how outdated a piece of data can be will
likely be useful to many applications.

>
> -Jonathan

Re: handling deletes

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Apr 1, 2009 at 11:41 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
> I am wondering if this is part of the bigger issue on data consistency.
>
> Following your example: a row x is replicated to node A, B, and C. C goes
> down. A and B delete x. When C comes back, C should contact other nodes that
> hold hinted handoff data intended for C. So, in theory, the missing deletion
> of x will be propagated to C at some point and not lost. However, the
> problem is that those hinted handoff nodes can die before the handoff
> completes. Then C need some other way to sync itself up. Node A and B are
> the only possible sources. Unfortunately, data in A and B are accumulated
> independently from C, and therefore syncing them up is a bit challenging.

Right.  Or you could have a network partition when C comes back up
preventing the handoff.  There's lots of things that can go wrong.
Hence the "eventual" part of "eventually consistent." :)

> In the short run, I am not sure if I really like the solution you suggested
> here. However, I don't have a better solution either.

Like I said; it's not perfect, but it's better than the alternatives
I've seen.  I'd much rather have an imperfect solution than none at
all.

> In the long run, maybe we should look into peer-to-peer replication
> techniques, instead of relying on hinted handoff. In P2P replication, an
> update can be directed to any replica, which will try to push it to its
> peers. The push will be almost real time if the peers are up. If a peer is
> down, changes for it will be accumulated and re-pushed when it's up again.
> Because an update is always initiated from one replica, it's easier to sync
> up the replicas through log shipping.

There's a huge amount of complexity you're glossing over, though: what
if the replica responsible for the initiation goes down?  Then you
have to elect a new one.  This is (a) very complicated and (b) causes
loss of availability.  I prefer the existing system.  (If you want
consistency over availability then hbase or hypertable is a better
choice since that is what they design for.)

-Jonathan

Re: handling deletes

Posted by Zhu Han <sc...@gmail.com>.
Jun,

What you mentioned is really a good point. There are two assumptions we
have to state before discussing the problem.

1) We suppose most failures/crashes are transient. Even if a server
crashes, it can be brought back up without losing any persistent data on
disk. That's why Cassandra uses a write-ahead-log-style database to host
all persistent data on the local disk. So hinted handoff is good enough for
most problems.

2) If the hinted handoff nodes crash permanently, we need some mechanism to
sync up with the other replicas of the same range, just as you pointed out.

As the possibility of 2) is very small, we may consider some type of manual
effort, or a not-so-elegant but simple solution. For example, an
application could start a read-repair process to sync the data on different
nodes, or a node could take a full checkpoint and then start a
synchronization by comparing checkpoints.

If you want a smarter solution, building a hash tree
<http://en.wikipedia.org/wiki/Hash_tree> over each range and synchronizing
over the hash tree works really well. Since the possibility of 2) is very
rare, you can start that process from the console manually.
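
A very small sketch of the idea (illustrative only, not a production Merkle
tree implementation): each replica hashes the rows of a range into a tree,
the replicas exchange hashes from the root down, and they only recurse into
subtrees whose hashes differ, so just the divergent rows get shipped.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Minimal hash tree sketch: leaves are row hashes, inner nodes hash the
// concatenation of their children. Two replicas compare roots; if they
// differ, they compare the two halves, and so on down to the rows.
class HashTreeSketch {
    static byte[] hashOfRange(List<String> rowsInRange) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        if (rowsInRange.isEmpty()) {
            return md.digest();                               // hash of an empty range
        }
        if (rowsInRange.size() == 1) {
            return md.digest(rowsInRange.get(0).getBytes(StandardCharsets.UTF_8));
        }
        int mid = rowsInRange.size() / 2;
        md.update(hashOfRange(rowsInRange.subList(0, mid)));                  // left subtree
        md.update(hashOfRange(rowsInRange.subList(mid, rowsInRange.size()))); // right subtree
        return md.digest();
    }
}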

Shipping the log periodically is expensive if the write load on the nodes
is very high.


best regards,
hanzhu


On Thu, Apr 2, 2009 at 1:41 AM, Jun Rao <ju...@almaden.ibm.com> wrote:

> I am wondering if this is part of the bigger issue on data consistency.
>
> Following your example: a row x is replicated to node A, B, and C. C goes
> down. A and B delete x. When C comes back, C should contact other nodes that
> hold hinted handoff data intended for C. So, in theory, the missing deletion
> of x will be propagated to C at some point and not lost. However, the
> problem is that those hinted handoff nodes can die before the handoff
> completes. Then C need some other way to sync itself up. Node A and B are
> the only possible sources. Unfortunately, data in A and B are accumulated
> independently from C, and therefore syncing them up is a bit challenging.
>
> In the short run, I am not sure if I really like the solution you suggested
> here. However, I don't have a better solution either.
>
> In the long run, maybe we should look into peer-to-peer replication
> techniques, instead of relying on hinted handoff. In P2P replication, an
> update can be directed to any replica, which will try to push it to its
> peers. The push will be almost real time if the peers are up. If a peer is
> down, changes for it will be accumulated and re-pushed when it's up again.
> Because an update is always initiated from one replica, it's easier to sync
> up the replicas through log shipping. The benefit of this approach is that
> (1) easier reasoning about data synchronization/consistency (still have to
> be careful about deletes though); (2) potentially less overhead since we
> don't have to do read repairs all the time (anybody know how much overhead
> it introduces?); (3) almost the same availability on writes as Cassandra
> today.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> junrao@almaden.ibm.com
>
>
> Jonathan Ellis <jb...@gmail.com> wrote on 03/30/2009 03:19 PM:
> To: cassandra-dev@incubator.apache.org
> Subject: handling deletes
> Please respond to: cassandra-dev@incubator.apache.org
>
>
> Avinash pointed out two bugs in my remove code.  One is easy to fix,
> the other is tougher.
>
> The easy one is that my code removes tombstones (deletion markers) at
> the ColumnFamilyStore level, so when CassandraServer does read repair
> it will not know about the tombstones and they will not be replicated
> correctly.  This can be fixed by simply moving the removeDeleted call
> up to just before CassandraServer's final return-to-client.
>
> The hard one is that tombstones are problematic on GC (that is, major
> compaction of SSTables, to use the Bigtable paper terminology).
>
> One failure scenario: Node A, B, and C replicate some data.  C goes
> down.  The data is deleted.  A and B delete it and later GC it.  C
> comes back up.  C now has the only copy of the data so on read repair
> the stale data will be sent to A and B.
>
> A solution: pick a number N such that we are confident that no node
> will be down (and catch up on hinted handoffs) for longer than N days.
> (Default value: 10?)  Then, no node may GC tombstones before N days
> have elapsed.  Also, after N days, tombstones will no longer be read
> repaired.  (This prevents a node which has not yet GC'd from sending a
> new tombstone copy to a node that has already GC'd.)
>
> Implementation detail: we'll need to add a 32-bit "time of tombstone"
> to ColumnFamily and SuperColumn.  (For Column we can stick it in the
> byte[] value, since we already have an unambiguous way to know if the
> Column is in a deleted state.)  We only need 32 bits since the time
> frame here is sufficiently granular that we don't need ms.  Also, we
> will use the system clock for these values, not the client timestamp,
> since we don't know what the source of the client timestamps is.
>
> Admittedly this is suboptimal compared to being able to GC immediately
> but it has the virtue of being (a) easily implemented, (b) with no
> extra components such as a coordination protocol, and (c) better than
> not GCing tombstones at all (the other easy way to ensure
> correctness).
>
> Thoughts?
>
> -Jonathan
>
>