Posted to user@cassandra.apache.org by AJ <aj...@dude.podzone.net> on 2011/07/02 03:53:11 UTC

Strong Consistency with ONE read/writes

Is this possible?

All reads and writes for a given key will always go to the same node 
from a client.  It seems the only thing needed is to allow the clients 
to compute which node is the closest replica for the given key, using the 
same algorithm C* uses.  When the first replica receives the write 
request, it will write to itself, which should complete before any of the 
other replicas, and then return.  The load should still stay balanced if 
using the random partitioner.  If the first replica becomes unavailable 
(however that is defined), then the clients can send to the next replica 
in the ring and switch from ONE reads/writes to QUORUM reads/writes 
temporarily until the first replica becomes available again.  QUORUM is 
required since some replicas may not have been updated after the first 
replica went down.

Will this work?  The goal is to have strong consistency with as low a 
read/write consistency level as possible, with a network performance 
boost as a secondary benefit.
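
For illustration, a minimal sketch of the client-side routing being proposed
(assumptions: RandomPartitioner, i.e. MD5 tokens, and a client that keeps its
own copy of the ring's token->node map; StickyRouter and its methods are
hypothetical, not any existing client API):

# Sketch of the client-side routing idea above (not an actual C* client API).
import hashlib
from bisect import bisect_right

class StickyRouter:
    def __init__(self, ring):                 # ring: list of (token, node) pairs
        self.ring = sorted(ring)

    def token(self, key):
        # RandomPartitioner-style token: MD5 of the key as a big integer
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, key, rf=3):
        """First replica owns the token; the others follow it around the ring."""
        i = bisect_right(self.ring, (self.token(key),)) % len(self.ring)
        return [self.ring[(i + j) % len(self.ring)][1] for j in range(rf)]

    def route(self, key, is_up):
        """Return (node, consistency_level) per the scheme described above."""
        first, *rest = self.replicas(key)
        if is_up(first):
            return first, "ONE"               # normal case: pin to the first replica
        for node in rest:                     # first replica down: fail over and
            if is_up(node):                   # raise the CL until it recovers
                return node, "QUORUM"
        raise RuntimeError("no replica available for key %r" % key)

# e.g. router = StickyRouter([(t1, "10.0.0.1"), (t2, "10.0.0.2"), (t3, "10.0.0.3")])
#      node, cl = router.route("some_key", is_up=lambda n: True)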

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
On 7/2/2011 6:03 AM, William Oberman wrote:
> Ok, I see the "you happen to choose the 'right' node" idea, but it 
> sounds like you want to solve "C* problems" in the client, and they 
> already wrote that complicated code to make clients simple.   You're 
> talking about reimplementing key<->node mappings, network topology 
> (with failures), etc...  Plus, if they change something about 
> replication and you get too tricky, your code breaks.  Or, if they 
> optimize something, you might not benefit.
>

I'm only asking whether this is possible working within the current design 
and architecture, and if not, why.  I'm not interested in a hack; 
I'm just exploring possibilities.


Re: Strong Consistency with ONE read/writes

Posted by William Oberman <ob...@civicscience.com>.
Ok, I see the "you happen to choose the 'right' node" idea, but it sounds
like you want to solve "C* problems" in the client, and they already wrote
that complicated code to make clients simple.   You're talking about
reimplementing key<->node mappings, network topology (with failures), etc...
 Plus, if they change something about replication and you get too tricky,
your code breaks.  Or, if they optimize something, you might not benefit.

On Jul 1, 2011, at 10:33 PM, AJ <aj...@dude.podzone.net> wrote:

I'm saying I will make my clients forward the C* requests to the first
replica instead of forwarding to a random node.
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Will Oberman <ob...@civicscience.com> wrote:
>
>
>
> Sent from my iPhone
>
> On Jul 1, 2011, at 9:53 PM, AJ <aj...@dude.podzone.net> wrote:
>
> > Is this possible?
> >
> > All reads and writes for a given key will always go to the same node
> > from a client.
>
> I don't think that's true. Given a key K, the client will write to N
> nodes (N=replication factor). And at consistency level ONE the client
> will return after 1 "ack" (from the N writes).
>
>

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
I'm saying I will make my clients forward the C* requests to the first replica instead of forwarding to a random node.
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Will Oberman <ob...@civicscience.com> wrote:



Sent from my iPhone

On Jul 1, 2011, at 9:53 PM, AJ <aj...@dude.podzone.net> wrote:

> Is this possible?
>
> All reads and writes for a given key will always go to the same node 
> from a client.

I don't think that's true. Given a key K, the client will write to N 
nodes (N=replication factor). And at consistency level ONE the client 
will return after 1 "ack" (from the N writes). 


Re: Strong Consistency with ONE read/writes

Posted by Will Oberman <ob...@civicscience.com>.

Sent from my iPhone

On Jul 1, 2011, at 9:53 PM, AJ <aj...@dude.podzone.net> wrote:

> Is this possible?
>
> All reads and writes for a given key will always go to the same node  
> from a client.

I don't think that's true. Given a key K, the client will write to N  
nodes (N=replication factor). And at consistency level ONE the client  
will return after 1 "ack" (from the N writes). 
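
A toy model of the coordinator behavior described above (a simplification for
illustration, not Cassandra's actual write path): the mutation is sent to all
N replicas, and the consistency level only controls how many acks the
coordinator waits for before unblocking the client.

# Toy model of the coordinator write path (illustration only, not real C* code).
import threading

class Replica:
    """Stand-in for a storage node: applies a mutation to its local copy."""
    def __init__(self, name):
        self.name, self.data = name, {}
    def apply(self, mutation):
        key, col, val = mutation
        self.data.setdefault(key, {})[col] = val

def coordinator_write(replicas, mutation, required_acks=1):
    """Send to every replica; unblock after `required_acks` acks (ONE -> 1)."""
    acks = threading.Semaphore(0)
    def send(r):
        r.apply(mutation)            # in reality an async message that may fail
        acks.release()
    for r in replicas:               # the write is always sent to all N replicas...
        threading.Thread(target=send, args=(r,), daemon=True).start()
    for _ in range(required_acks):   # ...CL only controls how many acks we wait for
        acks.acquire()               # ONE -> 1 ack, QUORUM -> N//2 + 1 acks

# e.g. coordinator_write([Replica("A"), Replica("B"), Replica("C")],
#                        ("key1", "col", "val"), required_acks=1)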
  

Re: Strong Consistency with ONE read/writes

Posted by Edward Capriolo <ed...@gmail.com>.
On Sat, Jul 2, 2011 at 3:57 PM, Yang <te...@gmail.com> wrote:

>
> Jonathan:
>
> could you please elaborate more on specifically why they are "not even
> close"?
>  --- I kind of see what you mean (please correct me if I misunderstood):
> Cassandra failure detector
> is consulted on every write; while HBase failure detector is only used when
> the tablet server joins or leaves.
>
>  in order to have the single write entry point approach originally brought
> up in this thread,
> I think you need a strong membership protocol to lock on the key range
>  leadership, once leadership is acquired,
> failure detectors do not need to be consulted on every write.
>
> yes by definition of the original requirement brought up in this thread,
> Cassandra's write behavior is going to be changed, to be more like Hbase,
> and mongo in "replica set" mode. but
> it seems that this leader mode can even co-exist with the multi-entry write
> mode that Cassandra uses now, just as
> you can use different CL for each single write request.  in that case you
> would need to keep both the current lightweight Phi-detector
> and add the ZK for leader election for single-entry mode write.
>
> Thanks
> Yang
>
>
> (I should correct my terminology .... it's not a "strong failure detector"
> that's needed, it's a "strong membership protocol". strongly complete and
> accurate failure detectors do not exist in
> async distributed systems (Tushar Chandra  " Unreliable Failure Detectors
> for Reliable Distributed Systems, Journal of the ACM, 43(2):225-267, 1996<http://doi.acm.org/10.1145/226643.226647>"
>  and FLP "Impossibility of  Distributed Consensus with One Faulty Process<http://www.podc.org/influential/2001.html>"
> )  )
>
>
> On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> The way HBase uses ZK (for master election) is not even close to how
>> Cassandra uses the failure detector.
>>
>> Using ZK for each operation would (a) not scale and (b) not work
>> cross-DC for any reasonable latency requirements.
>>
>> On Sat, Jul 2, 2011 at 11:55 AM, Yang <te...@gmail.com> wrote:
>> > there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>> snitch,
>> > so this does roughly what you want MOST of the time
>> >
>> > but the problem is that it does not GUARANTEE that the same node will
>> always
>> > be read.  I recently read into the HBase vs Cassandra comparison thread
>> that
>> > started after Facebook dropped Cassandra for their messaging system, and
>> > understood some of the differences. what you want is essentially what
>> HBase
>> > does. the fundamental difference there is really due to the gossip
>> protocol:
>> > it's a probablistic, or eventually consistent failure detector  while
>> > HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure
>> > detector (a distributed lock).  so in HBase, if a tablet server goes
>> down,
>> > it really goes down, it can not re-grab the tablet from the new tablet
>> > server without going through a start up protocol (notifying the master,
>> > which would notify the clients etc),  in other words it is guaranteed
>> that
>> > one tablet is served by only one tablet server at any given time.  in
>> > comparison the above JIRA only TRYIES to serve that key from one
>> particular
>> > replica. HBase can have that guarantee because the group membership is
>> > maintained by the strong failure detector.
>> > just for hacking curiosity, a strong failure detector + Cassandra
>> replicas
>> > is not impossible (actually seems not difficult), although the
>> performance
>> > is not clear. what would such a strong failure detector bring to
>> Cassandra
>> > besides this ONE-ONE strong consistency ? that is an interesting
>> question I
>> > think.
>> > considering that HBase has been deployed on big clusters, it is probably
>> OK
>> > with the performance of the strong  Zookeeper failure detector. then a
>> > further question was: why did Dynamo originally choose to use the
>> > probablistic failure detector? yes Dynamo's main theme is "eventually
>> > consistent", so the Phi-detector is **enough**, but if a strong detector
>> > buys us more with little cost, wouldn't that  be great?
>> >
>> >
>> > On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj...@dude.podzone.net> wrote:
>> >>
>> >> Is this possible?
>> >>
>> >> All reads and writes for a given key will always go to the same node
>> from
>> >> a client.  It seems the only thing needed is to allow the clients to
>> compute
>> >> which node is the closes replica for the given key using the same
>> algorithm
>> >> C* uses.  When the first replica receives the write request, it will
>> write
>> >> to itself which should complete before any of the other replicas and
>> then
>> >> return.  The loads should still stay balanced if using random
>> partitioner.
>> >>  If the first replica becomes unavailable (however that is defined),
>> then
>> >> the clients can send to the next repilca in the ring and switch from
>> ONE
>> >> write/reads to QUORUM write/reads temporarily until the first replica
>> >> becomes available again.  QUORUM is required since there could be some
>> >> replicas that were not updated after the first replica went down.
>> >>
>> >> Will this work?  The goal is to have strong consistency with a
>> read/write
>> >> consistency level as low as possible while secondarily a network
>> performance
>> >> boost.
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>
They are not close.

HBase uses ZooKeeper and its master servers as a single source of truth.
When the node responsible for a region is down, that region can not be read
or written until the master elects a new node to be responsible for it.
Also, all the storage is one backend, namely HDFS.

With Cassandra, each node has a failure detector and its own view of the
network; this is why Cassandra is peer to peer, and why the failure detector
is consulted on every write/read: each node's view of the network is always
converging.  Each node also has its own storage.  This is why you can work at
ONE and QUORUM and clients are unaffected by single-node failures.
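
To make "the failure detector is consulted on every write/read" concrete:
Cassandra's gossip-based detector is a phi-accrual detector, which turns
heartbeat arrival times into a continuous suspicion value rather than a binary
up/down.  A toy sketch of that idea (assuming exponentially distributed
heartbeat intervals; an illustration only, not Cassandra's actual
implementation):

# Toy phi-accrual failure detector (illustration only).
# phi = -log10(P(no heartbeat yet after waiting this long)), assuming
# exponentially distributed heartbeat intervals with the observed mean.
import math, time

class PhiDetector:
    def __init__(self, threshold=8.0):
        self.threshold = threshold
        self.last = {}        # node -> time of last heartbeat
        self.mean = {}        # node -> smoothed mean inter-arrival time

    def heartbeat(self, node, now=None):
        now = time.time() if now is None else now
        if node in self.last:
            interval = now - self.last[node]
            self.mean[node] = 0.9 * self.mean.get(node, interval) + 0.1 * interval
        self.last[node] = now

    def phi(self, node, now=None):
        now = time.time() if now is None else now
        if node not in self.mean:
            return 0.0
        elapsed = now - self.last[node]
        return elapsed / (max(self.mean[node], 1e-9) * math.log(10))

    def is_alive(self, node):
        # Consulted before every read/write: suspicion below threshold => treat as up.
        return self.phi(node) < self.threshold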

If you want Cassandra to work like HBase you can...
1) buy a SAN
2) make one LUN for each node
3) use replication factor=1
4) run linux-ha on each node
5) when a Cassandra node fails, linux-ha can detect the failure and take over
that node's data and Cassandra instance :)

:)

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
Yang, you seem to understand all of the details, at least the details 
that have occurred to me, such as having a failure protocol rather than 
a perfect failure detector and new leader coordination.

I finally did some more reading outside of the Cassandra space and realized 
HBase has what I was asking about.  If Cass could be flexible enough to 
allow such a setup without violating its goals, that would be great, imho.

This thread is just a brainstorming, exploratory thread (by a non-expert) 
based on a simplistic observation: if all clients went directly to 
the responsible replica every time, then you could gain:

- guaranteed monotonic reads/writes consistency
- read-your-writes consistency
- higher performance (lower latency)

all with only a read/write consistency level of ONE.

Basically, it's like a master/slave setup, except that the slaves can 
take over as master, so you still have high availability.

I'm not saying it's easy and I'm only coming at this from a customer 
request point of view.  The question is, would this be useful if it 
could be added to Cass's bag of tricks?  Cass is already a hybrid.

aj

On 7/2/2011 1:57 PM, Yang wrote:
>
> Jonathan:
>
> could you please elaborate more on specifically why they are "not even 
> close"?
>  --- I kind of see what you mean (please correct me if I 
> misunderstood): Cassandra failure detector
> is consulted on every write; while HBase failure detector is only used 
> when the tablet server joins or leaves.
>
>  in order to have the single write entry point approach originally 
> brought up in this thread,
> I think you need a strong membership protocol to lock on the key range 
>  leadership, once leadership is acquired,
> failure detectors do not need to be consulted on every write.
>
> yes by definition of the original requirement brought up in this thread,
> Cassandra's write behavior is going to be changed, to be more like 
> Hbase, and mongo in "replica set" mode. but
> it seems that this leader mode can even co-exist with the multi-entry 
> write mode that Cassandra uses now, just as
> you can use different CL for each single write request.  in that case 
> you would need to keep both the current lightweight Phi-detector
> and add the ZK for leader election for single-entry mode write.
>
> Thanks
> Yang
>
>
> (I should correct my terminology .... it's not a "strong failure 
> detector" that's needed, it's a "strong membership protocol". strongly 
> complete and accurate failure detectors do not exist in
> async distributed systems (Tushar Chandra  "Unreliable Failure 
> Detectors for Reliable Distributed Systems, Journal of the ACM, 
> 43(2):225-267, 1996 <http://doi.acm.org/10.1145/226643.226647>"  and 
> FLP "Impossibility of  Distributed Consensus with One Faulty Process 
> <http://www.podc.org/influential/2001.html>" )  )
>
>
> On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis <jbellis@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     The way HBase uses ZK (for master election) is not even close to how
>     Cassandra uses the failure detector.
>
>     Using ZK for each operation would (a) not scale and (b) not work
>     cross-DC for any reasonable latency requirements.
>
>     On Sat, Jul 2, 2011 at 11:55 AM, Yang <teddyyyy123@gmail.com
>     <ma...@gmail.com>> wrote:
>     > there is a JIRA completed in 0.7.x that "Prefers" a certain node
>     in snitch,
>     > so this does roughly what you want MOST of the time
>     >
>     > but the problem is that it does not GUARANTEE that the same node
>     will always
>     > be read.  I recently read into the HBase vs Cassandra comparison
>     thread that
>     > started after Facebook dropped Cassandra for their messaging
>     system, and
>     > understood some of the differences. what you want is essentially
>     what HBase
>     > does. the fundamental difference there is really due to the
>     gossip protocol:
>     > it's a probablistic, or eventually consistent failure detector
>      while
>     > HBase/Google Bigtable use Zookeeper/Chubby to provide a strong
>     failure
>     > detector (a distributed lock).  so in HBase, if a tablet server
>     goes down,
>     > it really goes down, it can not re-grab the tablet from the new
>     tablet
>     > server without going through a start up protocol (notifying the
>     master,
>     > which would notify the clients etc),  in other words it is
>     guaranteed that
>     > one tablet is served by only one tablet server at any given
>     time.  in
>     > comparison the above JIRA only TRYIES to serve that key from one
>     particular
>     > replica. HBase can have that guarantee because the group
>     membership is
>     > maintained by the strong failure detector.
>     > just for hacking curiosity, a strong failure detector +
>     Cassandra replicas
>     > is not impossible (actually seems not difficult), although the
>     performance
>     > is not clear. what would such a strong failure detector bring to
>     Cassandra
>     > besides this ONE-ONE strong consistency ? that is an interesting
>     question I
>     > think.
>     > considering that HBase has been deployed on big clusters, it is
>     probably OK
>     > with the performance of the strong  Zookeeper failure detector.
>     then a
>     > further question was: why did Dynamo originally choose to use the
>     > probablistic failure detector? yes Dynamo's main theme is
>     "eventually
>     > consistent", so the Phi-detector is **enough**, but if a strong
>     detector
>     > buys us more with little cost, wouldn't that  be great?
>     >
>     >
>     > On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net
>     <ma...@dude.podzone.net>> wrote:
>     >>
>     >> Is this possible?
>     >>
>     >> All reads and writes for a given key will always go to the same
>     node from
>     >> a client.  It seems the only thing needed is to allow the
>     clients to compute
>     >> which node is the closes replica for the given key using the
>     same algorithm
>     >> C* uses.  When the first replica receives the write request, it
>     will write
>     >> to itself which should complete before any of the other
>     replicas and then
>     >> return.  The loads should still stay balanced if using random
>     partitioner.
>     >>  If the first replica becomes unavailable (however that is
>     defined), then
>     >> the clients can send to the next repilca in the ring and switch
>     from ONE
>     >> write/reads to QUORUM write/reads temporarily until the first
>     replica
>     >> becomes available again.  QUORUM is required since there could
>     be some
>     >> replicas that were not updated after the first replica went down.
>     >>
>     >> Will this work?  The goal is to have strong consistency with a
>     read/write
>     >> consistency level as low as possible while secondarily a
>     network performance
>     >> boost.
>     >
>     >
>
>
>
>     --
>     Jonathan Ellis
>     Project Chair, Apache Cassandra
>     co-founder of DataStax, the source for professional Cassandra support
>     http://www.datastax.com
>
>


Re: Strong Consistency with ONE read/writes

Posted by Yang <te...@gmail.com>.
Jonathan:

could you please elaborate more on specifically why they are "not even
close"?
 --- I kind of see what you mean (please correct me if I misunderstood):
the Cassandra failure detector is consulted on every write, while the HBase
failure detector is only used when a tablet server joins or leaves.

In order to have the single-write-entry-point approach originally brought
up in this thread, I think you need a strong membership protocol to lock
the key-range leadership; once leadership is acquired, failure detectors
do not need to be consulted on every write.

Yes, by definition of the original requirement brought up in this thread,
Cassandra's write behavior is going to be changed to be more like HBase,
and like mongo in "replica set" mode.  But it seems that this leader mode
could even co-exist with the multi-entry write mode that Cassandra uses now,
just as you can use a different CL for each single write request.  In that
case you would need to keep the current lightweight Phi-detector and also
add ZK leader election for the single-entry write mode.

Thanks
Yang


(I should correct my terminology... it's not a "strong failure detector"
that's needed, it's a "strong membership protocol".  Strongly complete and
accurate failure detectors do not exist in async distributed systems;
see Tushar Chandra, "Unreliable Failure Detectors for Reliable Distributed
Systems", Journal of the ACM, 43(2):225-267, 1996
<http://doi.acm.org/10.1145/226643.226647>, and FLP, "Impossibility of
Distributed Consensus with One Faulty Process"
<http://www.podc.org/influential/2001.html>.)
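
For reference, the "ZK leader election" piece mentioned above is typically
built on ephemeral sequential znodes: each candidate creates one, the lowest
sequence number is the leader, and when that session dies its znode vanishes
and the next candidate takes over.  A rough sketch using the kazoo client
(the path and identifiers are made up for illustration):

# Rough sketch of ZK leader election for a key range, using ephemeral
# sequential znodes (kazoo client; path and node id are illustrative).
from kazoo.client import KazooClient

ELECTION_PATH = "/cassandra/leaders/range-42"   # hypothetical key-range path

def elect(my_id, hosts="127.0.0.1:2181"):
    zk = KazooClient(hosts=hosts)
    zk.start()
    zk.ensure_path(ELECTION_PATH)
    # Ephemeral: the znode disappears if this node's ZK session dies.
    # Sequential: ZK appends an increasing suffix, ordering the candidates.
    me = zk.create(ELECTION_PATH + "/candidate-", my_id.encode(),
                   ephemeral=True, sequence=True)
    candidates = sorted(zk.get_children(ELECTION_PATH))
    is_leader = me.split("/")[-1] == candidates[0]   # lowest sequence wins
    # (A real recipe would also watch the next-lower candidate to learn
    # when leadership changes, instead of re-listing children.)
    return zk, is_leader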


On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> The way HBase uses ZK (for master election) is not even close to how
> Cassandra uses the failure detector.
>
> Using ZK for each operation would (a) not scale and (b) not work
> cross-DC for any reasonable latency requirements.
>
> On Sat, Jul 2, 2011 at 11:55 AM, Yang <te...@gmail.com> wrote:
> > there is a JIRA completed in 0.7.x that "Prefers" a certain node in
> snitch,
> > so this does roughly what you want MOST of the time
> >
> > but the problem is that it does not GUARANTEE that the same node will
> always
> > be read.  I recently read into the HBase vs Cassandra comparison thread
> that
> > started after Facebook dropped Cassandra for their messaging system, and
> > understood some of the differences. what you want is essentially what
> HBase
> > does. the fundamental difference there is really due to the gossip
> protocol:
> > it's a probablistic, or eventually consistent failure detector  while
> > HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure
> > detector (a distributed lock).  so in HBase, if a tablet server goes
> down,
> > it really goes down, it can not re-grab the tablet from the new tablet
> > server without going through a start up protocol (notifying the master,
> > which would notify the clients etc),  in other words it is guaranteed
> that
> > one tablet is served by only one tablet server at any given time.  in
> > comparison the above JIRA only TRYIES to serve that key from one
> particular
> > replica. HBase can have that guarantee because the group membership is
> > maintained by the strong failure detector.
> > just for hacking curiosity, a strong failure detector + Cassandra
> replicas
> > is not impossible (actually seems not difficult), although the
> performance
> > is not clear. what would such a strong failure detector bring to
> Cassandra
> > besides this ONE-ONE strong consistency ? that is an interesting question
> I
> > think.
> > considering that HBase has been deployed on big clusters, it is probably
> OK
> > with the performance of the strong  Zookeeper failure detector. then a
> > further question was: why did Dynamo originally choose to use the
> > probablistic failure detector? yes Dynamo's main theme is "eventually
> > consistent", so the Phi-detector is **enough**, but if a strong detector
> > buys us more with little cost, wouldn't that  be great?
> >
> >
> > On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj...@dude.podzone.net> wrote:
> >>
> >> Is this possible?
> >>
> >> All reads and writes for a given key will always go to the same node
> from
> >> a client.  It seems the only thing needed is to allow the clients to
> compute
> >> which node is the closes replica for the given key using the same
> algorithm
> >> C* uses.  When the first replica receives the write request, it will
> write
> >> to itself which should complete before any of the other replicas and
> then
> >> return.  The loads should still stay balanced if using random
> partitioner.
> >>  If the first replica becomes unavailable (however that is defined),
> then
> >> the clients can send to the next repilca in the ring and switch from ONE
> >> write/reads to QUORUM write/reads temporarily until the first replica
> >> becomes available again.  QUORUM is required since there could be some
> >> replicas that were not updated after the first replica went down.
> >>
> >> Will this work?  The goal is to have strong consistency with a
> read/write
> >> consistency level as low as possible while secondarily a network
> performance
> >> boost.
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: Strong Consistency with ONE read/writes

Posted by Jonathan Ellis <jb...@gmail.com>.
The way HBase uses ZK (for master election) is not even close to how
Cassandra uses the failure detector.

Using ZK for each operation would (a) not scale and (b) not work
cross-DC for any reasonable latency requirements.

On Sat, Jul 2, 2011 at 11:55 AM, Yang <te...@gmail.com> wrote:
> there is a JIRA completed in 0.7.x that "Prefers" a certain node in snitch,
> so this does roughly what you want MOST of the time
>
> but the problem is that it does not GUARANTEE that the same node will always
> be read.  I recently read into the HBase vs Cassandra comparison thread that
> started after Facebook dropped Cassandra for their messaging system, and
> understood some of the differences. what you want is essentially what HBase
> does. the fundamental difference there is really due to the gossip protocol:
> it's a probabilistic, or eventually consistent, failure detector, while
> HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure
> detector (a distributed lock).  so in HBase, if a tablet server goes down,
> it really goes down, it can not re-grab the tablet from the new tablet
> server without going through a start up protocol (notifying the master,
> which would notify the clients etc),  in other words it is guaranteed that
> one tablet is served by only one tablet server at any given time.  in
> comparison the above JIRA only TRIES to serve that key from one particular
> replica. HBase can have that guarantee because the group membership is
> maintained by the strong failure detector.
> just for hacking curiosity, a strong failure detector + Cassandra replicas
> is not impossible (actually seems not difficult), although the performance
> is not clear. what would such a strong failure detector bring to Cassandra
> besides this ONE-ONE strong consistency ? that is an interesting question I
> think.
> considering that HBase has been deployed on big clusters, it is probably OK
> with the performance of the strong  Zookeeper failure detector. then a
> further question was: why did Dynamo originally choose to use the
> probabilistic failure detector? yes Dynamo's main theme is "eventually
> consistent", so the Phi-detector is **enough**, but if a strong detector
> buys us more with little cost, wouldn't that  be great?
>
>
> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj...@dude.podzone.net> wrote:
>>
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same node from
>> a client.  It seems the only thing needed is to allow the clients to compute
>> which node is the closest replica for the given key using the same algorithm
>> C* uses.  When the first replica receives the write request, it will write
>> to itself which should complete before any of the other replicas and then
>> return.  The loads should still stay balanced if using random partitioner.
>>  If the first replica becomes unavailable (however that is defined), then
>> the clients can send to the next replica in the ring and switch from ONE
>> write/reads to QUORUM write/reads temporarily until the first replica
>> becomes available again.  QUORUM is required since there could be some
>> replicas that were not updated after the first replica went down.
>>
>> Will this work?  The goal is to have strong consistency with a read/write
>> consistency level as low as possible while secondarily a network performance
>> boost.
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Strong Consistency with ONE read/writes

Posted by Yang <te...@gmail.com>.
That is not an important issue; it's separate from the replication
question I'm thinking about.
For now I'll just think about the case where every node owns the same
key range, i.e. N=RF.


> Are you saying:  All replicas will receive the value whether or not they
> actually own the key range for the value.  If a node is not a replica for a
> value, it will not store it, but it will still write it in it's transaction
> log as a backup in case the leader dies.  Is that right?
>
>>

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
On 7/12/2011 10:48 AM, Yang wrote:
> for example,
> coord writes record 1,2 ,3 ,4,5 in sequence
> if u have replica A, B, C
> currently A can have 1 , 3
> B can have 1,3,4,
> C can have 2345
>
> by "prefix", I mean I want them to have only 1---n  where n is some
> number  between 1 and 5,
> for example A having 1,2,3
> B having 1,2,3,4
> C having 1,2,3,4,5
>
> the way we enforce this prefix pattern is that
> 1) the leader is ensured to have everything that's sent out, otherwise
> it's removed from leader position
> 2) non-leader replicas is guaranteed to receive a prefix, because of
> FIFO of the connection between replica and coordinator, if this
> connection breaks, replica must catchup from the authoritative source
> of leader
>
> there is one point I hand-waved a bit: there are many coordinators,
> the "prefix" from each of them is different, still need to think about
> this, worst case is that we need to force the traffic come from the
> leader, which is less interesting because it's almost hbase then...
>

Are you saying:  All replicas will receive the value whether or not they 
actually own the key range for the value.  If a node is not a replica 
for a value, it will not store it, but it will still write it in its 
transaction log as a backup in case the leader dies.  Is that right?

>
>
> On Tue, Jul 12, 2011 at 7:37 AM, AJ<aj...@dude.podzone.net>  wrote:
>> Yang, I'm not sure I understand what you mean by "prefix of the HLog".
>>   Also, can you explain what failure scenario you are talking about?  The
>> major failure that I see is when the leader node confirms to the client a
>> successful local write, but then fails before the write can be replicated to
>> any other replica node.  But, then again, you also say that the leader does
>> not forward replicas in your idea; so it's not real clear.
>>
>> I'm still trying to figure out how to make this work with normal Cass
>> operation.
>>
>> aj
>>
>> On 7/11/2011 3:48 PM, Yang wrote:
>>> I'm not proposing any changes to be done, but this looks like a very
>>> interesting topic for thought/hack/learning, so the following are only
>>> for thought exercises ....
>>>
>>>
>>> HBase enforces a single write/read entry point, so you can achieve
>>> strong consistency by writing/reading only one node.  but just writing
>>> to one node exposes you to loss of data if that node fails. so the
>>> region server HLog is replicated to 3 HDFS data nodes.  the
>>> interesting thing here is that each replica sees a complete *prefix*
>>> of the HLog: it won't miss a record, if a record sync() to a data node
>>> fails, all the existing bytes in the block are replicated to a new
>>> data node.
>>>
>>> if we employ a similar "leader" node among the N replicas of
>>> cassandra (coordinator always waits for the reply from leader, but
>>> leader does not do further replication like in HBase or counters), the
>>> leader sees all writes onto the key range, but the other replicas
>>> could miss some writes, as a result, each of the non-leader replicas'
>>> write history has some "holes", so when the leader dies, and when we
>>> elect a new one, no one is going to have a complete history. so you'd
>>> have to do a repair amongst all the replicas to reconstruct the full
>>> history, which is slow.
>>>
>>> it seems possible that we could utilize the FIFO property of the
>>> InComingTCPConnection to simplify history reconstruction, just like
>>> Zookeeper. if the IncomingTcpConnection of a replica fails, that means
>>> that it may have missed some edits, then when it reconnects, we force
>>> it to talk to the active leader first, to catch up to date. when the
>>> leader dies, the next leader is elected to be the replica with the
>>> most recent history.  by maintaining the property that each node has a
>>> complete prefix of history, we only need to catch up on the tail of
>>> history, and avoid doing a complete repair on the entire
>>> memtable+SStable.  but one issue is that the history at the leader has
>>> to be kept really long ----- if a non-leader replica goes off for 2
>>> days, the leader has to keep all the history for 2 days to feed them
>>> to the replica when it comes back online. but possibly this could be
>>> limited to some max length so that over that length, the woken replica
>>> simply does a complete bootstrap.
>>>
>>>
>>> thanks
>>> yang
>>> On Sun, Jul 3, 2011 at 8:25 PM, AJ<aj...@dude.podzone.net>    wrote:
>>>> We seem to be having a fundamental misunderstanding.  Thanks for your
>>>> comments. aj
>>>>
>>>> On 7/3/2011 8:28 PM, William Oberman wrote:
>>>>
>>>> I'm using cassandra as a tool, like a black box with a certain contract
>>>> to
>>>> the world.  Without modifying the "core", C* will send the updates to all
>>>> replicas, so your plan would cause the extra write (for the placeholder).
>>>>   I
>>>> wasn't assuming a modification to how C* fundamentally works.
>>>> Sounds like you are hacking (or at least looking) at the source, so all
>>>> the
>>>> power to you if/when you try these kind of changes.
>>>> will
>>>> On Sun, Jul 3, 2011 at 8:45 PM, AJ<aj...@dude.podzone.net>    wrote:
>>>>> On 7/3/2011 6:32 PM, William Oberman wrote:
>>>>>
>>>>> Was just going off of: " Send the value to the primary replica and send
>>>>> placeholder values to the other replicas".  Sounded like you wanted to
>>>>> write
>>>>> the value to one, and write the placeholder to N-1 to me.
>>>>>
>>>>> Yes, that is what I was suggesting.  The point of the placeholders is to
>>>>> handle the crash case that I talked about... "like" a WAL does.
>>>>>
>>>>> But, C* will propagate the value to N-1 eventually anyways, 'cause
>>>>> that's
>>>>> just what it does anyways :-)
>>>>> will
>>>>>
>>>>> On Sun, Jul 3, 2011 at 7:47 PM, AJ<aj...@dude.podzone.net>    wrote:
>>>>>> On 7/3/2011 3:49 PM, Will Oberman wrote:
>>>>>>
>>>>>> Why not send the value itself instead of a placeholder?  Now it takes
>>>>>> 2x
>>>>>> writes on a random node to do a single update (write placeholder, write
>>>>>> update) and N*x writes from the client (write value, write placeholder
>>>>>> to
>>>>>> N-1). Where N is replication factor.  Seems like extra network and IO
>>>>>> instead of less...
>>>>>>
>>>>>> To send the value to each node is 1.) unnecessary, 2.) will only cause
>>>>>> a
>>>>>> large burst of network traffic.  Think about if it's a large data
>>>>>> value,
>>>>>> such as a document.  Just let C* do it's thing.  The extra messages are
>>>>>> tiny
>>>>>> and doesn't significantly increase latency since they are all sent
>>>>>> asynchronously.
>>>>>>
>>>>>>
>>>>>> Of course, I still think this sounds like reimplementing Cassandra
>>>>>> internals in a Cassandra client (just guessing, I'm not a cassandra
>>>>>> dev)
>>>>>>
>>>>>> I don't see how.  Maybe you should take a peek at the source.
>>>>>>
>>>>>>
>>>>>> On Jul 3, 2011, at 5:20 PM, AJ<aj...@dude.podzone.net>    wrote:
>>>>>>
>>>>>> Yang,
>>>>>>
>>>>>> How would you deal with the problem when the 1st node responds success
>>>>>> but then crashes before completely forwarding any replicas?  Then,
>>>>>> after
>>>>>> switching to the next primary, a read would return stale data.
>>>>>>
>>>>>> Here's a quick-n-dirty way:  Send the value to the primary replica and
>>>>>> send placeholder values to the other replicas.  The placeholder value
>>>>>> is
>>>>>> something like, "PENDING_UPDATE".  The placeholder values are sent with
>>>>>> timestamps 1 less than the timestamp for the actual value that went to
>>>>>> the
>>>>>> primary.  Later, when the changes propagate, the actual values will
>>>>>> overwrite the placeholders.  In event of a crash before the placeholder
>>>>>> gets
>>>>>> overwritten, the next read value will tell the client so.  The client
>>>>>> will
>>>>>> report to the user that the key/column is unavailable.  The downside is
>>>>>> you've overwritten your data and maybe would like to know what the old
>>>>>> data
>>>>>> was!  But, maybe there's another way using other columns or with MVCC.
>>>>>>   The
>>>>>> client would want a success from the primary and the secondary replicas
>>>>>> to
>>>>>> be certain of future read consistency in case the primary goes down
>>>>>> immediately as I said above.  The ability to set an "update_pending"
>>>>>> flag on
>>>>>> any column value would probably make this work.  But, I'll think more
>>>>>> on
>>>>>> this later.
>>>>>>
>>>>>> aj
>>>>>>
>>>>>> On 7/2/2011 10:55 AM, Yang wrote:
>>>>>>
>>>>>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>>>>>> snitch, so this does roughly what you want MOST of the time
>>>>>>
>>>>>> but the problem is that it does not GUARANTEE that the same node will
>>>>>> always be read.  I recently read into the HBase vs Cassandra comparison
>>>>>> thread that started after Facebook dropped Cassandra for their
>>>>>> messaging
>>>>>> system, and understood some of the differences. what you want is
>>>>>> essentially
>>>>>> what HBase does. the fundamental difference there is really due to the
>>>>>> gossip protocol: it's a probablistic, or eventually consistent failure
>>>>>> detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a
>>>>>> strong failure detector (a distributed lock).  so in HBase, if a tablet
>>>>>> server goes down, it really goes down, it can not re-grab the tablet
>>>>>> from
>>>>>> the new tablet server without going through a start up protocol
>>>>>> (notifying
>>>>>> the master, which would notify the clients etc),  in other words it is
>>>>>> guaranteed that one tablet is served by only one tablet server at any
>>>>>> given
>>>>>> time.  in comparison the above JIRA only TRYIES to serve that key from
>>>>>> one
>>>>>> particular replica. HBase can have that guarantee because the group
>>>>>> membership is maintained by the strong failure detector.
>>>>>> just for hacking curiosity, a strong failure detector + Cassandra
>>>>>> replicas is not impossible (actually seems not difficult), although the
>>>>>> performance is not clear. what would such a strong failure detector
>>>>>> bring to
>>>>>> Cassandra besides this ONE-ONE strong consistency ? that is an
>>>>>> interesting
>>>>>> question I think.
>>>>>> considering that HBase has been deployed on big clusters, it is
>>>>>> probably
>>>>>> OK with the performance of the strong  Zookeeper failure detector. then
>>>>>> a
>>>>>> further question was: why did Dynamo originally choose to use the
>>>>>> probablistic failure detector? yes Dynamo's main theme is "eventually
>>>>>> consistent", so the Phi-detector is **enough**, but if a strong
>>>>>> detector
>>>>>> buys us more with little cost, wouldn't that  be great?
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ<aj...@dude.podzone.net>    wrote:
>>>>>>> Is this possible?
>>>>>>>
>>>>>>> All reads and writes for a given key will always go to the same node
>>>>>>> from a client.  It seems the only thing needed is to allow the clients
>>>>>>> to
>>>>>>> compute which node is the closes replica for the given key using the
>>>>>>> same
>>>>>>> algorithm C* uses.  When the first replica receives the write request,
>>>>>>> it
>>>>>>> will write to itself which should complete before any of the other
>>>>>>> replicas
>>>>>>> and then return.  The loads should still stay balanced if using random
>>>>>>> partitioner.  If the first replica becomes unavailable (however that
>>>>>>> is
>>>>>>> defined), then the clients can send to the next repilca in the ring
>>>>>>> and
>>>>>>> switch from ONE write/reads to QUORUM write/reads temporarily until
>>>>>>> the
>>>>>>> first replica becomes available again.  QUORUM is required since there
>>>>>>> could
>>>>>>> be some replicas that were not updated after the first replica went
>>>>>>> down.
>>>>>>>
>>>>>>> Will this work?  The goal is to have strong consistency with a
>>>>>>> read/write consistency level as low as possible while secondarily a
>>>>>>> network
>>>>>>> performance boost.
>>>>>>
>>>>>
>>>>> --
>>>>> Will Oberman
>>>>> Civic Science, Inc.
>>>>> 3030 Penn Avenue., First Floor
>>>>> Pittsburgh, PA 15201
>>>>> (M) 412-480-7835
>>>>> (E) oberman@civicscience.com
>>>>>
>>>>
>>>> --
>>>> Will Oberman
>>>> Civic Science, Inc.
>>>> 3030 Penn Avenue., First Floor
>>>> Pittsburgh, PA 15201
>>>> (M) 412-480-7835
>>>> (E) oberman@civicscience.com
>>>>
>>>>
>>


Re: Strong Consistency with ONE read/writes

Posted by Yang <te...@gmail.com>.
For example, the coordinator writes records 1, 2, 3, 4, 5 in sequence.
If you have replicas A, B, C, then currently
A can have 1, 3
B can have 1, 3, 4
C can have 2, 3, 4, 5

By "prefix", I mean I want them to hold only 1..n, where n is some
number between 1 and 5, for example
A having 1, 2, 3
B having 1, 2, 3, 4
C having 1, 2, 3, 4, 5

The way we enforce this prefix pattern is:
1) the leader is ensured to have everything that has been sent out,
otherwise it's removed from the leader position
2) non-leader replicas are guaranteed to receive a prefix, because of the
FIFO property of the connection between replica and coordinator; if this
connection breaks, the replica must catch up from the authoritative
source, the leader

There is one point I hand-waved a bit: there are many coordinators, and
the "prefix" from each of them is different.  I still need to think about
this; the worst case is that we need to force all traffic to come from the
leader, which is less interesting because it's almost HBase then...
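
A minimal sketch of the prefix rule described above (purely illustrative:
the sequence numbers, the leader's log, and this replica interface are
assumptions, not existing Cassandra structures).  A replica accepts a write
only if it extends its contiguous prefix 1..n; a gap means the FIFO stream
broke, and on reconnect the replica fetches just the missing tail from the
leader instead of doing a full repair:

# Illustrative sketch of the prefix property (not C* code).
class PrefixReplica:
    def __init__(self):
        self.log = []                     # records 1..n, with no holes

    def prefix_end(self):
        return len(self.log)              # n: highest contiguous sequence number

    def on_write(self, seq, record):
        if seq == self.prefix_end() + 1:  # extends the prefix: accept
            self.log.append(record)
            return True
        return False                      # gap: the FIFO stream broke, must catch up

    def catch_up(self, leader_log):
        """Fetch only the tail past our prefix from the leader (the
        authoritative source), instead of repairing the whole table."""
        self.log.extend(leader_log[self.prefix_end():])

# e.g. after a reconnect: replica.catch_up(leader.log)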




On Tue, Jul 12, 2011 at 7:37 AM, AJ <aj...@dude.podzone.net> wrote:
> Yang, I'm not sure I understand what you mean by "prefix of the HLog".
>  Also, can you explain what failure scenario you are talking about?  The
> major failure that I see is when the leader node confirms to the client a
> successful local write, but then fails before the write can be replicated to
> any other replica node.  But, then again, you also say that the leader does
> not forward replicas in your idea; so it's not real clear.
>
> I'm still trying to figure out how to make this work with normal Cass
> operation.
>
> aj
>
> On 7/11/2011 3:48 PM, Yang wrote:
>>
>> I'm not proposing any changes to be done, but this looks like a very
>> interesting topic for thought/hack/learning, so the following are only
>> for thought exercises ....
>>
>>
>> HBase enforces a single write/read entry point, so you can achieve
>> strong consistency by writing/reading only one node.  but just writing
>> to one node exposes you to loss of data if that node fails. so the
>> region server HLog is replicated to 3 HDFS data nodes.  the
>> interesting thing here is that each replica sees a complete *prefix*
>> of the HLog: it won't miss a record, if a record sync() to a data node
>> fails, all the existing bytes in the block are replicated to a new
>> data node.
>>
>> if we employ a similar "leader" node among the N replicas of
>> cassandra (coordinator always waits for the reply from leader, but
>> leader does not do further replication like in HBase or counters), the
>> leader sees all writes onto the key range, but the other replicas
>> could miss some writes, as a result, each of the non-leader replicas'
>> write history has some "holes", so when the leader dies, and when we
>> elect a new one, no one is going to have a complete history. so you'd
>> have to do a repair amongst all the replicas to reconstruct the full
>> history, which is slow.
>>
>> it seems possible that we could utilize the FIFO property of the
>> InComingTCPConnection to simplify history reconstruction, just like
>> Zookeeper. if the IncomingTcpConnection of a replica fails, that means
>> that it may have missed some edits, then when it reconnects, we force
>> it to talk to the active leader first, to catch up to date. when the
>> leader dies, the next leader is elected to be the replica with the
>> most recent history.  by maintaining the property that each node has a
>> complete prefix of history, we only need to catch up on the tail of
>> history, and avoid doing a complete repair on the entire
>> memtable+SStable.  but one issue is that the history at the leader has
>> to be kept really long ----- if a non-leader replica goes off for 2
>> days, the leader has to keep all the history for 2 days to feed them
>> to the replica when it comes back online. but possibly this could be
>> limited to some max length so that over that length, the woken replica
>> simply does a complete bootstrap.
>>
>>
>> thanks
>> yang
>> On Sun, Jul 3, 2011 at 8:25 PM, AJ<aj...@dude.podzone.net>  wrote:
>>>
>>> We seem to be having a fundamental misunderstanding.  Thanks for your
>>> comments. aj
>>>
>>> On 7/3/2011 8:28 PM, William Oberman wrote:
>>>
>>> I'm using cassandra as a tool, like a black box with a certain contract
>>> to
>>> the world.  Without modifying the "core", C* will send the updates to all
>>> replicas, so your plan would cause the extra write (for the placeholder).
>>>  I
>>> wasn't assuming a modification to how C* fundamentally works.
>>> Sounds like you are hacking (or at least looking) at the source, so all
>>> the
>>> power to you if/when you try these kind of changes.
>>> will
>>> On Sun, Jul 3, 2011 at 8:45 PM, AJ<aj...@dude.podzone.net>  wrote:
>>>>
>>>> On 7/3/2011 6:32 PM, William Oberman wrote:
>>>>
>>>> Was just going off of: " Send the value to the primary replica and send
>>>> placeholder values to the other replicas".  Sounded like you wanted to
>>>> write
>>>> the value to one, and write the placeholder to N-1 to me.
>>>>
>>>> Yes, that is what I was suggesting.  The point of the placeholders is to
>>>> handle the crash case that I talked about... "like" a WAL does.
>>>>
>>>> But, C* will propagate the value to N-1 eventually anyways, 'cause
>>>> that's
>>>> just what it does anyways :-)
>>>> will
>>>>
>>>> On Sun, Jul 3, 2011 at 7:47 PM, AJ<aj...@dude.podzone.net>  wrote:
>>>>>
>>>>> On 7/3/2011 3:49 PM, Will Oberman wrote:
>>>>>
>>>>> Why not send the value itself instead of a placeholder?  Now it takes
>>>>> 2x
>>>>> writes on a random node to do a single update (write placeholder, write
>>>>> update) and N*x writes from the client (write value, write placeholder
>>>>> to
>>>>> N-1). Where N is replication factor.  Seems like extra network and IO
>>>>> instead of less...
>>>>>
>>>>> To send the value to each node is 1.) unnecessary, 2.) will only cause
>>>>> a
>>>>> large burst of network traffic.  Think about if it's a large data
>>>>> value,
>>>>> such as a document.  Just let C* do it's thing.  The extra messages are
>>>>> tiny
>>>>> and doesn't significantly increase latency since they are all sent
>>>>> asynchronously.
>>>>>
>>>>>
>>>>> Of course, I still think this sounds like reimplementing Cassandra
>>>>> internals in a Cassandra client (just guessing, I'm not a cassandra
>>>>> dev)
>>>>>
>>>>> I don't see how.  Maybe you should take a peek at the source.
>>>>>
>>>>>
>>>>> On Jul 3, 2011, at 5:20 PM, AJ<aj...@dude.podzone.net>  wrote:
>>>>>
>>>>> Yang,
>>>>>
>>>>> How would you deal with the problem when the 1st node responds success
>>>>> but then crashes before completely forwarding any replicas?  Then,
>>>>> after
>>>>> switching to the next primary, a read would return stale data.
>>>>>
>>>>> Here's a quick-n-dirty way:  Send the value to the primary replica and
>>>>> send placeholder values to the other replicas.  The placeholder value
>>>>> is
>>>>> something like, "PENDING_UPDATE".  The placeholder values are sent with
>>>>> timestamps 1 less than the timestamp for the actual value that went to
>>>>> the
>>>>> primary.  Later, when the changes propagate, the actual values will
>>>>> overwrite the placeholders.  In event of a crash before the placeholder
>>>>> gets
>>>>> overwritten, the next read value will tell the client so.  The client
>>>>> will
>>>>> report to the user that the key/column is unavailable.  The downside is
>>>>> you've overwritten your data and maybe would like to know what the old
>>>>> data
>>>>> was!  But, maybe there's another way using other columns or with MVCC.
>>>>>  The
>>>>> client would want a success from the primary and the secondary replicas
>>>>> to
>>>>> be certain of future read consistency in case the primary goes down
>>>>> immediately as I said above.  The ability to set an "update_pending"
>>>>> flag on
>>>>> any column value would probably make this work.  But, I'll think more
>>>>> on
>>>>> this later.
>>>>>
>>>>> aj
>>>>>
>>>>> On 7/2/2011 10:55 AM, Yang wrote:
>>>>>
>>>>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>>>>> snitch, so this does roughly what you want MOST of the time
>>>>>
>>>>> but the problem is that it does not GUARANTEE that the same node will
>>>>> always be read.  I recently read into the HBase vs Cassandra comparison
>>>>> thread that started after Facebook dropped Cassandra for their
>>>>> messaging
>>>>> system, and understood some of the differences. what you want is
>>>>> essentially
>>>>> what HBase does. the fundamental difference there is really due to the
>>>>> gossip protocol: it's a probablistic, or eventually consistent failure
>>>>> detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a
>>>>> strong failure detector (a distributed lock).  so in HBase, if a tablet
>>>>> server goes down, it really goes down, it can not re-grab the tablet
>>>>> from
>>>>> the new tablet server without going through a start up protocol
>>>>> (notifying
>>>>> the master, which would notify the clients etc),  in other words it is
>>>>> guaranteed that one tablet is served by only one tablet server at any
>>>>> given
>>>>> time.  in comparison the above JIRA only TRYIES to serve that key from
>>>>> one
>>>>> particular replica. HBase can have that guarantee because the group
>>>>> membership is maintained by the strong failure detector.
>>>>> just for hacking curiosity, a strong failure detector + Cassandra
>>>>> replicas is not impossible (actually seems not difficult), although the
>>>>> performance is not clear. what would such a strong failure detector
>>>>> bring to
>>>>> Cassandra besides this ONE-ONE strong consistency ? that is an
>>>>> interesting
>>>>> question I think.
>>>>> considering that HBase has been deployed on big clusters, it is
>>>>> probably
>>>>> OK with the performance of the strong  Zookeeper failure detector. then
>>>>> a
>>>>> further question was: why did Dynamo originally choose to use the
>>>>> probablistic failure detector? yes Dynamo's main theme is "eventually
>>>>> consistent", so the Phi-detector is **enough**, but if a strong
>>>>> detector
>>>>> buys us more with little cost, wouldn't that  be great?
>>>>>
>>>>>
>>>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ<aj...@dude.podzone.net>  wrote:
>>>>>>
>>>>>> Is this possible?
>>>>>>
>>>>>> All reads and writes for a given key will always go to the same node
>>>>>> from a client.  It seems the only thing needed is to allow the clients
>>>>>> to
>>>>>> compute which node is the closes replica for the given key using the
>>>>>> same
>>>>>> algorithm C* uses.  When the first replica receives the write request,
>>>>>> it
>>>>>> will write to itself which should complete before any of the other
>>>>>> replicas
>>>>>> and then return.  The loads should still stay balanced if using random
>>>>>> partitioner.  If the first replica becomes unavailable (however that
>>>>>> is
>>>>>> defined), then the clients can send to the next repilca in the ring
>>>>>> and
>>>>>> switch from ONE write/reads to QUORUM write/reads temporarily until
>>>>>> the
>>>>>> first replica becomes available again.  QUORUM is required since there
>>>>>> could
>>>>>> be some replicas that were not updated after the first replica went
>>>>>> down.
>>>>>>
>>>>>> Will this work?  The goal is to have strong consistency with a
>>>>>> read/write consistency level as low as possible while secondarily a
>>>>>> network
>>>>>> performance boost.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Will Oberman
>>>> Civic Science, Inc.
>>>> 3030 Penn Avenue., First Floor
>>>> Pittsburgh, PA 15201
>>>> (M) 412-480-7835
>>>> (E) oberman@civicscience.com
>>>>
>>>
>>>
>>> --
>>> Will Oberman
>>> Civic Science, Inc.
>>> 3030 Penn Avenue., First Floor
>>> Pittsburgh, PA 15201
>>> (M) 412-480-7835
>>> (E) oberman@civicscience.com
>>>
>>>
>
>

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
Yang, I'm not sure I understand what you mean by "prefix of the HLog".  
Also, can you explain what failure scenario you are talking about?  The 
major failure that I see is when the leader node confirms to the client 
a successful local write, but then fails before the write can be 
replicated to any other replica node.  But, then again, you also say 
that the leader does not forward replicas in your idea, so it's not really 
clear.

I'm still trying to figure out how to make this work with normal Cass 
operation.

aj

On 7/11/2011 3:48 PM, Yang wrote:
> I'm not proposing any changes to be done, but this looks like a very
> interesting topic for thought/hack/learning, so the following are only
> for thought exercises ....
>
>
> HBase enforces a single write/read entry point, so you can achieve
> strong consistency by writing/reading only one node.  but just writing
> to one node exposes you to loss of data if that node fails. so the
> region server HLog is replicated to 3 HDFS data nodes.  the
> interesting thing here is that each replica sees a complete *prefix*
> of the HLog: it won't miss a record, if a record sync() to a data node
> fails, all the existing bytes in the block are replicated to a new
> data node.
>
> if we employ a similar "leader" node among the N replicas of
> cassandra (coordinator always waits for the reply from leader, but
> leader does not do further replication like in HBase or counters), the
> leader sees all writes onto the key range, but the other replicas
> could miss some writes, as a result, each of the non-leader replicas'
> write history has some "holes", so when the leader dies, and when we
> elect a new one, no one is going to have a complete history. so you'd
> have to do a repair amongst all the replicas to reconstruct the full
> history, which is slow.
>
> it seems possible that we could utilize the FIFO property of the
> InComingTCPConnection to simplify history reconstruction, just like
> Zookeeper. if the IncomingTcpConnection of a replica fails, that means
> that it may have missed some edits, then when it reconnects, we force
> it to talk to the active leader first, to catch up to date. when the
> leader dies, the next leader is elected to be the replica with the
> most recent history.  by maintaining the property that each node has a
> complete prefix of history, we only need to catch up on the tail of
> history, and avoid doing a complete repair on the entire
> memtable+SStable.  but one issue is that the history at the leader has
> to be kept really long ----- if a non-leader replica goes off for 2
> days, the leader has to keep all the history for 2 days to feed them
> to the replica when it comes back online. but possibly this could be
> limited to some max length so that over that length, the woken replica
> simply does a complete bootstrap.
>
>
> thanks
> yang


Re: Strong Consistency with ONE read/writes

Posted by Yang <te...@gmail.com>.
thanks , let me read it...

On Tue, Jul 12, 2011 at 9:27 AM, Ryan King <ry...@twitter.com> wrote:
> If you're interested in this idea, you should read up about Spinnaker:
> http://www.vldb.org/pvldb/vol4/p243-rao.pdf
>
> -ryan

Re: Strong Consistency with ONE read/writes

Posted by Ryan King <ry...@twitter.com>.
If you're interested in this idea, you should read up about Spinnaker:
http://www.vldb.org/pvldb/vol4/p243-rao.pdf

-ryan

On Mon, Jul 11, 2011 at 2:48 PM, Yang <te...@gmail.com> wrote:
> I'm not proposing any changes to be done, but this looks like a very
> interesting topic for thought/hack/learning, so the following are only
> for thought exercises ....
>
>
> HBase enforces a single write/read entry point, so you can achieve
> strong consistency by writing/reading only one node.  but just writing
> to one node exposes you to loss of data if that node fails. so the
> region server HLog is replicated to 3 HDFS data nodes.  the
> interesting thing here is that each replica sees a complete *prefix*
> of the HLog: it won't miss a record, if a record sync() to a data node
> fails, all the existing bytes in the block are replicated to a new
> data node.
>
> if we employ a similar "leader" node among the N replicas of
> cassandra (coordinator always waits for the reply from leader, but
> leader does not do further replication like in HBase or counters), the
> leader sees all writes onto the key range, but the other replicas
> could miss some writes, as a result, each of the non-leader replicas'
> write history has some "holes", so when the leader dies, and when we
> elect a new one, no one is going to have a complete history. so you'd
> have to do a repair amongst all the replicas to reconstruct the full
> history, which is slow.
>
> it seems possible that we could utilize the FIFO property of the
> InComingTCPConnection to simplify history reconstruction, just like
> Zookeeper. if the IncomingTcpConnection of a replica fails, that means
> that it may have missed some edits, then when it reconnects, we force
> it to talk to the active leader first, to catch up to date. when the
> leader dies, the next leader is elected to be the replica with the
> most recent history.  by maintaining the property that each node has a
> complete prefix of history, we only need to catch up on the tail of
> history, and avoid doing a complete repair on the entire
> memtable+SStable.  but one issue is that the history at the leader has
> to be kept really long ----- if a non-leader replica goes off for 2
> days, the leader has to keep all the history for 2 days to feed them
> to the replica when it comes back online. but possibly this could be
> limited to some max length so that over that length, the woken replica
> simply does a complete bootstrap.
>
>
> thanks
> yang

Re: Strong Consistency with ONE read/writes

Posted by Yang <te...@gmail.com>.
I'm not proposing any changes to be done, but this looks like a very
interesting topic for thought/hack/learning, so the following are only
for thought exercises ....


HBase enforces a single write/read entry point, so you can achieve
strong consistency by writing/reading only one node.  But just writing
to one node exposes you to loss of data if that node fails, so the
region server's HLog is replicated to 3 HDFS data nodes.  The
interesting thing here is that each replica sees a complete *prefix*
of the HLog: it won't miss a record, because if a sync() to a data node
fails, all the existing bytes in the block are replicated to a new
data node.

If we employ a similar "leader" node among the N replicas of
Cassandra (the coordinator always waits for the reply from the leader,
but the leader does not do further replication the way HBase or
counters do), the leader sees all writes onto its key range, but the
other replicas could miss some writes.  As a result, each of the
non-leader replicas' write history has some "holes", so when the
leader dies and we elect a new one, no other replica is going to have
a complete history.  You'd then have to do a repair amongst all the
replicas to reconstruct the full history, which is slow.
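
A rough sketch of that write path in Java (purely hypothetical types,
not Cassandra code): the coordinator still fans the mutation out to
every replica, but only the leader's ack counts toward success, so the
leader's log never has holes while the followers' logs might.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Hypothetical sketch, not Cassandra code: the coordinator fans the
    // mutation out to every replica but only returns once the designated
    // leader for the key range has acked, so the leader's log has no holes.
    class LeaderAckWrite {
        interface Replica { void sendMutation(byte[] mutation, Runnable onAck); }

        static void write(Replica leader, List<Replica> followers,
                          byte[] mutation, long timeoutMs)
                throws TimeoutException, InterruptedException {
            CountDownLatch leaderAck = new CountDownLatch(1);
            leader.sendMutation(mutation, leaderAck::countDown);
            for (Replica f : followers) {
                // followers are best-effort here; they may end up with holes
                f.sendMutation(mutation, () -> { });
            }
            if (!leaderAck.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("leader did not ack in time");
            }
        }
    }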

It seems possible that we could utilize the FIFO property of the
IncomingTcpConnection to simplify history reconstruction, just like
ZooKeeper does.  If the IncomingTcpConnection of a replica fails, the
replica may have missed some edits, so when it reconnects we force it
to talk to the active leader first and catch up to date.  When the
leader dies, the next leader is elected to be the replica with the
most recent history.  By maintaining the property that each node has a
complete prefix of history, we only need to catch up on the tail of
the history and avoid doing a complete repair on the entire
memtable+SSTable.  One issue is that the history at the leader has to
be kept really long: if a non-leader replica goes offline for 2 days,
the leader has to keep all the history for those 2 days to feed to the
replica when it comes back online.  But this could be limited to some
max length, so that beyond that length the woken replica simply does a
complete bootstrap.
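
And a minimal sketch of that catch-up/bootstrap rule (again, the
classes are hypothetical and nothing like this exists in the codebase):
the leader keeps a bounded, sequence-numbered history; a reconnecting
replica reports the last seqno it applied, and the leader either ships
the missing tail or tells it to do a full bootstrap if that tail has
already been truncated.

    import java.util.Collection;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical sketch, not Cassandra code.  Because a replica applies a
    // contiguous prefix of the leader's log (no holes), a reconnecting
    // replica only needs the tail after its last applied seqno, unless the
    // leader has already truncated that part of its history.
    class LeaderHistory {
        private final NavigableMap<Long, byte[]> log = new TreeMap<>();
        private long oldestRetained = 1;      // first seqno still kept
        private final long maxRetained;       // assume >= 1

        LeaderHistory(long maxRetained) { this.maxRetained = maxRetained; }

        synchronized void append(long seqno, byte[] mutation) {
            log.put(seqno, mutation);
            while (log.size() > maxRetained) {        // bound the history
                log.pollFirstEntry();
                oldestRetained = log.firstKey();
            }
        }

        // lastApplied = highest seqno the reconnecting replica has applied
        synchronized CatchUpReply catchUpFrom(long lastApplied) {
            if (lastApplied + 1 < oldestRetained) {
                return CatchUpReply.bootstrap();      // tail truncated
            }
            Collection<byte[]> tail =
                log.tailMap(lastApplied + 1, true).values();
            return CatchUpReply.tail(tail);
        }
    }

    class CatchUpReply {
        final boolean needsBootstrap;
        final Collection<byte[]> tail;
        private CatchUpReply(boolean b, Collection<byte[]> t) {
            needsBootstrap = b; tail = t;
        }
        static CatchUpReply bootstrap() { return new CatchUpReply(true, null); }
        static CatchUpReply tail(Collection<byte[]> t) {
            return new CatchUpReply(false, t);
        }
    }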


thanks
yang
On Sun, Jul 3, 2011 at 8:25 PM, AJ <aj...@dude.podzone.net> wrote:
> We seem to be having a fundamental misunderstanding.  Thanks for your
> comments. aj
>
> On 7/3/2011 8:28 PM, William Oberman wrote:
>
> I'm using cassandra as a tool, like a black box with a certain contract to
> the world.  Without modifying the "core", C* will send the updates to all
> replicas, so your plan would cause the extra write (for the placeholder).  I
> wasn't assuming a modification to how C* fundamentally works.
> Sounds like you are hacking (or at least looking) at the source, so all the
> power to you if/when you try these kind of changes.
> will

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
We seem to be having a fundamental misunderstanding.  Thanks for your 
comments. aj

On 7/3/2011 8:28 PM, William Oberman wrote:
> I'm using cassandra as a tool, like a black box with a certain 
> contract to the world.  Without modifying the "core", C* will send the 
> updates to all replicas, so your plan would cause the extra write (for 
> the placeholder).  I wasn't assuming a modification to how C* 
> fundamentally works.
>
> Sounds like you are hacking (or at least looking) at the source, so 
> all the power to you if/when you try these kind of changes.
>
> will
>


Re: Strong Consistency with ONE read/writes

Posted by William Oberman <ob...@civicscience.com>.
I'm using cassandra as a tool, like a black box with a certain contract to
the world.  Without modifying the "core", C* will send the updates to all
replicas, so your plan would cause the extra write (for the placeholder).  I
wasn't assuming a modification to how C* fundamentally works.
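
Roughly, the coordinator's write path at CL.ONE looks like the sketch
below (simplified and hypothetical, not the actual StorageProxy code):
the mutation goes to every replica regardless of consistency level;
CL only controls how many acks the coordinator waits for before
returning.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Simplified, hypothetical sketch of coordinator behaviour at CL.ONE;
    // this is not the real StorageProxy code.  Every live replica gets the
    // mutation; the consistency level only sets how many acks we wait for.
    class CoordinatorSketch {
        interface Replica { void sendMutation(byte[] mutation, Runnable onAck); }

        static void writeAtOne(List<Replica> replicas, byte[] mutation,
                               long timeoutMs)
                throws TimeoutException, InterruptedException {
            int blockFor = 1;                          // CL.ONE
            CountDownLatch acks = new CountDownLatch(blockFor);
            for (Replica r : replicas) {               // all replicas get it
                r.sendMutation(mutation, acks::countDown);
            }
            if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("no replica acked in time");
            }
        }
    }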

Sounds like you are hacking (or at least looking) at the source, so all the
power to you if/when you try these kinds of changes.

will

On Sun, Jul 3, 2011 at 8:45 PM, AJ <aj...@dude.podzone.net> wrote:

> On 7/3/2011 6:32 PM, William Oberman wrote:
>
> Was just going off of: " Send the value to the primary replica and send
> placeholder values to the other replicas".  Sounded like you wanted to write
> the value to one, and write the placeholder to N-1 to me.
>
>
> Yes, that is what I was suggesting.  The point of the placeholders is to
> handle the crash case that I talked about... "like" a WAL does.
>
>
> But, C* will propagate the value to N-1 eventually anyways, 'cause that's
> just what it does anyways :-)
>
>  will
>


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
On 7/3/2011 6:32 PM, William Oberman wrote:
> Was just going off of: " Send the value to the primary replica and 
> send placeholder values to the other replicas".  Sounded like you 
> wanted to write the value to one, and write the placeholder to N-1 to me.

Yes, that is what I was suggesting.  The point of the placeholders is to 
handle the crash case that I talked about... "like" a WAL does.
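
For concreteness, the client-side mechanics would be something like
this sketch (the Client interface is invented for illustration, not a
real driver API; it is only meant to show the timestamps involved):

    import java.util.List;

    // Hypothetical client-side sketch of the placeholder idea; the Client
    // interface is invented for illustration and is not a real driver API.
    class PlaceholderWrite {
        interface Client {
            void insert(String node, String key, String column,
                        byte[] value, long timestamp);
        }

        static final byte[] PENDING = "PENDING_UPDATE".getBytes();

        static void write(Client client, String primary,
                          List<String> otherReplicas,
                          String key, String column, byte[] value, long ts) {
            // real value to the primary at timestamp ts
            client.insert(primary, key, column, value, ts);
            // placeholders to the other replicas at ts - 1, so the real value
            // wins once normal replication delivers it; a read that still sees
            // PENDING_UPDATE means the primary died before it propagated
            for (String node : otherReplicas) {
                client.insert(node, key, column, PENDING, ts - 1);
            }
        }
    }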

> But, C* will propagate the value to N-1 eventually anyways, 'cause 
> that's just what it does anyways :-)
>
> will
>
> On Sun, Jul 3, 2011 at 7:47 PM, AJ <aj@dude.podzone.net 
> <ma...@dude.podzone.net>> wrote:
>
>     On 7/3/2011 3:49 PM, Will Oberman wrote:
>>     Why not send the value itself instead of a placeholder?  Now it
>>     takes 2x writes on a random node to do a single update (write
>>     placeholder, write update) and N*x writes from the client (write
>>     value, write placeholder to N-1). Where N is replication factor.
>>      Seems like extra network and IO instead of less...
>
>     To send the value to each node is 1.) unnecessary, 2.) will only
>     cause a large burst of network traffic.  Think about if it's a
>     large data value, such as a document.  Just let C* do it's thing. 
>     The extra messages are tiny and doesn't significantly increase
>     latency since they are all sent asynchronously.
>
>
>>     Of course, I still think this sounds like reimplementing
>>     Cassandra internals in a Cassandra client (just guessing, I'm not
>>     a cassandra dev)
>>
>
>     I don't see how.  Maybe you should take a peek at the source.
>
>
>>
>>     On Jul 3, 2011, at 5:20 PM, AJ <aj@dude.podzone.net
>>     <ma...@dude.podzone.net>> wrote:
>>
>>>     Yang,
>>>
>>>     How would you deal with the problem when the 1st node responds
>>>     success but then crashes before completely forwarding any
>>>     replicas?  Then, after switching to the next primary, a read
>>>     would return stale data.
>>>
>>>     Here's a quick-n-dirty way:  Send the value to the primary
>>>     replica and send placeholder values to the other replicas.  The
>>>     placeholder value is something like, "PENDING_UPDATE".  The
>>>     placeholder values are sent with timestamps 1 less than the
>>>     timestamp for the actual value that went to the primary.  Later,
>>>     when the changes propagate, the actual values will overwrite the
>>>     placeholders.  In event of a crash before the placeholder gets
>>>     overwritten, the next read value will tell the client so.  The
>>>     client will report to the user that the key/column is
>>>     unavailable.  The downside is you've overwritten your data and
>>>     maybe would like to know what the old data was!  But, maybe
>>>     there's another way using other columns or with MVCC.  The
>>>     client would want a success from the primary and the secondary
>>>     replicas to be certain of future read consistency in case the
>>>     primary goes down immediately as I said above.  The ability to
>>>     set an "update_pending" flag on any column value would probably
>>>     make this work.  But, I'll think more on this later.
>>>
>>>     aj
>>>
>>>     On 7/2/2011 10:55 AM, Yang wrote:
>>>>     there is a JIRA completed in 0.7.x that "Prefers" a certain
>>>>     node in snitch, so this does roughly what you want MOST of the
>>>>     time
>>>>
>>>>
>>>>     but the problem is that it does not GUARANTEE that the same
>>>>     node will always be read.  I recently read into the HBase vs
>>>>     Cassandra comparison thread that started after Facebook dropped
>>>>     Cassandra for their messaging system, and understood some of
>>>>     the differences. what you want is essentially what HBase does.
>>>>     the fundamental difference there is really due to the gossip
>>>>     protocol: it's a probablistic, or eventually consistent failure
>>>>     detector  while HBase/Google Bigtable use Zookeeper/Chubby to
>>>>     provide a strong failure detector (a distributed lock).  so in
>>>>     HBase, if a tablet server goes down, it really goes down, it
>>>>     can not re-grab the tablet from the new tablet server without
>>>>     going through a start up protocol (notifying the master, which
>>>>     would notify the clients etc),  in other words it is guaranteed
>>>>     that one tablet is served by only one tablet server at any
>>>>     given time.  in comparison the above JIRA only TRYIES to serve
>>>>     that key from one particular replica. HBase can have that
>>>>     guarantee because the group membership is maintained by the
>>>>     strong failure detector.
>>>>
>>>>     just for hacking curiosity, a strong failure detector +
>>>>     Cassandra replicas is not impossible (actually seems not
>>>>     difficult), although the performance is not clear. what would
>>>>     such a strong failure detector bring to Cassandra besides this
>>>>     ONE-ONE strong consistency ? that is an interesting question I
>>>>     think.
>>>>
>>>>     considering that HBase has been deployed on big clusters, it is
>>>>     probably OK with the performance of the strong  Zookeeper
>>>>     failure detector. then a further question was: why did Dynamo
>>>>     originally choose to use the probablistic failure detector? yes
>>>>     Dynamo's main theme is "eventually consistent", so the
>>>>     Phi-detector is **enough**, but if a strong detector buys us
>>>>     more with little cost, wouldn't that  be great?
>>>>
>>>>
>>>>
>>>>     On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net
>>>>     <ma...@dude.podzone.net>> wrote:
>>>>
>>>>         Is this possible?
>>>>
>>>>         All reads and writes for a given key will always go to the
>>>>         same node from a client.  It seems the only thing needed is
>>>>         to allow the clients to compute which node is the closest
>>>>         replica for the given key using the same algorithm C* uses.
>>>>          When the first replica receives the write request, it will
>>>>         write to itself which should complete before any of the
>>>>         other replicas and then return.  The loads should still
>>>>         stay balanced if using random partitioner.  If the first
>>>>         replica becomes unavailable (however that is defined), then
>>>>         the clients can send to the next replica in the ring and
>>>>         switch from ONE write/reads to QUORUM write/reads
>>>>         temporarily until the first replica becomes available
>>>>         again.  QUORUM is required since there could be some
>>>>         replicas that were not updated after the first replica went
>>>>         down.
>>>>
>>>>         Will this work?  The goal is to have strong consistency
>>>>         with a read/write consistency level as low as possible
>>>>         while secondarily a network performance boost.
>>>>
>>>>
>>>
>
>
>
>
> -- 
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com <ma...@civicscience.com>


Re: Strong Consistency with ONE read/writes

Posted by William Oberman <ob...@civicscience.com>.
Was just going off of: "Send the value to the primary replica and send
placeholder values to the other replicas".  Sounded to me like you wanted to
write the value to one node and the placeholder to the other N-1.  But C* will
propagate the value to those N-1 eventually anyway, 'cause that's just what it
does :-)

will

On Sun, Jul 3, 2011 at 7:47 PM, AJ <aj...@dude.podzone.net> wrote:

>
> On 7/3/2011 3:49 PM, Will Oberman wrote:
>
> Why not send the value itself instead of a placeholder?  Now it takes 2x
> the writes on each non-primary replica to do a single update (write the
> placeholder, then write the real update) and Nx the writes from the client
> (write the value, plus placeholders to the other N-1), where N is the
> replication factor.  Seems like extra network and IO instead of less...
>
>
> To send the value to each node is 1) unnecessary and 2) will only cause a
> large burst of network traffic.  Think about the case where it's a large data
> value, such as a document.  Just let C* do its thing.  The extra messages are
> tiny and don't significantly increase latency since they are all sent
> asynchronously.
>
>
>
> Of course, I still think this sounds like reimplementing Cassandra
> internals in a Cassandra client (just guessing, I'm not a cassandra dev)
>
>
> I don't see how.  Maybe you should take a peek at the source.
>
>
>
> On Jul 3, 2011, at 5:20 PM, AJ <aj...@dude.podzone.net> wrote:
>
>   Yang,
>
> How would you deal with the problem when the 1st node responds success but
> then crashes before completely forwarding any replicas?  Then, after
> switching to the next primary, a read would return stale data.
>
> Here's a quick-n-dirty way:  Send the value to the primary replica and send
> placeholder values to the other replicas.  The placeholder value is
> something like, "PENDING_UPDATE".  The placeholder values are sent with
> timestamps 1 less than the timestamp for the actual value that went to the
> primary.  Later, when the changes propagate, the actual values will
> overwrite the placeholders.  In event of a crash before the placeholder gets
> overwritten, the next read value will tell the client so.  The client will
> report to the user that the key/column is unavailable.  The downside is
> you've overwritten your data and maybe would like to know what the old data
> was!  But, maybe there's another way using other columns or with MVCC.  The
> client would want a success from the primary and the secondary replicas to
> be certain of future read consistency in case the primary goes down
> immediately as I said above.  The ability to set an "update_pending" flag on
> any column value would probably make this work.  But, I'll think more on
> this later.
>
> aj
>
> On 7/2/2011 10:55 AM, Yang wrote:
>
> there is a JIRA completed in 0.7.x that "Prefers" a certain node in snitch,
> so this does roughly what you want MOST of the time
>
>
>  but the problem is that it does not GUARANTEE that the same node will
> always be read.  I recently read into the HBase vs Cassandra comparison
> thread that started after Facebook dropped Cassandra for their messaging
> system, and understood some of the differences. what you want is essentially
> what HBase does. the fundamental difference there is really due to the
> gossip protocol: it's a probabilistic, or eventually consistent failure
> detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a
> strong failure detector (a distributed lock).  so in HBase, if a tablet
> server goes down, it really goes down, it can not re-grab the tablet from
> the new tablet server without going through a start up protocol (notifying
> the master, which would notify the clients etc),  in other words it is
> guaranteed that one tablet is served by only one tablet server at any given
> time.  in comparison the above JIRA only TRIES to serve that key from one
> particular replica. HBase can have that guarantee because the group
> membership is maintained by the strong failure detector.
>
>  just for hacking curiosity, a strong failure detector + Cassandra
> replicas is not impossible (actually seems not difficult), although the
> performance is not clear. what would such a strong failure detector bring to
> Cassandra besides this ONE-ONE strong consistency ? that is an interesting
> question I think.
>
>  considering that HBase has been deployed on big clusters, it is probably
> OK with the performance of the strong  Zookeeper failure detector. then a
> further question was: why did Dynamo originally choose to use the
> probabilistic failure detector? yes Dynamo's main theme is "eventually
> consistent", so the Phi-detector is **enough**, but if a strong detector
> buys us more with little cost, wouldn't that  be great?
>
>
>
> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj...@dude.podzone.net> wrote:
>
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same node from
>> a client.  It seems the only thing needed is to allow the clients to compute
>> which node is the closest replica for the given key using the same algorithm
>> C* uses.  When the first replica receives the write request, it will write
>> to itself which should complete before any of the other replicas and then
>> return.  The loads should still stay balanced if using random partitioner.
>>  If the first replica becomes unavailable (however that is defined), then
>> the clients can send to the next replica in the ring and switch from ONE
>> write/reads to QUORUM write/reads temporarily until the first replica
>> becomes available again.  QUORUM is required since there could be some
>> replicas that were not updated after the first replica went down.
>>
>> Will this work?  The goal is to have strong consistency with a read/write
>> consistency level as low as possible while secondarily a network performance
>> boost.
>>
>
>
>
>


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
On 7/3/2011 3:49 PM, Will Oberman wrote:
> Why not send the value itself instead of a placeholder?  Now it takes 2x 
> the writes on each non-primary replica to do a single update (write the 
> placeholder, then write the real update) and Nx the writes from the client 
> (write the value, plus placeholders to the other N-1), where N is the 
> replication factor.  Seems like extra network and IO instead of less...

To send the value to each node is 1) unnecessary and 2) will only cause a 
large burst of network traffic.  Think about the case where it's a large data 
value, such as a document.  Just let C* do its thing.  The extra messages are 
tiny and don't significantly increase latency since they are all sent 
asynchronously.

> Of course, I still think this sounds like reimplementing Cassandra 
> internals in a Cassandra client (just guessing, I'm not a cassandra dev)
>

I don't see how.  Maybe you should take a peek at the source.

>
> On Jul 3, 2011, at 5:20 PM, AJ <aj@dude.podzone.net 
> <ma...@dude.podzone.net>> wrote:
>
>> Yang,
>>
>> How would you deal with the problem when the 1st node responds 
>> success but then crashes before completely forwarding any replicas?  
>> Then, after switching to the next primary, a read would return stale 
>> data.
>>
>> Here's a quick-n-dirty way:  Send the value to the primary replica 
>> and send placeholder values to the other replicas.  The placeholder 
>> value is something like, "PENDING_UPDATE".  The placeholder values 
>> are sent with timestamps 1 less than the timestamp for the actual 
>> value that went to the primary.  Later, when the changes propagate, 
>> the actual values will overwrite the placeholders.  In event of a 
>> crash before the placeholder gets overwritten, the next read value 
>> will tell the client so.  The client will report to the user that the 
>> key/column is unavailable.  The downside is you've overwritten your 
>> data and maybe would like to know what the old data was!  But, maybe 
>> there's another way using other columns or with MVCC.  The client 
>> would want a success from the primary and the secondary replicas to 
>> be certain of future read consistency in case the primary goes down 
>> immediately as I said above.  The ability to set an "update_pending" 
>> flag on any column value would probably make this work.  But, I'll 
>> think more on this later.
>>
>> aj
>>
>> On 7/2/2011 10:55 AM, Yang wrote:
>>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in 
>>> snitch, so this does roughly what you want MOST of the time
>>>
>>>
>>> but the problem is that it does not GUARANTEE that the same node 
>>> will always be read.  I recently read into the HBase vs Cassandra 
>>> comparison thread that started after Facebook dropped Cassandra for 
>>> their messaging system, and understood some of the differences. what 
>>> you want is essentially what HBase does. the fundamental difference 
>>> there is really due to the gossip protocol: it's a probabilistic, or 
>>> eventually consistent failure detector  while HBase/Google Bigtable 
>>> use Zookeeper/Chubby to provide a strong failure detector (a 
>>> distributed lock).  so in HBase, if a tablet server goes down, it 
>>> really goes down, it can not re-grab the tablet from the new tablet 
>>> server without going through a start up protocol (notifying the 
>>> master, which would notify the clients etc),  in other words it is 
>>> guaranteed that one tablet is served by only one tablet server at 
>>> any given time.  in comparison the above JIRA only TRIES to serve 
>>> that key from one particular replica. HBase can have that guarantee 
>>> because the group membership is maintained by the strong failure 
>>> detector.
>>>
>>> just for hacking curiosity, a strong failure detector + Cassandra 
>>> replicas is not impossible (actually seems not difficult), although 
>>> the performance is not clear. what would such a strong failure 
>>> detector bring to Cassandra besides this ONE-ONE strong consistency 
>>> ? that is an interesting question I think.
>>>
>>> considering that HBase has been deployed on big clusters, it is 
>>> probably OK with the performance of the strong  Zookeeper failure 
>>> detector. then a further question was: why did Dynamo originally 
>>> choose to use the probabilistic failure detector? yes Dynamo's main 
>>> theme is "eventually consistent", so the Phi-detector is **enough**, 
>>> but if a strong detector buys us more with little cost, wouldn't 
>>> that  be great?
>>>
>>>
>>>
>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net 
>>> <ma...@dude.podzone.net>> wrote:
>>>
>>>     Is this possible?
>>>
>>>     All reads and writes for a given key will always go to the same
>>>     node from a client.  It seems the only thing needed is to allow
>>>     the clients to compute which node is the closest replica for the
>>>     given key using the same algorithm C* uses.  When the first
>>>     replica receives the write request, it will write to itself
>>>     which should complete before any of the other replicas and then
>>>     return.  The loads should still stay balanced if using random
>>>     partitioner.  If the first replica becomes unavailable (however
>>>     that is defined), then the clients can send to the next replica
>>>     in the ring and switch from ONE write/reads to QUORUM
>>>     write/reads temporarily until the first replica becomes
>>>     available again.  QUORUM is required since there could be some
>>>     replicas that were not updated after the first replica went down.
>>>
>>>     Will this work?  The goal is to have strong consistency with a
>>>     read/write consistency level as low as possible while
>>>     secondarily a network performance boost.
>>>
>>>
>>


Re: Strong Consistency with ONE read/writes

Posted by Will Oberman <ob...@civicscience.com>.
Why not send the value itself instead of a placeholder?  Now it takes 2x  
the writes on each non-primary replica to do a single update (write the  
placeholder, then write the real update) and Nx the writes from the client  
(write the value, plus placeholders to the other N-1), where N is the  
replication factor.  Seems like extra network and IO instead of less...
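
To put rough numbers on it (assuming RF=3 and that the primary still
propagates the real value to the other replicas as usual): a plain CL.ONE
write is 1 request from the client and 1 write per replica, while the
placeholder scheme is 3 requests from the client (1 value + 2 placeholders)
and 2 writes on each of the 2 non-primary replicas (the placeholder now, the
real value again when it propagates).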

Of course, I still think this sounds like reimplementing Cassandra  
internals in a Cassandra client (just guessing, I'm not a cassandra dev)


On Jul 3, 2011, at 5:20 PM, AJ <aj...@dude.podzone.net> wrote:

> Yang,
>
> How would you deal with the problem when the 1st node responds  
> success but then crashes before completely forwarding any replicas?   
> Then, after switching to the next primary, a read would return stale  
> data.
>
> Here's a quick-n-dirty way:  Send the value to the primary replica  
> and send placeholder values to the other replicas.  The placeholder  
> value is something like, "PENDING_UPDATE".  The placeholder values  
> are sent with timestamps 1 less than the timestamp for the actual  
> value that went to the primary.  Later, when the changes propagate,  
> the actual values will overwrite the placeholders.  In event of a  
> crash before the placeholder gets overwritten, the next read value  
> will tell the client so.  The client will report to the user that  
> the key/column is unavailable.  The downside is you've overwritten  
> your data and maybe would like to know what the old data was!  But,  
> maybe there's another way using other columns or with MVCC.  The  
> client would want a success from the primary and the secondary  
> replicas to be certain of future read consistency in case the  
> primary goes down immediately as I said above.  The ability to set  
> an "update_pending" flag on any column value would probably make  
> this work.  But, I'll think more on this later.
>
> aj
>
> On 7/2/2011 10:55 AM, Yang wrote:
>>
>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in  
>> snitch, so this does roughly what you want MOST of the time
>>
>>
>> but the problem is that it does not GUARANTEE that the same node  
>> will always be read.  I recently read into the HBase vs Cassandra  
>> comparison thread that started after Facebook dropped Cassandra for  
>> their messaging system, and understood some of the differences.  
>> what you want is essentially what HBase does. the fundamental  
>> difference there is really due to the gossip protocol: it's a  
>> probabilistic, or eventually consistent failure detector  while  
>> HBase/Google Bigtable use Zookeeper/Chubby to provide a strong  
>> failure detector (a distributed lock).  so in HBase, if a tablet  
>> server goes down, it really goes down, it can not re-grab the  
>> tablet from the new tablet server without going through a start up  
>> protocol (notifying the master, which would notify the clients  
>> etc),  in other words it is guaranteed that one tablet is served by  
>> only one tablet server at any given time.  in comparison the above  
>> JIRA only TRIES to serve that key from one particular replica.  
>> HBase can have that guarantee because the group membership is  
>> maintained by the strong failure detector.
>>
>> just for hacking curiosity, a strong failure detector + Cassandra  
>> replicas is not impossible (actually seems not difficult), although  
>> the performance is not clear. what would such a strong failure  
>> detector bring to Cassandra besides this ONE-ONE strong  
>> consistency ? that is an interesting question I think.
>>
>> considering that HBase has been deployed on big clusters, it is  
>> probably OK with the performance of the strong  Zookeeper failure  
>> detector. then a further question was: why did Dynamo originally  
>> choose to use the probabilistic failure detector? yes Dynamo's main  
>> theme is "eventually consistent", so the Phi-detector is  
>> **enough**, but if a strong detector buys us more with little cost,  
>> wouldn't that  be great?
>>
>>
>>
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj...@dude.podzone.net> wrote:
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same  
>> node from a client.  It seems the only thing needed is to allow the  
>> clients to compute which node is the closest replica for the given  
>> key using the same algorithm C* uses.  When the first replica  
>> receives the write request, it will write to itself which should  
>> complete before any of the other replicas and then return.  The  
>> loads should still stay balanced if using random partitioner.  If  
>> the first replica becomes unavailable (however that is defined),  
>> then the clients can send to the next replica in the  
>> ring and switch from ONE write/reads to QUORUM write/reads  
>> temporarily until the first replica becomes available again.   
>> QUORUM is required since there could be some replicas that were not  
>> updated after the first replica went down.
>>
>> Will this work?  The goal is to have strong consistency with a read/ 
>> write consistency level as low as possible while secondarily a  
>> network performance boost.
>>
>

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
On 7/3/2011 4:07 PM, Yang wrote:
>
> I'm no expert, so addressing the question to me probably won't get you real 
> answers :)
>
> The single entry mode makes sure that all writes coming through the 
> leader are received by the replicas before the ack to the client, so there 
> probably won't be stale data.
>

That doesn't sound any different than a TWO write.  I'm trying to save a 
hop (+ 1 data xfer) by ack'ing immediately after the primary 
successfully writes, i.e., ONE write.
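
For what it's worth, here's a rough sketch of the client-side routing I have
in mind.  It is only a sketch: the MD5 token calculation mimics
RandomPartitioner, and the token->node ring map is something the client would
have to fetch and keep in sync with the cluster itself, so treat the class and
method names as hypothetical.

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: pick the "first" replica for a key the way
// RandomPartitioner would, so the client can send straight to that node
// and skip the coordinator hop.
public class PrimaryPicker {

    // token -> node address; in a real client this would be refreshed from
    // the cluster (e.g. via describe_ring), here it is supplied up front.
    private final TreeMap<BigInteger, String> ring;

    public PrimaryPicker(SortedMap<BigInteger, String> tokenToNode) {
        this.ring = new TreeMap<BigInteger, String>(tokenToNode);
    }

    // RandomPartitioner-style token: abs(MD5(key)).
    public static BigInteger token(byte[] key) throws Exception {
        return new BigInteger(MessageDigest.getInstance("MD5").digest(key)).abs();
    }

    // First replica = owner of the smallest ring token >= token(key),
    // wrapping around to the lowest token if we run off the end.
    public String primaryFor(byte[] key) throws Exception {
        BigInteger t = token(key);
        SortedMap<BigInteger, String> tail = ring.tailMap(t);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }
}

The ONE-to-QUORUM fallback when that node looks dead would sit on top of
this, in the client.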

> On Jul 3, 2011 11:20 AM, "AJ" <aj@dude.podzone.net 
> <ma...@dude.podzone.net>> wrote:
> > Yang,
> >
> > How would you deal with the problem when the 1st node responds success
> > but then crashes before completely forwarding any replicas? Then, after
> > switching to the next primary, a read would return stale data.
> >
> > Here's a quick-n-dirty way: Send the value to the primary replica and
> > send placeholder values to the other replicas. The placeholder value is
> > something like, "PENDING_UPDATE". The placeholder values are sent with
> > timestamps 1 less than the timestamp for the actual value that went to
> > the primary. Later, when the changes propagate, the actual values will
> > overwrite the placeholders. In event of a crash before the placeholder
> > gets overwritten, the next read value will tell the client so. The
> > client will report to the user that the key/column is unavailable. The
> > downside is you've overwritten your data and maybe would like to know
> > what the old data was! But, maybe there's another way using other
> > columns or with MVCC. The client would want a success from the primary
> > and the secondary replicas to be certain of future read consistency in
> > case the primary goes down immediately as I said above. The ability to
> > set an "update_pending" flag on any column value would probably make
> > this work. But, I'll think more on this later.
> >
> > aj
> >
> > On 7/2/2011 10:55 AM, Yang wrote:
> >> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
> >> snitch, so this does roughly what you want MOST of the time
> >>
> >>
> >> but the problem is that it does not GUARANTEE that the same node will
> >> always be read. I recently read into the HBase vs Cassandra
> >> comparison thread that started after Facebook dropped Cassandra for
> >> their messaging system, and understood some of the differences. what
> >> you want is essentially what HBase does. the fundamental difference
> >> there is really due to the gossip protocol: it's a probabilistic, or
> >> eventually consistent failure detector while HBase/Google Bigtable
> >> use Zookeeper/Chubby to provide a strong failure detector (a
> >> distributed lock). so in HBase, if a tablet server goes down, it
> >> really goes down, it can not re-grab the tablet from the new tablet
> >> server without going through a start up protocol (notifying the
> >> master, which would notify the clients etc), in other words it is
> >> guaranteed that one tablet is served by only one tablet server at any
> >> given time. in comparison the above JIRA only TRIES to serve that
> >> key from one particular replica. HBase can have that guarantee because
> >> the group membership is maintained by the strong failure detector.
> >>
> >> just for hacking curiosity, a strong failure detector + Cassandra
> >> replicas is not impossible (actually seems not difficult), although
> >> the performance is not clear. what would such a strong failure
> >> detector bring to Cassandra besides this ONE-ONE strong consistency ?
> >> that is an interesting question I think.
> >>
> >> considering that HBase has been deployed on big clusters, it is
> >> probably OK with the performance of the strong Zookeeper failure
> >> detector. then a further question was: why did Dynamo originally
> >> choose to use the probabilistic failure detector? yes Dynamo's main
> >> theme is "eventually consistent", so the Phi-detector is **enough**,
> >> but if a strong detector buys us more with little cost, wouldn't that
> >> be great?
> >>
> >>
> >>
> >> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net 
> <ma...@dude.podzone.net>
> >> <mailto:aj@dude.podzone.net <ma...@dude.podzone.net>>> wrote:
> >>
> >> Is this possible?
> >>
> >> All reads and writes for a given key will always go to the same
> >> node from a client. It seems the only thing needed is to allow
>> the clients to compute which node is the closest replica for the
> >> given key using the same algorithm C* uses. When the first
> >> replica receives the write request, it will write to itself which
> >> should complete before any of the other replicas and then return.
> >> The loads should still stay balanced if using random partitioner.
> >> If the first replica becomes unavailable (however that is
>> defined), then the clients can send to the next replica in the
> >> ring and switch from ONE write/reads to QUORUM write/reads
> >> temporarily until the first replica becomes available again.
> >> QUORUM is required since there could be some replicas that were
> >> not updated after the first replica went down.
> >>
> >> Will this work? The goal is to have strong consistency with a
> >> read/write consistency level as low as possible while secondarily
> >> a network performance boost.
> >>
> >>
> >


Re: Strong Consistency with ONE read/writes

Posted by Yang <te...@gmail.com>.
I'm no expert, so addressing the question to me probably won't get you real
answers :)

The single entry mode makes sure that all writes coming through the leader
are received by the replicas before the ack to the client, so there probably
won't be stale data.
On Jul 3, 2011 11:20 AM, "AJ" <aj...@dude.podzone.net> wrote:
> Yang,
>
> How would you deal with the problem when the 1st node responds success
> but then crashes before completely forwarding any replicas? Then, after
> switching to the next primary, a read would return stale data.
>
> Here's a quick-n-dirty way: Send the value to the primary replica and
> send placeholder values to the other replicas. The placeholder value is
> something like, "PENDING_UPDATE". The placeholder values are sent with
> timestamps 1 less than the timestamp for the actual value that went to
> the primary. Later, when the changes propagate, the actual values will
> overwrite the placeholders. In event of a crash before the placeholder
> gets overwritten, the next read value will tell the client so. The
> client will report to the user that the key/column is unavailable. The
> downside is you've overwritten your data and maybe would like to know
> what the old data was! But, maybe there's another way using other
> columns or with MVCC. The client would want a success from the primary
> and the secondary replicas to be certain of future read consistency in
> case the primary goes down immediately as I said above. The ability to
> set an "update_pending" flag on any column value would probably make
> this work. But, I'll think more on this later.
>
> aj
>
> On 7/2/2011 10:55 AM, Yang wrote:
>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>> snitch, so this does roughly what you want MOST of the time
>>
>>
>> but the problem is that it does not GUARANTEE that the same node will
>> always be read. I recently read into the HBase vs Cassandra
>> comparison thread that started after Facebook dropped Cassandra for
>> their messaging system, and understood some of the differences. what
>> you want is essentially what HBase does. the fundamental difference
>> there is really due to the gossip protocol: it's a probabilistic, or
>> eventually consistent failure detector while HBase/Google Bigtable
>> use Zookeeper/Chubby to provide a strong failure detector (a
>> distributed lock). so in HBase, if a tablet server goes down, it
>> really goes down, it can not re-grab the tablet from the new tablet
>> server without going through a start up protocol (notifying the
>> master, which would notify the clients etc), in other words it is
>> guaranteed that one tablet is served by only one tablet server at any
>> given time. in comparison the above JIRA only TRIES to serve that
>> key from one particular replica. HBase can have that guarantee because
>> the group membership is maintained by the strong failure detector.
>>
>> just for hacking curiosity, a strong failure detector + Cassandra
>> replicas is not impossible (actually seems not difficult), although
>> the performance is not clear. what would such a strong failure
>> detector bring to Cassandra besides this ONE-ONE strong consistency ?
>> that is an interesting question I think.
>>
>> considering that HBase has been deployed on big clusters, it is
>> probably OK with the performance of the strong Zookeeper failure
>> detector. then a further question was: why did Dynamo originally
>> choose to use the probabilistic failure detector? yes Dynamo's main
>> theme is "eventually consistent", so the Phi-detector is **enough**,
>> but if a strong detector buys us more with little cost, wouldn't that
>> be great?
>>
>>
>>
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net
>> <ma...@dude.podzone.net>> wrote:
>>
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same
>> node from a client. It seems the only thing needed is to allow
>> the clients to compute which node is the closest replica for the
>> given key using the same algorithm C* uses. When the first
>> replica receives the write request, it will write to itself which
>> should complete before any of the other replicas and then return.
>> The loads should still stay balanced if using random partitioner.
>> If the first replica becomes unavailable (however that is
>> defined), then the clients can send to the next replica in the
>> ring and switch from ONE write/reads to QUORUM write/reads
>> temporarily until the first replica becomes available again.
>> QUORUM is required since there could be some replicas that were
>> not updated after the first replica went down.
>>
>> Will this work? The goal is to have strong consistency with a
>> read/write consistency level as low as possible while secondarily
>> a network performance boost.
>>
>>
>

Re: Strong Consistency with ONE read/writes

Posted by AJ <aj...@dude.podzone.net>.
Yang,

How would you deal with the problem when the 1st node responds success 
but then crashes before completely forwarding any replicas?  Then, after 
switching to the next primary, a read would return stale data.

Here's a quick-n-dirty way:  Send the value to the primary replica and 
send placeholder values to the other replicas.  The placeholder value is 
something like, "PENDING_UPDATE".  The placeholder values are sent with 
timestamps 1 less than the timestamp for the actual value that went to 
the primary.  Later, when the changes propagate, the actual values will 
overwrite the placeholders.  In event of a crash before the placeholder 
gets overwritten, the next read value will tell the client so.  The 
client will report to the user that the key/column is unavailable.  The 
downside is you've overwritten your data and maybe would like to know 
what the old data was!  But, maybe there's another way using other 
columns or with MVCC.  The client would want a success from the primary 
and the secondary replicas to be certain of future read consistency in 
case the primary goes down immediately as I said above.  The ability to 
set an "update_pending" flag on any column value would probably make 
this work.  But, I'll think more on this later.
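
Roughly, the client-side write would look something like this (the
ReplicaClient handle and its insert() call are made up for illustration, not
a real driver API; replicas.get(0) is assumed to be the primary for the key):

import java.util.List;

// Sketch of the placeholder idea: the real value goes to the primary at
// timestamp T, "PENDING_UPDATE" goes to the other replicas at T-1, so the
// real value wins whenever it propagates.  ReplicaClient is hypothetical.
public class PlaceholderWriter {

    interface ReplicaClient {
        void insert(String key, String column, byte[] value, long timestamp)
                throws Exception;
    }

    static final byte[] PLACEHOLDER = "PENDING_UPDATE".getBytes();

    static void write(List<ReplicaClient> replicas, String key, String column,
                      byte[] value) throws Exception {
        long ts = System.currentTimeMillis() * 1000L;     // microseconds, as usual in C*
        replicas.get(0).insert(key, column, value, ts);   // real value to the primary
        for (int i = 1; i < replicas.size(); i++) {
            replicas.get(i).insert(key, column, PLACEHOLDER, ts - 1);  // loses to ts
        }
    }
}

A later read that comes back as PENDING_UPDATE is the signal that the primary
died before it could propagate the real value.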

aj

On 7/2/2011 10:55 AM, Yang wrote:
> there is a JIRA completed in 0.7.x that "Prefers" a certain node in 
> snitch, so this does roughly what you want MOST of the time
>
>
> but the problem is that it does not GUARANTEE that the same node will 
> always be read.  I recently read into the HBase vs Cassandra 
> comparison thread that started after Facebook dropped Cassandra for 
> their messaging system, and understood some of the differences. what 
> you want is essentially what HBase does. the fundamental difference 
> there is really due to the gossip protocol: it's a probabilistic, or 
> eventually consistent failure detector  while HBase/Google Bigtable 
> use Zookeeper/Chubby to provide a strong failure detector (a 
> distributed lock).  so in HBase, if a tablet server goes down, it 
> really goes down, it can not re-grab the tablet from the new tablet 
> server without going through a start up protocol (notifying the 
> master, which would notify the clients etc),  in other words it is 
> guaranteed that one tablet is served by only one tablet server at any 
> given time.  in comparison the above JIRA only TRIES to serve that 
> key from one particular replica. HBase can have that guarantee because 
> the group membership is maintained by the strong failure detector.
>
> just for hacking curiosity, a strong failure detector + Cassandra 
> replicas is not impossible (actually seems not difficult), although 
> the performance is not clear. what would such a strong failure 
> detector bring to Cassandra besides this ONE-ONE strong consistency ? 
> that is an interesting question I think.
>
> considering that HBase has been deployed on big clusters, it is 
> probably OK with the performance of the strong  Zookeeper failure 
> detector. then a further question was: why did Dynamo originally 
> choose to use the probabilistic failure detector? yes Dynamo's main 
> theme is "eventually consistent", so the Phi-detector is **enough**, 
> but if a strong detector buys us more with little cost, wouldn't that 
>  be great?
>
>
>
> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net 
> <ma...@dude.podzone.net>> wrote:
>
>     Is this possible?
>
>     All reads and writes for a given key will always go to the same
>     node from a client.  It seems the only thing needed is to allow
>     the clients to compute which node is the closest replica for the
>     given key using the same algorithm C* uses.  When the first
>     replica receives the write request, it will write to itself which
>     should complete before any of the other replicas and then return.
>      The loads should still stay balanced if using random partitioner.
>      If the first replica becomes unavailable (however that is
>     defined), then the clients can send to the next replica in the
>     ring and switch from ONE write/reads to QUORUM write/reads
>     temporarily until the first replica becomes available again.
>      QUORUM is required since there could be some replicas that were
>     not updated after the first replica went down.
>
>     Will this work?  The goal is to have strong consistency with a
>     read/write consistency level as low as possible while secondarily
>     a network performance boost.
>
>


Re: Strong Consistency with ONE read/writes

Posted by Yang <te...@gmail.com>.
there is a JIRA completed in 0.7.x that "Prefers" a certain node in snitch,
so this does roughly what you want MOST of the time


but the problem is that it does not GUARANTEE that the same node will always
be read.  I recently read into the HBase vs Cassandra comparison thread that
started after Facebook dropped Cassandra for their messaging system, and
understood some of the differences. what you want is essentially what HBase
does. the fundamental difference there is really due to the gossip protocol:
it's a probabilistic, or eventually consistent failure detector  while
HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure
detector (a distributed lock).  so in HBase, if a tablet server goes down,
it really goes down, it can not re-grab the tablet from the new tablet
server without going through a start up protocol (notifying the master,
which would notify the clients etc),  in other words it is guaranteed that
one tablet is served by only one tablet server at any given time.  in
comparison the above JIRA only TRIES to serve that key from one particular
replica. HBase can have that guarantee because the group membership is
maintained by the strong failure detector.

just for hacking curiosity, a strong failure detector + Cassandra replicas
is not impossible (actually seems not difficult), although the performance
is not clear. what would such a strong failure detector bring to Cassandra
besides this ONE-ONE strong consistency ? that is an interesting question I
think.

considering that HBase has been deployed on big clusters, it is probably OK
with the performance of the strong  Zookeeper failure detector. then a
further question was: why did Dynamo originally choose to use the
probabilistic failure detector? yes Dynamo's main theme is "eventually
consistent", so the Phi-detector is **enough**, but if a strong detector
buys us more with little cost, wouldn't that  be great?
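
just to make the hand-waving concrete: the strong detector could be as little
as each would-be primary holding an ephemeral znode per token range in
ZooKeeper.  the znode disappears the moment the owner's session dies, so there
is never more than one live primary for a range.  a rough sketch with the
stock ZooKeeper client API (the paths, timeouts and fallback wiring here are
all made up):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of a strong failure detector: ownership of a token range is an
// ephemeral znode, so at most one node can claim it and the claim dies with
// the owner's session.  Assumes the /primaries parent znode already exists.
public class RangeOwnership implements Watcher {

    private final ZooKeeper zk;

    public RangeOwnership(String zkHosts) throws Exception {
        this.zk = new ZooKeeper(zkHosts, 30000, this);
    }

    // Try to become the primary for a range; false if someone else holds it.
    public boolean tryAcquire(String range, String myAddress) throws Exception {
        try {
            zk.create("/primaries/" + range, myAddress.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException someoneElseOwnsIt) {
            return false;
        }
    }

    // Clients (or the ONE -> QUORUM fallback logic) look up the owner here.
    public String currentOwner(String range) throws Exception {
        return new String(zk.getData("/primaries/" + range, false, null));
    }

    public void process(WatchedEvent event) {
        // session expiry / disconnect events would drive re-election and the
        // client-side fallback to QUORUM
    }
}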



On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj...@dude.podzone.net> wrote:

> Is this possible?
>
> All reads and writes for a given key will always go to the same node from a
> client.  It seems the only thing needed is to allow the clients to compute
> which node is the closest replica for the given key using the same algorithm
> C* uses.  When the first replica receives the write request, it will write
> to itself which should complete before any of the other replicas and then
> return.  The loads should still stay balanced if using random partitioner.
>  If the first replica becomes unavailable (however that is defined), then
> the clients can send to the next replica in the ring and switch from ONE
> write/reads to QUORUM write/reads temporarily until the first replica
> becomes available again.  QUORUM is required since there could be some
> replicas that were not updated after the first replica went down.
>
> Will this work?  The goal is to have strong consistency with a read/write
> consistency level as low as possible while secondarily a network performance
> boost.
>