You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Baskar Duraikannu <ba...@gmail.com> on 2011/04/18 17:43:52 UTC

Multi-DC Deployment

We are planning to deploy Cassandra on two data centers.   Let us say that we went with three replicas with 2 being in one data center and last replica in 2nd Data center. 

What will happen to Quorum Reads and Writes when DC1 goes down (2 of 3 replicas are unreachable)?  Will they timeout? 


Regards,
Baskar

Re: Multi-DC Deployment

Posted by Peter Schuller <pe...@infidyne.com>.

> Again, for a lot of services, it is fully acceptable, and a lot better, to
> return an almost complete (or maybe even complete, but no verified by
> quorum) result than no result at all.

+1, except maybe "a lot" depending on how one chooses to define that.
There are definitely cases where sufficient information propagating
end-to-end between clients, services and storage systems can be very
conducive towards providing a better end-user experience; particularly
in terms of more graceful degradation when you do have failures that
exceed that which was planned for.

Introducing fallback-to-lower-consistency has significant issues; it's
difficult to get right in a way that is actually useful in degraded
conditions while also not at the same time contributing to your
service committing suicide by magnifying already existing problems by
adding to the load in an attempt to degrade gracefully.

So... just wanted to voice that I think you're raising valid use cases here.

-- 
/ Peter Schuller

Re: Multi-DC Deployment

Posted by Terje Marthinussen <tm...@gmail.com>.

Sure, the update queue could just as well replicate problems, but the queue
would be a lot simpler than cassandra and it would not modify already
acknowledged data like like for instance compaction or read-repair/hint
deliveries may. There is a fair bit of re-writing/re-assemblying of data
even though it is actually never updated since the original data was
acknowledged. From a statistically viewpoint, there is clearly a noticeable
risk in data getting messed up internally in Cassandra during these
operations.

Of course, in a perfectly replicated case, the same error is likely to occur
even in two isolated systems if they have the exact same data in the same
order,  but lets add more fun things that replicate well such as operator
mistakes. Yes, isolated systems increases maintenance work, and increase in
work increases risk of mistakes, but it reduces risk that you do a mistake
that brings down everything.

Bottom line is, I am far from convinced about the benefits of one big
magical system across all datacenters vs. more isolated setups.
It is not a Cassandra specific concern though. It applies to any system.

Backups needs to be there anyway of course, but if you have a system that is
somewhat working ok, but compactions have stopped due to a bad sstable, how
do you recover that from backup without taking down the service or going
through the interesting task of splitting up the cassandra setup, recover on
a limited set of nodes and then re-joining again? (maybe easier to do with
datacenter replication, haven't actually thought about how to do it there).

Yes, I realize that I could do 2 queries (or more) to get the
wanted behavior, but doubling the number of queries when the system is
already in trouble is rarely a good idea.

Terje


On Thu, Apr 21, 2011 at 10:49 AM, Adrian Cockcroft <
adrian.cockcroft@gmail.com> wrote:

> Queues replicate bad data just as well as anything else. The biggest
> source of bad data is broken app code... You will still need to
> implement a reconciliation/repair checker, as queues have their own
> failure modes when they get backed up. We have also looked at using
> queues to bounce data between cassandra clusters for other reasons,
> and they have their place. However it is a lot more work to implement
> than using existing well tested Cassandra functionality to do it for
> us.
>
> I think your code needs to retry a failed local-quorum read with a
> read-one to get the behavior you are asking for.
>
> Our approach to bad data and corruption issues is backups, wind back
> to the last good snapshot. We have figured out incremental backups as
> well as full. Our code has some local dependencies, but could be the
> basis for a generic solution.
>
> Adrian
>
> On Wed, Apr 20, 2011 at 6:08 PM, Terje Marthinussen
> <tm...@gmail.com> wrote:
> > Assuming that you generally put an API on top of this, delivering to two
> or
> > more systems then boils down to a message queue issue or some similar
> > mechanism which handles secure delivery of messages. Maybe not trivial,
> but
> > there are many products that can help you with this, and it is a lot
> easier
> > to implement than a fully distributed storage system.
> > Yes, ideally Cassandra will not distribute corruption, but the reason you
> > pay up to have 2 fully redundant setups in 2 different datacenters is
> > because we do not live in an ideal world. Anyone having tested Cassandra
> > since 0.7.0 with any real data will be able to testify how well it can
> mess
> > things up.
> > This is not specific to Cassandra, in fact, I would argue thats this is
> in
> > the blood of any distributed system. You want them to distribute after
> all
> > and the tighter the coupling is between nodes, the better they distribute
> > bad stuff as well as good stuff.
> > There is a bigger risk for a complete failure with 2 tightly coupled
> > redundant systems than with 2 almost completely isolated ones. The logic
> > here is so simple it is really somewhat beyond discussion.
> > There are a few other advantages of isolating the systems. Especially in
> > terms of operation, 2 isolated systems would be much easier as you could
> > relatively risk fee try out a new cassandra in one datacenter or upgrade
> one
> > datacenter at a time if you needed major operational changes such as
> schema
> > changes or other large changes to the data.
> > I see the 2 copies in one datacenters + 1(or maybe 2) in another as a
> "low
> > cost" middleway between 2 full N+2 (RF=3) systems in both data centers.
> > That is, in a traditional design where you need 1 node for normal
> service,
> > you would have 1 extra replicate for redundancy and one replica more (N+2
> > redundancy) so you can do maintenance and still be redundant.
> > If I have redundancy across datacenters, I would probably still want 2
> > replicas to avoid network traffic between DCs in case of a node recovery,
> > but N+2 may not be needed as my risk policy may find it acceptable to run
> > one datacenters without redundancy for a time limited period for
> > maintenance.
> > That is, if my original requirement is 1 node, I could do with 3x the HW
> > which is not all that much more than the 3x I need for one DC and a lot
> less
> > than the 6x I need for 2 full N+2 systems.
> > However, all of the above is really beyond the point of my original
> > suggestion.
> > Regardless of datacenters, redundancy and distribution of bad or good
> stuff,
> > it would be good to have a way to return whatever data is there, but with
> a
> > flag or similar stating that the consistency level was not met.
> > Again, for a lot of services, it is fully acceptable, and a lot better,
> to
> > return an almost complete (or maybe even complete, but no verified by
> > quorum) result than no result at all.
> > As far as I remember from the code, this just boils down to returning
> > whatever you collected from the cluster and setting the proper flag or
> > similar on the resultset rather than returning an error.
> > Terje
> > On Thu, Apr 21, 2011 at 5:01 AM, Adrian Cockcroft
> > <ad...@gmail.com> wrote:
> >>
> >> Hi Terje,
> >>
> >> If you feed data to two rings, you will get inconsistency drift as an
> >> update to one succeeds and to the other fails from time to time. You
> >> would have to build your own read repair. This all starts to look like
> >> "I don't trust Cassandra code to work, so I will write my own buggy
> >> one off versions of Cassandra functionality". I lean towards using
> >> Cassandra features rather than rolling my own because there is a large
> >> community testing, fixing and extending Cassandra, and making sure
> >> that the algorithms are robust. Distributed systems are very hard to
> >> get right, I trust lots of users and eyeballs on the code more than
> >> even the best engineer working alone.
> >>
> >> Cassandra doesn't "replicate sstable corruptions". It detects corrupt
> >> data and only replicates good data. Also data isn't replicated to
> >> three identical nodes in the way you imply, it's replicated around the
> >> ring. If you lose three nodes, you don't lose a whole node's worth of
> >> data.  We configure each replica to be in a different availability
> >> zone so that we can lose a third of our nodes (a whole zone) and still
> >> work. On a 300 node system with RF=3 and no zones, losing one or two
> >> nodes you still have all your data, and can repair the loss quickly.
> >> With three nodes dead at once you don't lose 1% of the data (3/300) I
> >> think you lose 1/(300*300*300) of the data (someone check my math?).
> >>
> >> If you want to always get a result, then you use "read one", if you
> >> want to get a highly available better quality result use local quorum.
> >> That is a per-query option.
> >>
> >> Adrian
> >>
> >> On Tue, Apr 19, 2011 at 6:46 PM, Terje Marthinussen
> >> <tm...@gmail.com> wrote:
> >> > If you have RF=3 in both datacenters, it could be discussed if there
> is
> >> > a
> >> > point to use the built in replication in Cassandra at all vs. feeding
> >> > the
> >> > data to both datacenters and get 2 100% isolated cassandra instances
> >> > that
> >> > cannot replicate sstable corruptions between each others....
> >> > My point is really a bit more general though.
> >> > For a lot services (especially Internet based ones) 100% accuracy in
> >> > terms
> >> > of results is not needed (or maybe even expected)
> >> > While you want to serve a 100% correct result if you can (using
> quorum),
> >> > it
> >> > is still much better to serve a partial result than no result at all.
> >> > Lets say you have 300 nodes in your ring, one document manages to
> >> > trigger a
> >> > bug in cassandra that brings down a node with all its replicas (3
> nodes
> >> > down)
> >> > For many use cases, it would be much better to return the remaining
> 99%
> >> > of
> >> > the data coming from the 297 working nodes than having a service which
> >> > returns nothing at all.
> >> > I would however like the frontend to realize that this is an
> incomplete
> >> > result so it is possible for it to react accordingly as well as be
> part
> >> > of
> >> > monitoring of the cassandra ring.
> >> > Regards,
> >> > Terje
> >> >
> >> > On Tue, Apr 19, 2011 at 6:06 PM, Adrian Cockcroft
> >> > <ad...@gmail.com> wrote:
> >> >>
> >> >> If you want to use local quorum for a distributed setup, it doesn't
> >> >> make sense to have less than RF=3 local and remote. Three copies at
> >> >> both ends will give you high availability. Only one copy of the data
> >> >> is sent over the wide area link (with recent versions).
> >> >>
> >> >> There is no need to use mirrored or RAID5 disk in each node in this
> >> >> case, since you are using RAIN (N for nodes) to protect your data. So
> >> >> the extra disk space to hold three copies at each end shouldn't be a
> >> >> big deal. Netflix is using striped internal disks on EC2 nodes for
> >> >> this.
> >> >>
> >> >> Adrian
> >> >>
> >> >> On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
> >> >> <tm...@gmail.com> wrote:
> >> >> > Hum...
> >> >> > Seems like it could be an idea in a case like this with a mode
> where
> >> >> > result
> >> >> > is always returned (if possible), but where a flay saying if the
> >> >> > consistency
> >> >> > level was met, or to what level it was met (number of nodes
> answering
> >> >> > for
> >> >> > instance).?
> >> >> > Terje
> >> >> >
> >> >> > On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <jbellis@gmail.com
> >
> >> >> > wrote:
> >> >> >>
> >> >> >> They will timeout until failure detector realizes the DC1 nodes
> are
> >> >> >> down (~10 seconds). After that they will immediately return
> >> >> >> UnavailableException until DC1 comes back up.
> >> >> >>
> >> >> >> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
> >> >> >> <ba...@gmail.com> wrote:
> >> >> >> > We are planning to deploy Cassandra on two data centers.   Let
> us
> >> >> >> > say
> >> >> >> > that
> >> >> >> > we went with three replicas with 2 being in one data center and
> >> >> >> > last
> >> >> >> > replica
> >> >> >> > in 2nd Data center.
> >> >> >> >
> >> >> >> > What will happen to Quorum Reads and Writes when DC1 goes down
> (2
> >> >> >> > of
> >> >> >> > 3
> >> >> >> > replicas are unreachable)?  Will they timeout?
> >> >> >> >
> >> >> >> >
> >> >> >> > Regards,
> >> >> >> > Baskar
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Jonathan Ellis
> >> >> >> Project Chair, Apache Cassandra
> >> >> >> co-founder of DataStax, the source for professional Cassandra
> >> >> >> support
> >> >> >> http://www.datastax.com
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>

Re: Multi-DC Deployment

Posted by Adrian Cockcroft <ad...@gmail.com>.

Queues replicate bad data just as well as anything else. The biggest
source of bad data is broken app code... You will still need to
implement a reconciliation/repair checker, as queues have their own
failure modes when they get backed up. We have also looked at using
queues to bounce data between cassandra clusters for other reasons,
and they have their place. However it is a lot more work to implement
than using existing well tested Cassandra functionality to do it for
us.

I think your code needs to retry a failed local-quorum read with a
read-one to get the behavior you are asking for.

Our approach to bad data and corruption issues is backups, wind back
to the last good snapshot. We have figured out incremental backups as
well as full. Our code has some local dependencies, but could be the
basis for a generic solution.

Adrian

On Wed, Apr 20, 2011 at 6:08 PM, Terje Marthinussen
<tm...@gmail.com> wrote:
> Assuming that you generally put an API on top of this, delivering to two or
> more systems then boils down to a message queue issue or some similar
> mechanism which handles secure delivery of messages. Maybe not trivial, but
> there are many products that can help you with this, and it is a lot easier
> to implement than a fully distributed storage system.
> Yes, ideally Cassandra will not distribute corruption, but the reason you
> pay up to have 2 fully redundant setups in 2 different datacenters is
> because we do not live in an ideal world. Anyone having tested Cassandra
> since 0.7.0 with any real data will be able to testify how well it can mess
> things up.
> This is not specific to Cassandra, in fact, I would argue thats this is in
> the blood of any distributed system. You want them to distribute after all
> and the tighter the coupling is between nodes, the better they distribute
> bad stuff as well as good stuff.
> There is a bigger risk for a complete failure with 2 tightly coupled
> redundant systems than with 2 almost completely isolated ones. The logic
> here is so simple it is really somewhat beyond discussion.
> There are a few other advantages of isolating the systems. Especially in
> terms of operation, 2 isolated systems would be much easier as you could
> relatively risk fee try out a new cassandra in one datacenter or upgrade one
> datacenter at a time if you needed major operational changes such as schema
> changes or other large changes to the data.
> I see the 2 copies in one datacenters + 1(or maybe 2) in another as a "low
> cost" middleway between 2 full N+2 (RF=3) systems in both data centers.
> That is, in a traditional design where you need 1 node for normal service,
> you would have 1 extra replicate for redundancy and one replica more (N+2
> redundancy) so you can do maintenance and still be redundant.
> If I have redundancy across datacenters, I would probably still want 2
> replicas to avoid network traffic between DCs in case of a node recovery,
> but N+2 may not be needed as my risk policy may find it acceptable to run
> one datacenters without redundancy for a time limited period for
> maintenance.
> That is, if my original requirement is 1 node, I could do with 3x the HW
> which is not all that much more than the 3x I need for one DC and a lot less
> than the 6x I need for 2 full N+2 systems.
> However, all of the above is really beyond the point of my original
> suggestion.
> Regardless of datacenters, redundancy and distribution of bad or good stuff,
> it would be good to have a way to return whatever data is there, but with a
> flag or similar stating that the consistency level was not met.
> Again, for a lot of services, it is fully acceptable, and a lot better, to
> return an almost complete (or maybe even complete, but no verified by
> quorum) result than no result at all.
> As far as I remember from the code, this just boils down to returning
> whatever you collected from the cluster and setting the proper flag or
> similar on the resultset rather than returning an error.
> Terje
> On Thu, Apr 21, 2011 at 5:01 AM, Adrian Cockcroft
> <ad...@gmail.com> wrote:
>>
>> Hi Terje,
>>
>> If you feed data to two rings, you will get inconsistency drift as an
>> update to one succeeds and to the other fails from time to time. You
>> would have to build your own read repair. This all starts to look like
>> "I don't trust Cassandra code to work, so I will write my own buggy
>> one off versions of Cassandra functionality". I lean towards using
>> Cassandra features rather than rolling my own because there is a large
>> community testing, fixing and extending Cassandra, and making sure
>> that the algorithms are robust. Distributed systems are very hard to
>> get right, I trust lots of users and eyeballs on the code more than
>> even the best engineer working alone.
>>
>> Cassandra doesn't "replicate sstable corruptions". It detects corrupt
>> data and only replicates good data. Also data isn't replicated to
>> three identical nodes in the way you imply, it's replicated around the
>> ring. If you lose three nodes, you don't lose a whole node's worth of
>> data.  We configure each replica to be in a different availability
>> zone so that we can lose a third of our nodes (a whole zone) and still
>> work. On a 300 node system with RF=3 and no zones, losing one or two
>> nodes you still have all your data, and can repair the loss quickly.
>> With three nodes dead at once you don't lose 1% of the data (3/300) I
>> think you lose 1/(300*300*300) of the data (someone check my math?).
>>
>> If you want to always get a result, then you use "read one", if you
>> want to get a highly available better quality result use local quorum.
>> That is a per-query option.
>>
>> Adrian
>>
>> On Tue, Apr 19, 2011 at 6:46 PM, Terje Marthinussen
>> <tm...@gmail.com> wrote:
>> > If you have RF=3 in both datacenters, it could be discussed if there is
>> > a
>> > point to use the built in replication in Cassandra at all vs. feeding
>> > the
>> > data to both datacenters and get 2 100% isolated cassandra instances
>> > that
>> > cannot replicate sstable corruptions between each others....
>> > My point is really a bit more general though.
>> > For a lot services (especially Internet based ones) 100% accuracy in
>> > terms
>> > of results is not needed (or maybe even expected)
>> > While you want to serve a 100% correct result if you can (using quorum),
>> > it
>> > is still much better to serve a partial result than no result at all.
>> > Lets say you have 300 nodes in your ring, one document manages to
>> > trigger a
>> > bug in cassandra that brings down a node with all its replicas (3 nodes
>> > down)
>> > For many use cases, it would be much better to return the remaining 99%
>> > of
>> > the data coming from the 297 working nodes than having a service which
>> > returns nothing at all.
>> > I would however like the frontend to realize that this is an incomplete
>> > result so it is possible for it to react accordingly as well as be part
>> > of
>> > monitoring of the cassandra ring.
>> > Regards,
>> > Terje
>> >
>> > On Tue, Apr 19, 2011 at 6:06 PM, Adrian Cockcroft
>> > <ad...@gmail.com> wrote:
>> >>
>> >> If you want to use local quorum for a distributed setup, it doesn't
>> >> make sense to have less than RF=3 local and remote. Three copies at
>> >> both ends will give you high availability. Only one copy of the data
>> >> is sent over the wide area link (with recent versions).
>> >>
>> >> There is no need to use mirrored or RAID5 disk in each node in this
>> >> case, since you are using RAIN (N for nodes) to protect your data. So
>> >> the extra disk space to hold three copies at each end shouldn't be a
>> >> big deal. Netflix is using striped internal disks on EC2 nodes for
>> >> this.
>> >>
>> >> Adrian
>> >>
>> >> On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
>> >> <tm...@gmail.com> wrote:
>> >> > Hum...
>> >> > Seems like it could be an idea in a case like this with a mode where
>> >> > result
>> >> > is always returned (if possible), but where a flay saying if the
>> >> > consistency
>> >> > level was met, or to what level it was met (number of nodes answering
>> >> > for
>> >> > instance).?
>> >> > Terje
>> >> >
>> >> > On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <jb...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> They will timeout until failure detector realizes the DC1 nodes are
>> >> >> down (~10 seconds). After that they will immediately return
>> >> >> UnavailableException until DC1 comes back up.
>> >> >>
>> >> >> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
>> >> >> <ba...@gmail.com> wrote:
>> >> >> > We are planning to deploy Cassandra on two data centers.   Let us
>> >> >> > say
>> >> >> > that
>> >> >> > we went with three replicas with 2 being in one data center and
>> >> >> > last
>> >> >> > replica
>> >> >> > in 2nd Data center.
>> >> >> >
>> >> >> > What will happen to Quorum Reads and Writes when DC1 goes down (2
>> >> >> > of
>> >> >> > 3
>> >> >> > replicas are unreachable)?  Will they timeout?
>> >> >> >
>> >> >> >
>> >> >> > Regards,
>> >> >> > Baskar
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Jonathan Ellis
>> >> >> Project Chair, Apache Cassandra
>> >> >> co-founder of DataStax, the source for professional Cassandra
>> >> >> support
>> >> >> http://www.datastax.com
>> >> >
>> >> >
>> >
>> >
>
>

Re: Multi-DC Deployment

Posted by Terje Marthinussen <tm...@gmail.com>.

Assuming that you generally put an API on top of this, delivering to two or
more systems then boils down to a message queue issue or some similar
mechanism which handles secure delivery of messages. Maybe not trivial, but
there are many products that can help you with this, and it is a lot easier
to implement than a fully distributed storage system.

Yes, ideally Cassandra will not distribute corruption, but the reason you
pay up to have 2 fully redundant setups in 2 different datacenters is
because we do not live in an ideal world. Anyone having tested Cassandra
since 0.7.0 with any real data will be able to testify how well it can mess
things up.

This is not specific to Cassandra, in fact, I would argue thats this is in
the blood of any distributed system. You want them to distribute after all
and the tighter the coupling is between nodes, the better they distribute
bad stuff as well as good stuff.

There is a bigger risk for a complete failure with 2 tightly coupled
redundant systems than with 2 almost completely isolated ones. The logic
here is so simple it is really somewhat beyond discussion.

There are a few other advantages of isolating the systems. Especially in
terms of operation, 2 isolated systems would be much easier as you could
relatively risk fee try out a new cassandra in one datacenter or upgrade one
datacenter at a time if you needed major operational changes such as schema
changes or other large changes to the data.

I see the 2 copies in one datacenters + 1(or maybe 2) in another as a "low
cost" middleway between 2 full N+2 (RF=3) systems in both data centers.

That is, in a traditional design where you need 1 node for normal service,
you would have 1 extra replicate for redundancy and one replica more (N+2
redundancy) so you can do maintenance and still be redundant.

If I have redundancy across datacenters, I would probably still want 2
replicas to avoid network traffic between DCs in case of a node recovery,
but N+2 may not be needed as my risk policy may find it acceptable to run
one datacenters without redundancy for a time limited period for
maintenance.

That is, if my original requirement is 1 node, I could do with 3x the HW
which is not all that much more than the 3x I need for one DC and a lot less
than the 6x I need for 2 full N+2 systems.

However, all of the above is really beyond the point of my original
suggestion.

Regardless of datacenters, redundancy and distribution of bad or good stuff,
it would be good to have a way to return whatever data is there, but with a
flag or similar stating that the consistency level was not met.

Again, for a lot of services, it is fully acceptable, and a lot better, to
return an almost complete (or maybe even complete, but no verified by
quorum) result than no result at all.

As far as I remember from the code, this just boils down to returning
whatever you collected from the cluster and setting the proper flag or
similar on the resultset rather than returning an error.

Terje

On Thu, Apr 21, 2011 at 5:01 AM, Adrian Cockcroft <
adrian.cockcroft@gmail.com> wrote:

> Hi Terje,
>
> If you feed data to two rings, you will get inconsistency drift as an
> update to one succeeds and to the other fails from time to time. You
> would have to build your own read repair. This all starts to look like
> "I don't trust Cassandra code to work, so I will write my own buggy
> one off versions of Cassandra functionality". I lean towards using
> Cassandra features rather than rolling my own because there is a large
> community testing, fixing and extending Cassandra, and making sure
> that the algorithms are robust. Distributed systems are very hard to
> get right, I trust lots of users and eyeballs on the code more than
> even the best engineer working alone.
>
> Cassandra doesn't "replicate sstable corruptions". It detects corrupt
> data and only replicates good data. Also data isn't replicated to
> three identical nodes in the way you imply, it's replicated around the
> ring. If you lose three nodes, you don't lose a whole node's worth of
> data.  We configure each replica to be in a different availability
> zone so that we can lose a third of our nodes (a whole zone) and still
> work. On a 300 node system with RF=3 and no zones, losing one or two
> nodes you still have all your data, and can repair the loss quickly.
> With three nodes dead at once you don't lose 1% of the data (3/300) I
> think you lose 1/(300*300*300) of the data (someone check my math?).
>
> If you want to always get a result, then you use "read one", if you
> want to get a highly available better quality result use local quorum.
> That is a per-query option.
>
> Adrian
>
> On Tue, Apr 19, 2011 at 6:46 PM, Terje Marthinussen
> <tm...@gmail.com> wrote:
> > If you have RF=3 in both datacenters, it could be discussed if there is a
> > point to use the built in replication in Cassandra at all vs. feeding the
> > data to both datacenters and get 2 100% isolated cassandra instances that
> > cannot replicate sstable corruptions between each others....
> > My point is really a bit more general though.
> > For a lot services (especially Internet based ones) 100% accuracy in
> terms
> > of results is not needed (or maybe even expected)
> > While you want to serve a 100% correct result if you can (using quorum),
> it
> > is still much better to serve a partial result than no result at all.
> > Lets say you have 300 nodes in your ring, one document manages to trigger
> a
> > bug in cassandra that brings down a node with all its replicas (3 nodes
> > down)
> > For many use cases, it would be much better to return the remaining 99%
> of
> > the data coming from the 297 working nodes than having a service which
> > returns nothing at all.
> > I would however like the frontend to realize that this is an incomplete
> > result so it is possible for it to react accordingly as well as be part
> of
> > monitoring of the cassandra ring.
> > Regards,
> > Terje
> >
> > On Tue, Apr 19, 2011 at 6:06 PM, Adrian Cockcroft
> > <ad...@gmail.com> wrote:
> >>
> >> If you want to use local quorum for a distributed setup, it doesn't
> >> make sense to have less than RF=3 local and remote. Three copies at
> >> both ends will give you high availability. Only one copy of the data
> >> is sent over the wide area link (with recent versions).
> >>
> >> There is no need to use mirrored or RAID5 disk in each node in this
> >> case, since you are using RAIN (N for nodes) to protect your data. So
> >> the extra disk space to hold three copies at each end shouldn't be a
> >> big deal. Netflix is using striped internal disks on EC2 nodes for
> >> this.
> >>
> >> Adrian
> >>
> >> On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
> >> <tm...@gmail.com> wrote:
> >> > Hum...
> >> > Seems like it could be an idea in a case like this with a mode where
> >> > result
> >> > is always returned (if possible), but where a flay saying if the
> >> > consistency
> >> > level was met, or to what level it was met (number of nodes answering
> >> > for
> >> > instance).?
> >> > Terje
> >> >
> >> > On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <jb...@gmail.com>
> >> > wrote:
> >> >>
> >> >> They will timeout until failure detector realizes the DC1 nodes are
> >> >> down (~10 seconds). After that they will immediately return
> >> >> UnavailableException until DC1 comes back up.
> >> >>
> >> >> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
> >> >> <ba...@gmail.com> wrote:
> >> >> > We are planning to deploy Cassandra on two data centers.   Let us
> say
> >> >> > that
> >> >> > we went with three replicas with 2 being in one data center and
> last
> >> >> > replica
> >> >> > in 2nd Data center.
> >> >> >
> >> >> > What will happen to Quorum Reads and Writes when DC1 goes down (2
> of
> >> >> > 3
> >> >> > replicas are unreachable)?  Will they timeout?
> >> >> >
> >> >> >
> >> >> > Regards,
> >> >> > Baskar
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jonathan Ellis
> >> >> Project Chair, Apache Cassandra
> >> >> co-founder of DataStax, the source for professional Cassandra support
> >> >> http://www.datastax.com
> >> >
> >> >
> >
> >
>

Re: Multi-DC Deployment

Posted by Peter Schuller <pe...@infidyne.com>.

> Cassandra doesn't "replicate sstable corruptions". It detects corrupt
> data and only replicates good data.

This is incorrect. Depending on the nature of the corruption it may
spread to other nodes. Checksumming (done right) would be a great
addition to alleiate this. Yes, there is code that tries to skip rows
that are obviously bad, but "true" integrity checking is not supported
at this time.

-- 
/ Peter Schuller

Re: Multi-DC Deployment

Posted by Adrian Cockcroft <ad...@gmail.com>.

Hi Terje,

If you feed data to two rings, you will get inconsistency drift as an
update to one succeeds and to the other fails from time to time. You
would have to build your own read repair. This all starts to look like
"I don't trust Cassandra code to work, so I will write my own buggy
one off versions of Cassandra functionality". I lean towards using
Cassandra features rather than rolling my own because there is a large
community testing, fixing and extending Cassandra, and making sure
that the algorithms are robust. Distributed systems are very hard to
get right, I trust lots of users and eyeballs on the code more than
even the best engineer working alone.

Cassandra doesn't "replicate sstable corruptions". It detects corrupt
data and only replicates good data. Also data isn't replicated to
three identical nodes in the way you imply, it's replicated around the
ring. If you lose three nodes, you don't lose a whole node's worth of
data.  We configure each replica to be in a different availability
zone so that we can lose a third of our nodes (a whole zone) and still
work. On a 300 node system with RF=3 and no zones, losing one or two
nodes you still have all your data, and can repair the loss quickly.
With three nodes dead at once you don't lose 1% of the data (3/300) I
think you lose 1/(300*300*300) of the data (someone check my math?).

If you want to always get a result, then you use "read one", if you
want to get a highly available better quality result use local quorum.
That is a per-query option.

Adrian

On Tue, Apr 19, 2011 at 6:46 PM, Terje Marthinussen
<tm...@gmail.com> wrote:
> If you have RF=3 in both datacenters, it could be discussed if there is a
> point to use the built in replication in Cassandra at all vs. feeding the
> data to both datacenters and get 2 100% isolated cassandra instances that
> cannot replicate sstable corruptions between each others....
> My point is really a bit more general though.
> For a lot services (especially Internet based ones) 100% accuracy in terms
> of results is not needed (or maybe even expected)
> While you want to serve a 100% correct result if you can (using quorum), it
> is still much better to serve a partial result than no result at all.
> Lets say you have 300 nodes in your ring, one document manages to trigger a
> bug in cassandra that brings down a node with all its replicas (3 nodes
> down)
> For many use cases, it would be much better to return the remaining 99% of
> the data coming from the 297 working nodes than having a service which
> returns nothing at all.
> I would however like the frontend to realize that this is an incomplete
> result so it is possible for it to react accordingly as well as be part of
> monitoring of the cassandra ring.
> Regards,
> Terje
>
> On Tue, Apr 19, 2011 at 6:06 PM, Adrian Cockcroft
> <ad...@gmail.com> wrote:
>>
>> If you want to use local quorum for a distributed setup, it doesn't
>> make sense to have less than RF=3 local and remote. Three copies at
>> both ends will give you high availability. Only one copy of the data
>> is sent over the wide area link (with recent versions).
>>
>> There is no need to use mirrored or RAID5 disk in each node in this
>> case, since you are using RAIN (N for nodes) to protect your data. So
>> the extra disk space to hold three copies at each end shouldn't be a
>> big deal. Netflix is using striped internal disks on EC2 nodes for
>> this.
>>
>> Adrian
>>
>> On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
>> <tm...@gmail.com> wrote:
>> > Hum...
>> > Seems like it could be an idea in a case like this with a mode where
>> > result
>> > is always returned (if possible), but where a flay saying if the
>> > consistency
>> > level was met, or to what level it was met (number of nodes answering
>> > for
>> > instance).?
>> > Terje
>> >
>> > On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <jb...@gmail.com>
>> > wrote:
>> >>
>> >> They will timeout until failure detector realizes the DC1 nodes are
>> >> down (~10 seconds). After that they will immediately return
>> >> UnavailableException until DC1 comes back up.
>> >>
>> >> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
>> >> <ba...@gmail.com> wrote:
>> >> > We are planning to deploy Cassandra on two data centers.   Let us say
>> >> > that
>> >> > we went with three replicas with 2 being in one data center and last
>> >> > replica
>> >> > in 2nd Data center.
>> >> >
>> >> > What will happen to Quorum Reads and Writes when DC1 goes down (2 of
>> >> > 3
>> >> > replicas are unreachable)?  Will they timeout?
>> >> >
>> >> >
>> >> > Regards,
>> >> > Baskar
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder of DataStax, the source for professional Cassandra support
>> >> http://www.datastax.com
>> >
>> >
>
>

Re: Multi-DC Deployment

Posted by Terje Marthinussen <tm...@gmail.com>.

If you have RF=3 in both datacenters, it could be discussed if there is a
point to use the built in replication in Cassandra at all vs. feeding the
data to both datacenters and get 2 100% isolated cassandra instances that
cannot replicate sstable corruptions between each others....

My point is really a bit more general though.

For a lot services (especially Internet based ones) 100% accuracy in terms
of results is not needed (or maybe even expected)
While you want to serve a 100% correct result if you can (using quorum), it
is still much better to serve a partial result than no result at all.

Lets say you have 300 nodes in your ring, one document manages to trigger a
bug in cassandra that brings down a node with all its replicas (3 nodes
down)

For many use cases, it would be much better to return the remaining 99% of
the data coming from the 297 working nodes than having a service which
returns nothing at all.

I would however like the frontend to realize that this is an incomplete
result so it is possible for it to react accordingly as well as be part of
monitoring of the cassandra ring.

Regards,
Terje

On Tue, Apr 19, 2011 at 6:06 PM, Adrian Cockcroft <
adrian.cockcroft@gmail.com> wrote:

> If you want to use local quorum for a distributed setup, it doesn't
> make sense to have less than RF=3 local and remote. Three copies at
> both ends will give you high availability. Only one copy of the data
> is sent over the wide area link (with recent versions).
>
> There is no need to use mirrored or RAID5 disk in each node in this
> case, since you are using RAIN (N for nodes) to protect your data. So
> the extra disk space to hold three copies at each end shouldn't be a
> big deal. Netflix is using striped internal disks on EC2 nodes for
> this.
>
> Adrian
>
> On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
> <tm...@gmail.com> wrote:
> > Hum...
> > Seems like it could be an idea in a case like this with a mode where
> result
> > is always returned (if possible), but where a flay saying if the
> consistency
> > level was met, or to what level it was met (number of nodes answering for
> > instance).?
> > Terje
> >
> > On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> They will timeout until failure detector realizes the DC1 nodes are
> >> down (~10 seconds). After that they will immediately return
> >> UnavailableException until DC1 comes back up.
> >>
> >> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
> >> <ba...@gmail.com> wrote:
> >> > We are planning to deploy Cassandra on two data centers.   Let us say
> >> > that
> >> > we went with three replicas with 2 being in one data center and last
> >> > replica
> >> > in 2nd Data center.
> >> >
> >> > What will happen to Quorum Reads and Writes when DC1 goes down (2 of 3
> >> > replicas are unreachable)?  Will they timeout?
> >> >
> >> >
> >> > Regards,
> >> > Baskar
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of DataStax, the source for professional Cassandra support
> >> http://www.datastax.com
> >
> >
>

Re: Multi-DC Deployment

Posted by Adrian Cockcroft <ad...@gmail.com>.

If you want to use local quorum for a distributed setup, it doesn't
make sense to have less than RF=3 local and remote. Three copies at
both ends will give you high availability. Only one copy of the data
is sent over the wide area link (with recent versions).

There is no need to use mirrored or RAID5 disk in each node in this
case, since you are using RAIN (N for nodes) to protect your data. So
the extra disk space to hold three copies at each end shouldn't be a
big deal. Netflix is using striped internal disks on EC2 nodes for
this.

Adrian

On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
<tm...@gmail.com> wrote:
> Hum...
> Seems like it could be an idea in a case like this with a mode where result
> is always returned (if possible), but where a flay saying if the consistency
> level was met, or to what level it was met (number of nodes answering for
> instance).?
> Terje
>
> On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> They will timeout until failure detector realizes the DC1 nodes are
>> down (~10 seconds). After that they will immediately return
>> UnavailableException until DC1 comes back up.
>>
>> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
>> <ba...@gmail.com> wrote:
>> > We are planning to deploy Cassandra on two data centers.   Let us say
>> > that
>> > we went with three replicas with 2 being in one data center and last
>> > replica
>> > in 2nd Data center.
>> >
>> > What will happen to Quorum Reads and Writes when DC1 goes down (2 of 3
>> > replicas are unreachable)?  Will they timeout?
>> >
>> >
>> > Regards,
>> > Baskar
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
>

Re: Multi-DC Deployment

Posted by Terje Marthinussen <tm...@gmail.com>.

Hum...

Seems like it could be an idea in a case like this with a mode where result
is always returned (if possible), but where a flay saying if the consistency
level was met, or to what level it was met (number of nodes answering for
instance).?

Terje

On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> They will timeout until failure detector realizes the DC1 nodes are
> down (~10 seconds). After that they will immediately return
> UnavailableException until DC1 comes back up.
>
> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
> <ba...@gmail.com> wrote:
> > We are planning to deploy Cassandra on two data centers.   Let us say
> that
> > we went with three replicas with 2 being in one data center and last
> replica
> > in 2nd Data center.
> >
> > What will happen to Quorum Reads and Writes when DC1 goes down (2 of 3
> > replicas are unreachable)?  Will they timeout?
> >
> >
> > Regards,
> > Baskar
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: Multi-DC Deployment

Posted by Jonathan Ellis <jb...@gmail.com>.

They will timeout until failure detector realizes the DC1 nodes are
down (~10 seconds). After that they will immediately return
UnavailableException until DC1 comes back up.

On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
<ba...@gmail.com> wrote:
> We are planning to deploy Cassandra on two data centers.   Let us say that
> we went with three replicas with 2 being in one data center and last replica
> in 2nd Data center.
>
> What will happen to Quorum Reads and Writes when DC1 goes down (2 of 3
> replicas are unreachable)?  Will they timeout?
>
>
> Regards,
> Baskar



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com