Posted to user@cassandra.apache.org by Les Hazlewood <le...@katasoft.com> on 2011/06/22 23:24:13 UTC

99.999% uptime - Operations Best Practices?

I'm planning on using Cassandra as a product's core data store, and it is
imperative that it never goes down or loses data, even in the event of a
data center failure.  This uptime requirement ("five nines": 99.999% uptime)
w/ WAN capabilities is largely what led me to choose Cassandra over other
NoSQL products, given its history and 'from the ground up' design for such
operational benefits.

However, in a recent thread, a user indicated that all 4 of his 4 Cassandra
instances were down because the OS killed the Java processes due to memory
starvation, and all 4 instances went down within a relatively short time of
each other.  Another user helped out and replied that running 0.8 and
running nodetool repair on each node regularly via a cron job (once a day?)
seems to work for him.
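
For reference, the kind of cron entry being described might look something
like this (illustrative only - paths vary by install, the schedule should be
staggered across nodes, and the interval needs to stay within
gc_grace_seconds):

  # run an anti-entropy repair nightly at 03:00 on this node
  0 3 * * *  /opt/cassandra/bin/nodetool -h localhost repair >> /var/log/cassandra/repair.log 2>&1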

Naturally this was disconcerting to read, given our needs for a Highly
Available product - we'd be royally screwed if this ever happened to us.
 But given Cassandra's history and its current production use, I'm aware
that this HA/uptime is being achieved today, and I believe it is certainly
achievable.

So, is there a collective set of guidelines or best practices to ensure this
problem (or unavailability due to OOM) can be easily managed?

Things like memory settings, initial GC recommendations, cron
recommendations, ulimit settings, etc. that can be bundled up as a
best-practices "Production Kickstart"?

Could anyone share their nuggets of wisdom or point me to resources where
this may already exist?

Thanks!

Best regards,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Nate McCall <na...@datastax.com>.
As an additional concrete detail to Edward's response, 'result
pinning' can provide some performance improvements depending on
topology and workload. See the conf file comments for details:
https://github.com/apache/cassandra/blob/cassandra-0.8.0/conf/cassandra.yaml#L308-315
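
If memory serves, the setting those comment lines describe is
dynamic_snitch_badness_threshold; roughly (value illustrative, see the
linked file for the authoritative comments):

  # cassandra.yaml (0.8): a value greater than zero lets the dynamic snitch
  # keep 'pinning' reads to the same replica (better cache hit rates) until
  # that replica scores this much worse than the best alternative
  dynamic_snitch_badness_threshold: 0.1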

I would also advise taking the time to experiment with consistency
levels (particularly in a multi-DC setup) and their effect on response
times, and weighing those against your consistency requirements.

For the record, any performance twiddling will only provide useful
results when comparable metrics are available for a similar workload
(Les, it appears you have a good grasp of this already - just wanted
to re-iterate).

Re: 99.999% uptime - Operations Best Practices?

Posted by Chris Burroughs <ch...@gmail.com>.
On 06/23/2011 01:56 PM, Les Hazlewood wrote:
> Is there a roadmap or time to 1.0?  Even a ballpark time (e.g next year 3rd
> quarter, end of year, etc) would be great as it would help me understand
> where it may lie in relation to my production rollout.


The C* devs are rather strongly inclined against putting too much
meaning in version numbers.  The next major release might be called 1.0.
Or maybe it won't.  Either way, the code and the support will be no
different from something called 0.9 or 10.0.

September 8th is the feature freeze for the next major release.

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
On Thu, Jun 23, 2011 at 5:59 AM, Dominic Williams
<dw...@system7.co.uk> wrote:

> Cassandra is a good system, but it has not reached version 1.0 yet, nor has
> HBase etc. It is cutting edge technology and therefore in practice you are
> unlikely to achieve five nines immediately - even if in theory with perfect
> planning, perfect administration and so on, this should be achievable even
> now.
>

Yep, this is a totally fair (and appreciated) point.


>
> The reasons you might choose Cassandra are:-
> 1. New more flexible data model that may increase developer productivity
> and lead to fast release cycle
>

Do you mean new to the developer?  Or a new feature in Cassandra (e.g.
something added in 0.8)?


> 2. Superior capability as concerns being able to *write* large volumes of
> data, which is incredibly useful in many applications
>

Yep, this is obviously valuable for data crunching, analytics, reporting,
etc.  But how often is Cassandra used in 'read mostly' use cases?


> 3. Horizontal scalability, where you can add nodes rather than buying
> bigger machines
> 4. Data redundancy, which means you have a kind of live backup going on a
> bit like RAID - we use replication factor 3 for example
> 5. Due to the redundancy of data across the cluster, the ability to perform
> rolling restarts to administer and upgrade your nodes while the cluster
> continues to run (yes, this is the feature that in theory allows for
> continual operation, but in practice until we reach 1.0 I don't think five
> nines of uptime is always possible in every scenario yet because of
> deficiencies that may present themselves unexpectedly)
>

Is there a roadmap or time to 1.0?  Even a ballpark time (e.g. next year 3rd
quarter, end of year, etc.) would be great, as it would help me understand
where it may lie in relation to my production rollout.


> 6. The benefit of building your new product on a platform designed to solve
> many modern computing challenges that will give you a better upgrade path
> e.g. for example in future when you grow you won't have to change over from
> SQL to NoSQL because you're already on it!
>

Indeed, this is why we're evaluating Cassandra and a small number of others.

To this end, how often do people use Cassandra today as their primary
(only?) data store, as a full replacement for MySQL/Oracle/Postgres?  I
understand there are many use cases for Cassandra in special roles and in
addition to a SQL data store.  But is it used frequently enough as a total
replacement?

There's no right/wrong answer for me - I'm just trying to get a feel for the
"we use it a lot for data mining and analytics and for 25% of our OLTP
needs" vs "we use it for all of our app's primary needs".  I'm just curious.

Thoughts?

Thanks!

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Dominic Williams <dw...@system7.co.uk>.
Les,

Cassandra is a good system, but it has not reached version 1.0 yet, nor has
HBase etc. It is cutting edge technology and therefore in practice you are
unlikely to achieve five nines immediately - even if in theory with perfect
planning, perfect administration and so on, this should be achievable even
now.

The reasons you might choose Cassandra are:-
1. A new, more flexible data model that may increase developer productivity
and lead to a faster release cycle
2. Superior capability for *writing* large volumes of data, which is
incredibly useful in many applications
3. Horizontal scalability, where you can add nodes rather than buying bigger
machines
4. Data redundancy, which means you have a kind of live backup going on a
bit like RAID - we use replication factor 3 for example
5. Due to the redundancy of data across the cluster, the ability to perform
rolling restarts to administer and upgrade your nodes while the cluster
continues to run (yes, this is the feature that in theory allows for
continual operation, but in practice until we reach 1.0 I don't think five
nines of uptime is always possible in every scenario yet because of
deficiencies that may present themselves unexpectedly)
6. The benefit of building your new product on a platform designed to solve
many modern computing challenges, which will give you a better upgrade path:
in the future when you grow you won't have to change over from SQL to NoSQL
because you're already on it!

These are pretty compelling arguments, but you have to be realistic about
where Cassandra is right now. For what it's worth though, you might also
consider how easy it is to screw up databases running on commercial
production software that are handling very large amounts of data (just let
the volumes handling the commit log run short of disk space for example).
Setting up a Cassandra cluster is the simplest way to handle big data I've
seen and this reduction in complexity will also contribute to uptime.

Best, Dominic

On 22 June 2011 22:24, Les Hazlewood <le...@katasoft.com> wrote:

> I'm planning on using Cassandra as a product's core data store, and it is
> imperative that it never goes down or loses data, even in the event of a
> data center failure.  This uptime requirement ("five nines": 99.999% uptime)
> w/ WAN capabilities is largely what led me to choose Cassandra over other
> NoSQL products, given its history and 'from the ground up' design for such
> operational benefits.
>
> However, in a recent thread, a user indicated that all 4 of 4 of his
> Cassandra instances were down because the OS killed the Java processes due
> to memory starvation, and all 4 instances went down in a relatively short
> period of time of each other.  Another user helped out and replied that
> running 0.8 and nodetool repair on each node regularly via a cron job (once
> a day?) seems to work for him.
>
> Naturally this was disconcerting to read, given our needs for a Highly
> Available product - we'd be royally screwed if this ever happened to us.
>  But given Cassandra's history and it's current production use, I'm aware
> that this HA/uptime is being achieved today, and I believe it is certainly
> achievable.
>
> So, is there a collective set of guidelines or best practices to ensure
> this problem (or unavailability due to OOM) can be easily managed?
>
> Things like memory settings, initial GC recommendations, cron
> recommendations, ulimit settings, etc. that can be bundled up as a
> best-practices "Production Kickstart"?
>
> Could anyone share their nuggets of wisdom or point me to resources where
> this may already exist?
>
> Thanks!
>
> Best regards,
>
> Les
>

Re: 99.999% uptime - Operations Best Practices?

Posted by William Oberman <ob...@civicscience.com>.
Attached are the day and week views of one of the cassandra boxes.  I think
it's obvious where the OOM happened :-)

To Thoku's points:
-I'm running with Sun, though I had to explicitly list the RMI hostname (==
the IP address) to allow JMX to work (see the sketch after this list)
-I didn't install JNA, should I be worried? :-)
-Amazon larges don't have swap, and I didn't explicitly enable it
-I don't run nodetool repair regularly, but my system is append only, and
the docs seem to indicate repair was for cleaning up deletes
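
The RMI bit, sketched (illustrative only; 10.1.2.3 stands in for the node's
reachable address, and the line goes in cassandra-env.sh):

  # make JMX/RMI advertise an address that remote clients can actually reach
  JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=10.1.2.3"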

Other things:
-Connecting Jconsole through a firewall is basically impossible.  I finally
had to install a VPN (I used hamachi).
-I use nagios for alerts
-I'm writing my backup procedure now, but my approach is (for AWS):
  nodetool snapshot
  if (ebs.size < snapshot.size): if ebs is null, create ebs at 2x size; else resize volume to 2x size
  rsync snapshot to ebs
  take aws snapshot of ebs
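
A rough shell version of the above (a sketch only: it assumes the EBS volume
is already created and mounted at /mnt/backup, that the data directory is
the default, and it leaves volume sizing and the EBS snapshot itself to the
EC2 API or console):

  nodetool -h localhost snapshot
  rsync -aR /var/lib/cassandra/data/*/snapshots/ /mnt/backup/
  # then trigger an EBS snapshot of the backup volume via the EC2 tooling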

I'm sure I'll think of more stuff later.

will

On Wed, Jun 22, 2011 at 8:17 PM, Les Hazlewood <le...@katasoft.com> wrote:

> Hi Scott,
>
> First, let me say that this email was amazing - I'm always appreciative of
> the time that anyone puts into mailing list replies, especially ones as
> thorough, well-thought and articulated as this one.  I'm a firm believer
> that these types of replies reflect a strong and durable open-source
> community.  You, sir, are a bad ass :)  Thanks so much!
>
> As for the '5 9s' comment, I apologize for even writing that - it threw
> everyone off.  It was a shorthand way of saying "this data store is so
> critical to the product, that if it ever goes down entirely (as it did for
> one user of 4 nodes, all at the same time), then we're screwed."  I was
> hoping to trigger the 'hrm - what have we done ourselves to work to that
> availability that wasn't easily represented in the documentation' train of
> thought.  It proved to be a red herring however, so I apologize for even
> bringing it up.
>
> Thanks *very* much for the reply.  I'll be sure to follow up with the list
> as I come across any particular issues and I'll also report my own findings
> in the interest of (hopefully) being beneficial to anyone in the future.
>
> Cheers,
>
> Les
>
>
> On Wed, Jun 22, 2011 at 4:58 PM, C. Scott Andreas <
> cscotta@urbanairship.com> wrote:
>
>> Hi Les,
>>
>> I wanted to offer a couple thoughts on where to start and strategies for
>> approaching development and deployment with reliability in mind.
>>
>> One way that we've found to more productively think about the reliability
>> of our data tier is to focus our thoughts away from a concept of "uptime or
>> *x* nines" toward one of "error rates." Ryan mentioned that "it depends,"
>> and while brief, this is actually a very correct comment. Perhaps I can help
>> elaborate.
>>
>> Failures in systems distributed across multiple systems in multiple
>> datacenters can rarely be described in terms of binary uptime guarantees
>> (e.g., either everything is up or everything is down). Instead, certain
>> nodes may be unavailable at certain times, but given appropriate read and
>> write parameters (and their implicit tradeoffs), these service interruptions
>> may remain transparent.
>>
>> Cassandra provides a variety of tools to allow you to tune these, two of
>> the most important of which are the consistency level for reads and writes
>> and your replication factor. I'm sure you're  familiar with these, but
>> mention them because thinking hard about the tradeoffs you're willing to
>> make in terms of consistency and replication may heavily impact your
>> operational experience if availability is of utmost importance.
>>
>> Of course, the single-node operational story is very important as well.
>> Ryan's "it depends" comment here takes on painful significance for myself,
>> as we've found that the manner in which read and write loads vary, their
>> duration, and intensity can have very different operational profiles and
>> failure modes. If relaxed consistency is acceptable for your reads and
>> writes, you'll likely find querying with CL.ONE to be more "available" than
>> QUROUM or ALL, at the cost of reduced consistency. Similarly, if it is
>> economical for you to provision extra nodes for a higher replication factor,
>> you will increase your ability to continue reading and writing in the event
>> of single- or multiple-node failures.
>>
>> One of the prime challenges we've faced is reducing the frequency and
>> intensity of full garbage collections in the JVM, which tend to result in
>> single-node unavailability. Thanks to help from Jonathan Ellis and Peter
>> Schuller (along with a fair amount of elbow grease ourselves), we've worked
>> through several of these issues and have arrived at a steady state that
>> leaves the ring happy even under load. We've not found GC tuning to bring
>> night-and-day differences outside of resolving the STW collections, but the
>> difference is noticeable.
>>
>> Occasionally, these issues will result from Cassandra's behavior itself;
>> documented APIs such as querying for the count of all columns associated
>> with a key will materialize the row across all nodes being queried. Once
>> when issuing a "count" query for a key that had around 300k columns at
>> CL.QUORUM, we knocked three nodes out of our ring by triggering a
>> stop-the-world collection that lasted about 30 seconds, so watch out for
>> things like that.
>>
>> Some of the other tuning knobs available to you involve tradeoffs such as
>> when to flush memtables or to trigger compactions, both of which are
>> somewhat intensive operations that can strain a cluster under heavy read or
>> write load, but which are equally necessary for the cluster to remain in
>> operation. If you find yourself pushing hard against these tradeoffs and
>> attempting to navigate a path between icebergs, it's very likely that the
>> best answer to the problem is "more or more powerful hardware."
>>
>> But a lot of this is tacit knowledge, which often comes through a bit of
>> pain but is hopefully operationally transparent to your users.  Things that
>> you discover once the system is live in operation and your monitoring is
>> providing continuous feedback about the ring's health. This is where Sasha's
>> point becomes so critical -- having advanced early-warning systems in place,
>> watching monitoring and graphs closely even when everything's fine, and
>> beginning to understand how it *likes* to operate and what it tends to do
>> will give you a huge leg up on your reliability and allow you to react to
>> issues in the ring before they present operational impact.
>>
>> You mention that you've been building HA systems for a long time --
>> indeed, far longer than I have, so I'm sure that you're also aware that
>> good, solid "up/down" binaries are hard to come by, that none of this is
>> easy, and that while some pointers are available (the defaults are actually
>> quite good), it's essentially impossible to offer "the best production
>> defaults" because they vary wildly based on your hardware, ring
>> configuration, and read/write load and query patterns.
>>
>> To that end, you might find it more productive to begin with the defaults
>> as you develop your system, and let the ring tell you how it's feeling as
>> you begin load testing. Once you have stressed it to the point of failure,
>> you'll see how it failed and either be able to isolate the cause and begin
>> planning to handle that mode, or better yet, understand your maximum
>> capacity limits given your current hardware and fire off a purchase order
>> the second you see spikes nearing 80% of the total measured capacity in
>> production (or apply lessons you've learned in capacity planning as
>> appropriate, of course).
>>
>> Cassandra's a great system, but you may find that it requires a fair
>> amount of active operational involvement and monitoring -- like any
>> distributed system -- to maintain in a highly-reliable fashion. Each of
>> those nines implies extra time and operational cost, hopefully within the
>> boundaries of the revenue stream the system is expected to support.
>>
>> Pardon the long e-mail and for waxing a bit philosophical. I hope this
>> provides some food for thought.
>>
>> - Scott
>>
>> ---
>>
>> C. Scott Andreas
>> Engineer, Urban Airship, Inc.
>> http://www.urbanairship.com
>>
>


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
Hi Scott,

First, let me say that this email was amazing - I'm always appreciative of
the time that anyone puts into mailing list replies, especially ones as
thorough, well-thought and articulated as this one.  I'm a firm believer
that these types of replies reflect a strong and durable open-source
community.  You, sir, are a bad ass :)  Thanks so much!

As for the '5 9s' comment, I apologize for even writing that - it threw
everyone off.  It was a shorthand way of saying "this data store is so
critical to the product, that if it ever goes down entirely (as it did for
one user of 4 nodes, all at the same time), then we're screwed."  I was
hoping to trigger the 'hrm - what have we done ourselves to work to that
availability that wasn't easily represented in the documentation' train of
thought.  It proved to be a red herring however, so I apologize for even
bringing it up.

Thanks *very* much for the reply.  I'll be sure to follow up with the list
as I come across any particular issues and I'll also report my own findings
in the interest of (hopefully) being beneficial to anyone in the future.

Cheers,

Les

On Wed, Jun 22, 2011 at 4:58 PM, C. Scott Andreas
<cs...@urbanairship.com> wrote:

> Hi Les,
>
> I wanted to offer a couple thoughts on where to start and strategies for
> approaching development and deployment with reliability in mind.
>
> One way that we've found to more productively think about the reliability
> of our data tier is to focus our thoughts away from a concept of "uptime or
> *x* nines" toward one of "error rates." Ryan mentioned that "it depends,"
> and while brief, this is actually a very correct comment. Perhaps I can help
> elaborate.
>
> Failures in systems distributed across multiple systems in multiple
> datacenters can rarely be described in terms of binary uptime guarantees
> (e.g., either everything is up or everything is down). Instead, certain
> nodes may be unavailable at certain times, but given appropriate read and
> write parameters (and their implicit tradeoffs), these service interruptions
> may remain transparent.
>
> Cassandra provides a variety of tools to allow you to tune these, two of
> the most important of which are the consistency level for reads and writes
> and your replication factor. I'm sure you're  familiar with these, but
> mention them because thinking hard about the tradeoffs you're willing to
> make in terms of consistency and replication may heavily impact your
> operational experience if availability is of utmost importance.
>
> Of course, the single-node operational story is very important as well.
> Ryan's "it depends" comment here takes on painful significance for myself,
> as we've found that the manner in which read and write loads vary, their
> duration, and intensity can have very different operational profiles and
> failure modes. If relaxed consistency is acceptable for your reads and
> writes, you'll likely find querying with CL.ONE to be more "available" than
> QUROUM or ALL, at the cost of reduced consistency. Similarly, if it is
> economical for you to provision extra nodes for a higher replication factor,
> you will increase your ability to continue reading and writing in the event
> of single- or multiple-node failures.
>
> One of the prime challenges we've faced is reducing the frequency and
> intensity of full garbage collections in the JVM, which tend to result in
> single-node unavailability. Thanks to help from Jonathan Ellis and Peter
> Schuller (along with a fair amount of elbow grease ourselves), we've worked
> through several of these issues and have arrived at a steady state that
> leaves the ring happy even under load. We've not found GC tuning to bring
> night-and-day differences outside of resolving the STW collections, but the
> difference is noticeable.
>
> Occasionally, these issues will result from Cassandra's behavior itself;
> documented APIs such as querying for the count of all columns associated
> with a key will materialize the row across all nodes being queried. Once
> when issuing a "count" query for a key that had around 300k columns at
> CL.QUORUM, we knocked three nodes out of our ring by triggering a
> stop-the-world collection that lasted about 30 seconds, so watch out for
> things like that.
>
> Some of the other tuning knobs available to you involve tradeoffs such as
> when to flush memtables or to trigger compactions, both of which are
> somewhat intensive operations that can strain a cluster under heavy read or
> write load, but which are equally necessary for the cluster to remain in
> operation. If you find yourself pushing hard against these tradeoffs and
> attempting to navigate a path between icebergs, it's very likely that the
> best answer to the problem is "more or more powerful hardware."
>
> But a lot of this is tacit knowledge, which often comes through a bit of
> pain but is hopefully operationally transparent to your users.  Things that
> you discover once the system is live in operation and your monitoring is
> providing continuous feedback about the ring's health. This is where Sasha's
> point becomes so critical -- having advanced early-warning systems in place,
> watching monitoring and graphs closely even when everything's fine, and
> beginning to understand how it *likes* to operate and what it tends to do
> will give you a huge leg up on your reliability and allow you to react to
> issues in the ring before they present operational impact.
>
> You mention that you've been building HA systems for a long time -- indeed,
> far longer than I have, so I'm sure that you're also aware that good, solid
> "up/down" binaries are hard to come by, that none of this is easy, and that
> while some pointers are available (the defaults are actually quite good),
> it's essentially impossible to offer "the best production defaults" because
> they vary wildly based on your hardware, ring configuration, and read/write
> load and query patterns.
>
> To that end, you might find it more productive to begin with the defaults
> as you develop your system, and let the ring tell you how it's feeling as
> you begin load testing. Once you have stressed it to the point of failure,
> you'll see how it failed and either be able to isolate the cause and begin
> planning to handle that mode, or better yet, understand your maximum
> capacity limits given your current hardware and fire off a purchase order
> the second you see spikes nearing 80% of the total measured capacity in
> production (or apply lessons you've learned in capacity planning as
> appropriate, of course).
>
> Cassandra's a great system, but you may find that it requires a fair amount
> of active operational involvement and monitoring -- like any distributed
> system -- to maintain in a highly-reliable fashion. Each of those nines
> implies extra time and operational cost, hopefully within the boundaries of
> the revenue stream the system is expected to support.
>
> Pardon the long e-mail and for waxing a bit philosophical. I hope this
> provides some food for thought.
>
> - Scott
>
> ---
>
> C. Scott Andreas
> Engineer, Urban Airship, Inc.
> http://www.urbanairship.com
>

Re: 99.999% uptime - Operations Best Practices?

Posted by "C. Scott Andreas" <cs...@urbanairship.com>.
Hi Les,

I wanted to offer a couple thoughts on where to start and strategies for approaching development and deployment with reliability in mind.

One way that we've found to more productively think about the reliability of our data tier is to focus our thoughts away from a concept of "uptime or x nines" toward one of "error rates." Ryan mentioned that "it depends," and while brief, this is actually a very correct comment. Perhaps I can help elaborate.

Failures in systems distributed across multiple systems in multiple datacenters can rarely be described in terms of binary uptime guarantees (e.g., either everything is up or everything is down). Instead, certain nodes may be unavailable at certain times, but given appropriate read and write parameters (and their implicit tradeoffs), these service interruptions may remain transparent.

Cassandra provides a variety of tools to allow you to tune these, two of the most important of which are the consistency level for reads and writes and your replication factor. I'm sure you're familiar with these, but I mention them because thinking hard about the tradeoffs you're willing to make in terms of consistency and replication may heavily impact your operational experience if availability is of utmost importance.

Of course, the single-node operational story is very important as well. Ryan's "it depends" comment here takes on painful significance for me, as we've found that the manner in which read and write loads vary, their duration, and intensity can have very different operational profiles and failure modes. If relaxed consistency is acceptable for your reads and writes, you'll likely find querying with CL.ONE to be more "available" than QUORUM or ALL, at the cost of reduced consistency. Similarly, if it is economical for you to provision extra nodes for a higher replication factor, you will increase your ability to continue reading and writing in the event of single- or multiple-node failures.

One of the prime challenges we've faced is reducing the frequency and intensity of full garbage collections in the JVM, which tend to result in single-node unavailability. Thanks to help from Jonathan Ellis and Peter Schuller (along with a fair amount of elbow grease ourselves), we've worked through several of these issues and have arrived at a steady state that leaves the ring happy even under load. We've not found GC tuning to bring night-and-day differences outside of resolving the STW collections, but the difference is noticeable.

Occasionally, these issues will result from Cassandra's behavior itself; documented APIs such as querying for the count of all columns associated with a key will materialize the row across all nodes being queried. Once when issuing a "count" query for a key that had around 300k columns at CL.QUORUM, we knocked three nodes out of our ring by triggering a stop-the-world collection that lasted about 30 seconds, so watch out for things like that.

Some of the other tuning knobs available to you involve tradeoffs such as when to flush memtables or to trigger compactions, both of which are somewhat intensive operations that can strain a cluster under heavy read or write load, but which are equally necessary for the cluster to remain in operation. If you find yourself pushing hard against these tradeoffs and attempting to navigate a path between icebergs, it's very likely that the best answer to the problem is "more or more powerful hardware."

But a lot of this is tacit knowledge, which often comes through a bit of pain but is hopefully operationally transparent to your users.  Things that you discover once the system is live in operation and your monitoring is providing continuous feedback about the ring's health. This is where Sasha's point becomes so critical -- having advanced early-warning systems in place, watching monitoring and graphs closely even when everything's fine, and beginning to understand how it likes to operate and what it tends to do will give you a huge leg up on your reliability and allow you to react to issues in the ring before they present operational impact.

You mention that you've been building HA systems for a long time -- indeed, far longer than I have, so I'm sure that you're also aware that good, solid "up/down" binaries are hard to come by, that none of this is easy, and that while some pointers are available (the defaults are actually quite good), it's essentially impossible to offer "the best production defaults" because they vary wildly based on your hardware, ring configuration, and read/write load and query patterns.

To that end, you might find it more productive to begin with the defaults as you develop your system, and let the ring tell you how it's feeling as you begin load testing. Once you have stressed it to the point of failure, you'll see how it failed and either be able to isolate the cause and begin planning to handle that mode, or better yet, understand your maximum capacity limits given your current hardware and fire off a purchase order the second you see spikes nearing 80% of the total measured capacity in production (or apply lessons you've learned in capacity planning as appropriate, of course).

Cassandra's a great system, but you may find that it requires a fair amount of active operational involvement and monitoring -- like any distributed system -- to maintain in a highly-reliable fashion. Each of those nines implies extra time and operational cost, hopefully within the boundaries of the revenue stream the system is expected to support.

Pardon the long e-mail and for waxing a bit philosophical. I hope this provides some food for thought.

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com

On Jun 22, 2011, at 4:16 PM, Les Hazlewood wrote:

> On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin <wo...@gmail.com> wrote:
> you have to use multiple data centers to really deliver 4 or 5 9's of service
> 
> We do, hence my question, as well as my choice of Cassandra :)
> 
> Best,
> 
> Les



Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
On Wed, Jun 22, 2011 at 4:35 PM, mcasandra <mo...@gmail.com> wrote:

> might be helpful which varies from env to env. That's why I suggest look at
> the comments in cassandra.yaml and see which are applicable in your
> scenario. I learn something new everytime I read it.
>

Yep, and this was awesome - thanks very much for the reply - very helpful.


> BTW: Can you be clear as to what kind of recommendations are you referring
> to? NetworkToplogy, how many copies to store, uptime, load balancing,
> request routing when on DC is down? If you ask specific questions you might
> get better response.


Yes, this was my fault in not being specific, but I intentionally left it
open to see if anyone wanted to bring up something specific to their
environment that they thought would be valuable (e.g. 'when our nodes got to
95% memory utilization, we found that GC was doing X; setting the JVM
option foo helped us reduce problem Y').

I was mainly looking for what folks considered satisfactory initial JVM/GC
and *nix OS settings for a production node (e.g. 8 cores w/ 64 gigs of RAM,
or an EC2 'large' or 'XL' node).  E.g. what collector was used and why,
whether folks have stuck with the standard CMS collector or tried the G1
collector, and what settings made them happy after testing...
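
For concreteness, the sort of baseline I mean is the CMS-oriented set that
ships in cassandra-env.sh (quoted loosely from memory, so treat the exact
flags and values as illustrative rather than as recommendations):

  # illustrative 0.8-era cassandra-env.sh GC baseline
  JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G -Xmn800M"
  JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
  JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
  JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1"
  JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"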

Those kinds of things.  Call it a tiny 'case study' if you will.  Network
topology I thought I'd leave for a whole 'nuther discussion :)

As an aside, I definitely plan to publish our actual JVM and OS settings and
operational procedures once we find a happy medium based on our application
in the event that it might help someone else.

Thanks again!

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by mcasandra <mo...@gmail.com>.
Les Hazlewood wrote:
> 
> I have architected, built and been responsible for systems that support
> 4-5
> 9s for years. 
> 

So have most of us. But probably by now it should be clear that no
technology can provide concrete recommendations. They can only provide what
might be helpful, which varies from env to env. That's why I suggest looking
at the comments in cassandra.yaml and seeing which are applicable in your
scenario. I learn something new every time I read it.

BTW: Can you be clear as to what kind of recommendations you are referring
to? NetworkTopology, how many copies to store, uptime, load balancing,
request routing when one DC is down? If you ask specific questions you might
get a better response.

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/99-999-uptime-Operations-Best-Practices-tp6506227p6506565.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
I have architected, built and been responsible for systems that support 4-5
9s for years.  This discussion is not about how to do that generally.  It
was intended to be about concrete techniques that have been found valuable
when deploying Cassandra in HA environments beyond what is documented in [1]
and [2].

Cheers,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Peter Lin <wo...@gmail.com>.
so having multiple data centers is step 1 of 4/5 9's.

I've worked on some services that had 3-4 9's SLA. Getting there is
really tough, as others have stated. you have to have auditing built into
your service, capacity metrics, capacity planning, some kind of
real-time monitoring, staff to respond to alerts, a plan for handling
system failures, training to handle outages and a dozen other things.

your best choice is to hire someone that has built a system that
supports 4-5 9's and patiently work to get there.


On Wed, Jun 22, 2011 at 7:16 PM, Les Hazlewood <le...@katasoft.com> wrote:
> On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin <wo...@gmail.com> wrote:
>>
>> you have to use multiple data centers to really deliver 4 or 5 9's of
>> service
>
> We do, hence my question, as well as my choice of Cassandra :)
> Best,
> Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin <wo...@gmail.com> wrote:

> you have to use multiple data centers to really deliver 4 or 5 9's of
> service
>

We do, hence my question, as well as my choice of Cassandra :)

Best,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
Forget the 5 9's - I apologize for even writing that.  It was my shorthand
way of saying 'this can never go down'.  I'm not asking for philosophical
advice - I've been doing large scale enterprise deployments for over 10
years.  I 'get' the 'it depends' and 'do your homework' philosophy.

All I'm asking for is concrete techniques that anyone might wish to share
that they've found valuable beyond what is currently written in the existing
operations documentation in [1] and [2].

If no one wants to share that, that's totally cool - no need to derail the
thread into a different discussion.

Thanks,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by mcasandra <mo...@gmail.com>.
In my opinion 5 9s don't matter. It's the number of impacted customers that
does. You might be down during peak for 5 minutes, turning away thousands of
customers, while being down during the night turns away only a few.

There is no magic bullet. It's all about learning and improving. You will
not get HA right away, but over a period of time, as you learn and improve,
you will do better.

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/99-999-uptime-Operations-Best-Practices-tp6506227p6506511.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: 99.999% uptime - Operations Best Practices?

Posted by Peter Lin <wo...@gmail.com>.
you have to use multiple data centers to really deliver 4 or 5 9's of service


On Wed, Jun 22, 2011 at 7:09 PM, Edward Capriolo <ed...@gmail.com> wrote:
> Committing to that many 9s is going to be impossible since as far as I
> know no internet service provier will sla you more the 2 9s . You can
> not have more uptime then your isp.
>
> On Wednesday, June 22, 2011, Chris Burroughs <ch...@gmail.com> wrote:
>> On 06/22/2011 05:33 PM, Les Hazlewood wrote:
>>> Just to be clear:
>>>
>>> I understand that resources like [1] and [2] exist, and I've read them.  I'm
>>> just wondering if there are any 'gotchas' that might be missing from that
>>> documentation that should be considered and if there are any recommendations
>>> in addition to these documents.
>>>
>>> Thanks,
>>>
>>> Les
>>>
>>> [1] http://www.datastax.com/docs/0.8/operations/index
>>> [2] http://wiki.apache.org/cassandra/Operations
>>>
>>
>> Well if they knew some secret gotcha, the dutiful cassandra operators of
>> the world would update the wiki.
>>
>> The closest thing to a 'gotcha' is that neither Cassandra nor any other
>> technology is going to get you those nines.  Humans will need to commit
>> to reading the mailing lists, following JIRA, and understanding what the
>> code is doing.  And humans will need to commit to combine that
>> understanding with monitoring and alerting to figure out all of the "it
>> depends" for your particular case.
>>
>

Re: 99.999% uptime - Operations Best Practices?

Posted by Edward Capriolo <ed...@gmail.com>.
Committing to that many 9s is going to be impossible, since as far as I
know no internet service provider will SLA you more than 2 9s. You cannot
have more uptime than your ISP.

On Wednesday, June 22, 2011, Chris Burroughs <ch...@gmail.com> wrote:
> On 06/22/2011 05:33 PM, Les Hazlewood wrote:
>> Just to be clear:
>>
>> I understand that resources like [1] and [2] exist, and I've read them.  I'm
>> just wondering if there are any 'gotchas' that might be missing from that
>> documentation that should be considered and if there are any recommendations
>> in addition to these documents.
>>
>> Thanks,
>>
>> Les
>>
>> [1] http://www.datastax.com/docs/0.8/operations/index
>> [2] http://wiki.apache.org/cassandra/Operations
>>
>
> Well if they knew some secret gotcha, the dutiful cassandra operators of
> the world would update the wiki.
>
> The closest thing to a 'gotcha' is that neither Cassandra nor any other
> technology is going to get you those nines.  Humans will need to commit
> to reading the mailing lists, following JIRA, and understanding what the
> code is doing.  And humans will need to commit to combine that
> understanding with monitoring and alerting to figure out all of the "it
> depends" for your particular case.
>

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
Yep, that was [2] on my existing list.  Thanks very much for actually
addressing my question - it is greatly appreciated!

If anyone else has examples they'd like to share (like their own cron
techniques, or JVM settings and why, etc), I'd love to hear them!

Best regards,

Les

On Wed, Jun 22, 2011 at 4:24 PM, mcasandra <mo...@gmail.com> wrote:

> Start with reading comments on cassandra.yaml and
> http://wiki.apache.org/cassandra/Operations
> http://wiki.apache.org/cassandra/Operations
>
> As far as I know there is no comprehensive list for performance tuning.
> More
> specifically common setting applicable to everyone. For most part issues
> revolve around compactions and GC tuning.
>

Re: 99.999% uptime - Operations Best Practices?

Posted by mcasandra <mo...@gmail.com>.
Start with reading the comments in cassandra.yaml and
http://wiki.apache.org/cassandra/Operations

As far as I know there is no comprehensive list for performance tuning - more
specifically, no common settings applicable to everyone. For the most part,
issues revolve around compactions and GC tuning.

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/99-999-uptime-Operations-Best-Practices-tp6506227p6506529.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: 99.999% uptime - Operations Best Practices?

Posted by Chris Burroughs <ch...@gmail.com>.
On 06/22/2011 10:03 PM, Edward Capriolo wrote:
> I have not read the original thread concerning the problem you mentioned.
> One way to avoid OOM is large amounts of RAM :) On a more serious note most
> OOM's are caused by setting caches or memtables too large. If the OOM was
> caused by a software bug, the cassandra devs are on the ball and move fast.
> I still suggest not jumping into a release right away. 

For what it's worth, that particular thread was about the kernel OOM
killer, which is a good example of the kind of gotcha that has caused
several people to chime in about the importance of monitoring both
Cassandra and the OS.
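
For example, a quick way to tell whether it was the kernel OOM killer rather
than a Java-level OutOfMemoryError (log locations vary by distro):

  # look for OOM-killer activity in the kernel and system logs
  dmesg | grep -i "out of memory"
  grep -i "oom-killer\|killed process" /var/log/messages /var/log/syslog 2>/dev/null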

Re: 99.999% uptime - Operations Best Practices?

Posted by Karl Hiramoto <ka...@hiramoto.org>.
On 06/23/11 09:43, David Boxenhorn wrote:
> I think very high uptime, and very low data loss is achievable in
> Cassandra, but, for new users there are TONS of gotchas. You really
> have to know what you're doing, and I doubt that many people acquire
> that knowledge without making a lot of mistakes.
>
> I see above that most people are talking about configuration issues.
> But, the first thing that you will probably do, before you have any
> experience with Cassandra(!), is architect your system. Architecture
> is not easily changed when you bump into a gotcha, and for some reason
> you really have to search the literature well to find out about them.
> So, my contributions:
>
> The too many CFs problem. Cassandra doesn't do well with many column
> families. If you come from a relational world, a real application can
> easily have hundreds of tables. Even if you combine them into entities
> (which is the Cassandra way), you can easily end up with dozens of
> entities. The most natural thing for someone with a relational
> background is have one CF per entity, plus indexes according to your
> needs. Don't do it. You need to store multiple entities in the same
> CF. Group them together according to access patterns (i.e. when you
> use X,  you probably also need Y), and distinguish them by adding a
> prefix to their keys (e.g. entityName@key).

While avoiding too many CFs is a good idea, I would also advise
against a very large CF.  Keeping CF size down helps speed up
repair and compaction.


--
Karl

Re: 99.999% uptime - Operations Best Practices?

Posted by David Boxenhorn <da...@citypath.com>.
I think very high uptime, and very low data loss is achievable in
Cassandra, but, for new users there are TONS of gotchas. You really
have to know what you're doing, and I doubt that many people acquire
that knowledge without making a lot of mistakes.

I see above that most people are talking about configuration issues.
But, the first thing that you will probably do, before you have any
experience with Cassandra(!), is architect your system. Architecture
is not easily changed when you bump into a gotcha, and for some reason
you really have to search the literature well to find out about them.
So, my contributions:

The too many CFs problem. Cassandra doesn't do well with many column
families. If you come from a relational world, a real application can
easily have hundreds of tables. Even if you combine them into entities
(which is the Cassandra way), you can easily end up with dozens of
entities. The most natural thing for someone with a relational
background is to have one CF per entity, plus indexes according to your
needs. Don't do it. You need to store multiple entities in the same
CF. Group them together according to access patterns (i.e. when you
use X, you probably also need Y), and distinguish them by adding a
prefix to their keys (e.g. entityName@key).

Don't use supercolumns, use composite columns. Supercolumns are
disfavored by the Cassandra community and are slowly being orphaned.
For example, secondary indexes don't work on supercolumns. Nor does
CQL. Bugs crop up with supercolumns that don't happen with regular
columns because internally there's a huge separate code base for
supercolumns, and every new feature is designed and implemented for
regular columns and then retrofitted for supercolumns (or not).

There should really be a database of gotchas somewhere, and how they
were solved...

On Thu, Jun 23, 2011 at 6:57 AM, Les Hazlewood <le...@katasoft.com> wrote:
> Edward,
> Thank you so much for this reply - this is great stuff, and I really
> appreciate it.
> You'll be happy to know that I've already pre-ordered your book.  I'm
> looking forward to it! (When is the ship date?)
> Best regards,
> Les
>
> On Wed, Jun 22, 2011 at 7:03 PM, Edward Capriolo <ed...@gmail.com>
> wrote:
>>
>>
>> On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood <le...@katasoft.com> wrote:
>>>
>>> Hi Thoku,
>>> You were able to more concisely represent my intentions (and their
>>> reasoning) in this thread than I was able to do so myself.  Thanks!
>>>
>>> On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen <th...@gmail.com> wrote:
>>>>
>>>> I think that Les's question was reasonable. Why *not* ask the community
>>>> for the 'gotchas'?
>>>> Whether the info is already documented or not, it could be an
>>>> opportunity to improve the documentation based on users' perception.
>>>> The "you just have to learn" responses are fair also, but that reminds
>>>> me of the days when running Oracle was a black art, and accumulated wisdom
>>>> made DBAs irreplaceable.
>>>
>>> Yes, this was my initial concern.  I know that Cassandra is still young,
>>> and I expect this to be the norm for a while, but I was hoping to make that
>>> process a bit easier (for me and anyone else reading this thread in the
>>> future).
>>>>
>>>> Some recommendations *are* documented, but they are dispersed / stale /
>>>> contradictory / or counter-intuitive.
>>>> Others have not been documented in the wiki nor in DataStax's doco, and
>>>> are instead learned anecdotally or The Hard Way.
>>>> For example, whether documented or not, some of the 'gotchas' that I
>>>> encountered when I first started working with Cassandra were:
>>>> * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this, Jira says
>>>> that).
>>>> * Its not viable to run without JNA installed.
>>>> * Disable swap memory.
>>>> * Need to run nodetool repair on a regular basis.
>>>> I'm looking forward to Edward Capriolo's Cassandra book which Les will
>>>> probably find helpful.
>>>
>>> Thanks for linking to this.  I'm pre-ordering right away.
>>> And thanks for the pointers, they are exactly the kind of enumerated
>>> things I was looking to elicit.  These are the kinds of things that are hard
>>> to track down in a single place.  I think it'd be nice for the community to
>>> contribute this stuff to a single page ('best practices', 'checklist',
>>> whatever you want to call it).  It would certainly make things easier when
>>> getting started.
>>> Thanks again,
>>> Les
>>
>> Since I got a plug on the book I will chip in again to the thread :)
>>
>> Some things that were mentioned already:
>>
>> Install JNA absolutely (without JNA the snapshot command has to fork to
>> hard link the sstables, I have seen clients backoff from this). Also the
>> performance focused Cassandra devs always try to squeeze out performance by
>> utilizing more native features.
>>
>> OpenJDK vs Sun. I agree, almost always try to do what 'most others' do in
>> production, this way you get surprised less.
>>
>> Other stuff:
>>
>> RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0
>> has better performance, but if you lose a node your capacity is diminished,
>> rebuilding and rejoining a node involves more manpower more steps and more
>> chances for human error.
>>
>> Collect statistics on the normal system items CPU, disk (size and
>> utilization), memory. Then collect the JMX cassandra counters and understand
>> how they interact. For example record ReadCount and WriteCount per column
>> family, then use try to determine how this effects disk utilization. You can
>> use this for capacity planning. Then try using a key/row cache. Evaluate
>> again. Check the hit rate graph for your new cache. How did this effect your
>> disk? You want to head off anything that can be a performance killer like
>> traffic patterns changing or data growing significantly.
>>
>> Do not be short on hardware. I do not want to say "overbuy" but if uptime
>> is important have spares drives and servers and have room to grow.
>>
>> Balance that ring :)
>>
>> I have not read the original thread concerning the problem you mentioned.
>> One way to avoid OOM is large amounts of RAM :) On a more serious note most
>> OOM's are caused by setting caches or memtables too large. If the OOM was
>> caused by a software bug, the cassandra devs are on the ball and move fast.
>> I still suggest not jumping into a release right away. I know its hard to
>> live without counters or CQL since new things are super cool. But if you
>> want all those 9s your going to have to stay disciplined. Unless a release
>> has a fix for a problem you think you have, stay a minor or revision back,
>> or at least wait some time before upgrading to it, and do some internal
>> confidence testing before pulling the trigger on an update.
>>
>> Almost all usecases demand that repair be run regularly due to the nature
>> of distributed deletes.
>>
>> Other good tips, subscribe to all the mailing lists, and hang out in the
>> IRC channels cassandra, cassandra-dev, cassandra-ops. You get an osmoses
>> learning effect and you learn to fix or head off issues you never had.
>

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
Edward,

Thank you so much for this reply - this is great stuff, and I really
appreciate it.

You'll be happy to know that I've already pre-ordered your book.  I'm
looking forward to it! (When is the ship date?)

Best regards,

Les

On Wed, Jun 22, 2011 at 7:03 PM, Edward Capriolo <ed...@gmail.com> wrote:

>
>
> On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood <le...@katasoft.com> wrote:
>
>> Hi Thoku,
>>
>> You were able to more concisely represent my intentions (and their
>> reasoning) in this thread than I was able to do so myself.  Thanks!
>>
>> On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen <th...@gmail.com> wrote:
>>
>>> I think that Les's question was reasonable. Why *not* ask the community
>>> for the 'gotchas'?
>>>
>>> Whether the info is already documented or not, it could be an opportunity
>>> to improve the documentation based on users' perception.
>>>
>>> The "you just have to learn" responses are fair also, but that reminds me
>>> of the days when running Oracle was a black art, and accumulated wisdom made
>>> DBAs irreplaceable.
>>>
>>
>> Yes, this was my initial concern.  I know that Cassandra is still young,
>> and I expect this to be the norm for a while, but I was hoping to make that
>> process a bit easier (for me and anyone else reading this thread in the
>> future).
>>
>> Some recommendations *are* documented, but they are dispersed / stale /
>>> contradictory / or counter-intuitive.
>>>
>>> Others have not been documented in the wiki nor in DataStax's doco, and
>>> are instead learned anecdotally or The Hard Way.
>>>
>>> For example, whether documented or not, some of the 'gotchas' that I
>>> encountered when I first started working with Cassandra were:
>>>
>>> * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this<http://wiki.apache.org/cassandra/GettingStarted>
>>> , Jira says that <https://issues.apache.org/jira/browse/CASSANDRA-2441>
>>> ).
>>> * Its not viable to run without JNA installed.
>>> * Disable swap memory.
>>> * Need to run nodetool repair on a regular basis.
>>>
>>> I'm looking forward to Edward Capriolo's Cassandra book<https://www.packtpub.com/cassandra-apache-high-performance-cookbook/book> which
>>> Les will probably find helpful.
>>>
>>
>> Thanks for linking to this.  I'm pre-ordering right away.
>>
>> And thanks for the pointers, they are exactly the kind of enumerated
>> things I was looking to elicit.  These are the kinds of things that are hard
>> to track down in a single place.  I think it'd be nice for the community to
>> contribute this stuff to a single page ('best practices', 'checklist',
>> whatever you want to call it).  It would certainly make things easier when
>> getting started.
>>
>> Thanks again,
>>
>> Les
>>
>
> Since I got a plug on the book I will chip in again to the thread :)
>
> Some things that were mentioned already:
>
> Install JNA absolutely (without JNA the snapshot command has to fork to
> hard link the sstables, I have seen clients backoff from this). Also the
> performance focused Cassandra devs always try to squeeze out performance by
> utilizing more native features.
>
> OpenJDK vs Sun. I agree, almost always try to do what 'most others' do in
> production, this way you get surprised less.
>
> Other stuff:
>
> RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0
> has better performance, but if you lose a node your capacity is diminished,
> rebuilding and rejoining a node involves more manpower more steps and more
> chances for human error.
>
> Collect statistics on the normal system items CPU, disk (size and
> utilization), memory. Then collect the JMX cassandra counters and understand
> how they interact. For example record ReadCount and WriteCount per column
> family, then use try to determine how this effects disk utilization. You can
> use this for capacity planning. Then try using a key/row cache. Evaluate
> again. Check the hit rate graph for your new cache. How did this effect your
> disk? You want to head off anything that can be a performance killer like
> traffic patterns changing or data growing significantly.
>
> Do not be short on hardware. I do not want to say "overbuy" but if uptime
> is important have spares drives and servers and have room to grow.
>
> Balance that ring :)
>
> I have not read the original thread concerning the problem you mentioned.
> One way to avoid OOM is large amounts of RAM :) On a more serious note most
> OOM's are caused by setting caches or memtables too large. If the OOM was
> caused by a software bug, the cassandra devs are on the ball and move fast.
> I still suggest not jumping into a release right away. I know its hard to
> live without counters or CQL since new things are super cool. But if you
> want all those 9s your going to have to stay disciplined. Unless a release
> has a fix for a problem you think you have, stay a minor or revision back,
> or at least wait some time before upgrading to it, and do some internal
> confidence testing before pulling the trigger on an update.
>
> Almost all use cases demand that repair be run regularly due to the nature
> of distributed deletes.
>
> Other good tips: subscribe to all the mailing lists, and hang out in the
> IRC channels cassandra, cassandra-dev, cassandra-ops. You get an osmosis
> learning effect and you learn to fix or head off issues you never had.
>

Re: 99.999% uptime - Operations Best Practices?

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood <le...@katasoft.com> wrote:

> Hi Thoku,
>
> You were able to more concisely represent my intentions (and their
> reasoning) in this thread than I was able to do so myself.  Thanks!
>
> On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen <th...@gmail.com> wrote:
>
>> I think that Les's question was reasonable. Why *not* ask the community
>> for the 'gotchas'?
>>
>> Whether the info is already documented or not, it could be an opportunity
>> to improve the documentation based on users' perception.
>>
>> The "you just have to learn" responses are fair also, but that reminds me
>> of the days when running Oracle was a black art, and accumulated wisdom made
>> DBAs irreplaceable.
>>
>
> Yes, this was my initial concern.  I know that Cassandra is still young,
> and I expect this to be the norm for a while, but I was hoping to make that
> process a bit easier (for me and anyone else reading this thread in the
> future).
>
> Some recommendations *are* documented, but they are dispersed / stale /
>> contradictory / or counter-intuitive.
>>
>> Others have not been documented in the wiki nor in DataStax's doco, and
>> are instead learned anecdotally or The Hard Way.
>>
>> For example, whether documented or not, some of the 'gotchas' that I
>> encountered when I first started working with Cassandra were:
>>
>> * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this<http://wiki.apache.org/cassandra/GettingStarted>
>> , Jira says that <https://issues.apache.org/jira/browse/CASSANDRA-2441>).
>> * It's not viable to run without JNA installed.
>> * Disable swap memory.
>> * Need to run nodetool repair on a regular basis.
>>
>> I'm looking forward to Edward Capriolo's Cassandra book<https://www.packtpub.com/cassandra-apache-high-performance-cookbook/book> which
>> Les will probably find helpful.
>>
>
> Thanks for linking to this.  I'm pre-ordering right away.
>
> And thanks for the pointers, they are exactly the kind of enumerated things
> I was looking to elicit.  These are the kinds of things that are hard to
> track down in a single place.  I think it'd be nice for the community to
> contribute this stuff to a single page ('best practices', 'checklist',
> whatever you want to call it).  It would certainly make things easier when
> getting started.
>
> Thanks again,
>
> Les
>

Since I got a plug on the book I will chip in again to the thread :)

Some things that were mentioned already:

Install JNA, absolutely (without JNA the snapshot command has to fork to
hard-link the sstables; I have seen clients back off from this). Also, the
performance-focused Cassandra devs always try to squeeze out performance by
utilizing more native features.

OpenJDK vs Sun. I agree, almost always try to do what 'most others' do in
production, this way you get surprised less.

Other stuff:

RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0 has
better performance, but if you lose a node your capacity is diminished, and
rebuilding and rejoining a node involves more manpower, more steps, and more
chances for human error.

Collect statistics on the normal system items: CPU, disk (size and
utilization), memory. Then collect the Cassandra JMX counters and understand
how they interact. For example, record ReadCount and WriteCount per column
family, then try to determine how this affects disk utilization. You can use
this for capacity planning. Then try using a key/row cache. Evaluate again.
Check the hit rate graph for your new cache. How did this affect your disk?
You want to head off anything that can be a performance killer, like traffic
patterns changing or data growing significantly.
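
As a rough illustration of pulling those counters, here is a minimal JMX
poller. The host, port, and ObjectName pattern are assumptions; verify them
against your own version with jconsole before relying on them:

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Prints ReadCount/WriteCount per column family so they can be fed into
    // whatever graphing system you already run.
    public class CfCounterPoller {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-host:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Match every column family MBean; adjust the pattern if your
                // version registers them under a different type name.
                Set<ObjectName> names = mbs.queryNames(new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,*"), null);
                for (ObjectName name : names) {
                    long reads  = (Long) mbs.getAttribute(name, "ReadCount");
                    long writes = (Long) mbs.getAttribute(name, "WriteCount");
                    System.out.println(name.getKeyProperty("keyspace") + "."
                        + name.getKeyProperty("columnfamily")
                        + " reads=" + reads + " writes=" + writes);
                }
            } finally {
                connector.close();
            }
        }
    }

Graphing the deltas between runs is usually enough to spot the trends
described above.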

Do not be short on hardware. I do not want to say "overbuy", but if uptime is
important, have spare drives and servers and have room to grow.

Balance that ring :)

I have not read the original thread concerning the problem you mentioned.
One way to avoid OOM is large amounts of RAM :) On a more serious note, most
OOMs are caused by setting caches or memtables too large. If the OOM was
caused by a software bug, the Cassandra devs are on the ball and move fast.
I still suggest not jumping into a release right away. I know it's hard to
live without counters or CQL since new things are super cool. But if you
want all those 9s, you're going to have to stay disciplined. Unless a release
has a fix for a problem you think you have, stay a minor version or revision
back, or at least wait some time before upgrading to it, and do some internal
confidence testing before pulling the trigger on an update.
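
If you want an early warning rather than a post-mortem, polling the Cassandra
JVM's heap over JMX is cheap. A minimal sketch, assuming the default JMX port
and an arbitrary 85% threshold (tune both for your environment):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Reads the standard java.lang:type=Memory MBean from a remote Cassandra
    // JVM and flags a heap that is close to its limit.
    public class HeapWatch {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-host:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbs, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
                MemoryUsage heap = memory.getHeapMemoryUsage();
                double ratio = (double) heap.getUsed() / heap.getMax();
                System.out.printf("heap used=%dMB max=%dMB (%.0f%%)%n",
                    heap.getUsed() >> 20, heap.getMax() >> 20, ratio * 100);
                if (ratio > 0.85) {
                    // Hook this into your alerting instead of printing.
                    System.err.println("WARNING: heap above 85% of max");
                }
            } finally {
                connector.close();
            }
        }
    }

A heap that sits near its ceiling for long stretches usually means cache or
memtable sizes need revisiting before an OutOfMemoryError shows up.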

Almost all use cases demand that repair be run regularly due to the nature of
distributed deletes.

Other good tips: subscribe to all the mailing lists, and hang out in the IRC
channels cassandra, cassandra-dev, cassandra-ops. You get an osmosis learning
effect and you learn to fix or head off issues you never had.

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
Hi Thoku,

You were able to more concisely represent my intentions (and their
reasoning) in this thread than I was able to do so myself.  Thanks!

On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen <th...@gmail.com> wrote:

> I think that Les's question was reasonable. Why *not* ask the community for
> the 'gotchas'?
>
> Whether the info is already documented or not, it could be an opportunity
> to improve the documentation based on users' perception.
>
> The "you just have to learn" responses are fair also, but that reminds me
> of the days when running Oracle was a black art, and accumulated wisdom made
> DBAs irreplaceable.
>

Yes, this was my initial concern.  I know that Cassandra is still young, and
I expect this to be the norm for a while, but I was hoping to make that
process a bit easier (for me and anyone else reading this thread in the
future).

Some recommendations *are* documented, but they are dispersed / stale /
> contradictory / or counter-intuitive.
>
> Others have not been documented in the wiki nor in DataStax's doco, and are
> instead learned anecdotally or The Hard Way.
>
> For example, whether documented or not, some of the 'gotchas' that I
> encountered when I first started working with Cassandra were:
>
> * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this<http://wiki.apache.org/cassandra/GettingStarted>
> , Jira says that <https://issues.apache.org/jira/browse/CASSANDRA-2441>).
> * It's not viable to run without JNA installed.
> * Disable swap memory.
> * Need to run nodetool repair on a regular basis.
>
> I'm looking forward to Edward Capriolo's Cassandra book<https://www.packtpub.com/cassandra-apache-high-performance-cookbook/book> which
> Les will probably find helpful.
>

Thanks for linking to this.  I'm pre-ordering right away.

And thanks for the pointers, they are exactly the kind of enumerated things
I was looking to elicit.  These are the kinds of things that are hard to
track down in a single place.  I think it'd be nice for the community to
contribute this stuff to a single page ('best practices', 'checklist',
whatever you want to call it).  It would certainly make things easier when
getting started.

Thanks again,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Thoku Hansen <th...@gmail.com>.
I think that Les's question was reasonable. Why *not* ask the community for the 'gotchas'?

Whether the info is already documented or not, it could be an opportunity to improve the documentation based on users' perception.

The "you just have to learn" responses are fair also, but that reminds me of the days when running Oracle was a black art, and accumulated wisdom made DBAs irreplaceable.

Some recommendations *are* documented, but they are dispersed / stale / contradictory / or counter-intuitive.

Others have not been documented in the wiki nor in DataStax's doco, and are instead learned anecdotally or The Hard Way.

For example, whether documented or not, some of the 'gotchas' that I encountered when I first started working with Cassandra were:

* Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this, Jira says that).
* It's not viable to run without JNA installed.
* Disable swap memory.
* Need to run nodetool repair on a regular basis.

I'm looking forward to Edward Capriolo's Cassandra book which Les will probably find helpful.

On Jun 22, 2011, at 7:12 PM, Les Hazlewood wrote:

> >
> > [1] http://www.datastax.com/docs/0.8/operations/index
> > [2] http://wiki.apache.org/cassandra/Operations
> >
> 
> > Well, if they knew some secret gotcha, the dutiful Cassandra operators of
> > the world would update the wiki.
> 
> As I am new to the Cassandra community, I don't know how 'dutifully' this is maintained.  My questions were not unreasonable given the nature of open-source documentation.  All I was looking for was what people thought were best practices based on their own production experience.
> 
> Telling me to read the mailing lists and follow the issue tracker and use monitoring software is all great and fine - and I do all of these things today already - but this is a philosophical recommendation that does not actually address my question.  So I chalk this up as an error on my side in not being clear in my question - my apologies.  Let me reformulate it :)
> 
> Does anyone out there have any concrete recommended techniques or insights in maintaining a HA Cassandra cluster that you've gained based on production experience beyond what is described in the 2 links above?
> 
> Thanks,
> 
> Les


Re: 99.999% uptime - Operations Best Practices?

Posted by Chris Burroughs <ch...@gmail.com>.
On 06/23/2011 02:00 PM, Les Hazlewood wrote:
> This leads me to believe that Cassandra may not be a good idea for a primary
> OLTP data store.  For example "only create a user object if email foo is not
> already in use" or, more generally, "you can't create object X because one
> with an existing constraint already exists".
> 
> Is that a fair assumption?

I think so.  Lacking the built-in "T" of OLTP (transactions), the amount of
hard thinking you will have to do increases as you want to maintain more
constraints.  The obvious trade-off is that instead of transactions you get
that distributed horizontal scalability stuff with Cassandra.
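
For example, "only create a user if this email is unused" turns into a read
followed by a write, and two clients can interleave those steps. A minimal
sketch of the race; KeyValueStore here is a hypothetical stand-in, not a real
Cassandra client API:

    // Two app servers can both observe "absent" before either write lands,
    // so both calls "succeed" and the uniqueness constraint is silently lost.
    public class UniqueEmailRace {

        // Hypothetical client interface standing in for whatever driver you use.
        interface KeyValueStore {
            String get(String key);            // null if the key is absent
            void put(String key, String value);
        }

        static boolean createUser(KeyValueStore store, String email, String userId) {
            if (store.get("email:" + email) != null) {
                return false;                  // looks taken; reject
            }
            // Window: another client may insert the same email right here.
            store.put("email:" + email, userId);
            return true;
        }

        public static void main(String[] args) {
            // Wire in a real client and call createUser concurrently from two
            // processes to see the constraint violated.
        }
    }

Enforcing that kind of constraint needs coordination outside Cassandra (a
lock service such as ZooKeeper, for instance) or a data model that tolerates
and later reconciles duplicates.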

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
>
> In the spirit of your re-formulated questions:
>>  - Read-before-write is a Cassandra anti-pattern, avoid it if at all
>> possible.
>>
>
> This leads me to believe that Cassandra may not be a good idea for a
> primary OLTP data store.  For example "only create a user object if email
> foo is not already in use" or, more generally, "you can't create object X
> because one with an existing constraint already exists".
>
> Is that a fair assumption?
>

Actually, this may not be true, at least using Digg and Twitter as examples.
 I'd assume those apps are far more read-heavy than they are write-heavy,
but I wouldn't know for sure.

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
On Thu, Jun 23, 2011 at 10:41 AM, Chris Burroughs <chris.burroughs@gmail.com
> wrote:
>
>
> In the spirit of your re-formulated questions:
>  - Read-before-write is a Cassandra anti-pattern, avoid it if at all
> possible.
>

This leads me to believe that Cassandra may not be a good idea for a primary
OLTP data store.  For example "only create a user object if email foo is not
already in use" or, more generally, "you can't create object X because one
with an existing constraint already exists".

Is that a fair assumption?

Thanks,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
Great stuff Chris - thanks so much for the feedback!

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Chris Burroughs <ch...@gmail.com>.
On 06/22/2011 07:12 PM, Les Hazlewood wrote:
> Telling me to read the mailing lists and follow the issue tracker and use
> monitoring software is all great and fine - and I do all of these things
> today already - but this is a philosophical recommendation that does not
> actually address my question.  So I chalk this up as an error on my side in
> not being clear in my question - my apologies.  Let me reformulate it :)

For what it's worth, that was intended as a concrete suggestion.  We adopted
Cassandra a year ago, when (IMHO) it would have been a mistake to do so
without the willingness to develop sufficient in-house expertise to
internally patch/fork/debug if needed.  Things are more mature now, best
practices more widespread, etc., but you should judge that for yourself.

In the spirit of your re-formulated questions:
 - Read-before-write is a Cassandra anti-pattern, avoid it if at all
possible.
 - Those optional lines in the env script about GC logging?  Uncomment
them on at least some of your boxes (a JMX-based alternative is sketched
after this list).
 - use MLOCKALL+mmap, or standard io, but not mmap without MLOCKALL.
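
On the GC logging point above: if shipping log files around is awkward, the
same visibility is available by polling the standard GarbageCollector MBeans
over JMX. A rough sketch, with the host and port as assumptions:

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Prints cumulative collection counts and times per collector; graphing
    // the deltas between runs makes long or frequent pauses obvious.
    public class GcWatch {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-host:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                Set<ObjectName> collectors = mbs.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null);
                for (ObjectName gc : collectors) {
                    long count  = (Long) mbs.getAttribute(gc, "CollectionCount");
                    long millis = (Long) mbs.getAttribute(gc, "CollectionTime");
                    System.out.println(gc.getKeyProperty("name")
                        + " collections=" + count + " totalTimeMs=" + millis);
                }
            } finally {
                connector.close();
            }
        }
    }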

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
>
> >
> > [1] http://www.datastax.com/docs/0.8/operations/index
> > [2] http://wiki.apache.org/cassandra/Operations
> >
>
> Well, if they knew some secret gotcha, the dutiful Cassandra operators of
> the world would update the wiki.
>

As I am new to the Cassandra community, I don't know how 'dutifully' this is
maintained.  My questions were not unreasonable given the nature of
open-source documentation.  All I was looking for was what people thought
were best practices based on their own production experience.

Telling me to read the mailing lists and follow the issue tracker and use
monitoring software is all great and fine - and I do all of these things
today already - but this is a philosophical recommendation that does not
actually address my question.  So I chalk this up as an error on my side in
not being clear in my question - my apologies.  Let me reformulate it :)

Does anyone out there have any concrete recommended techniques or insights
in maintaining a HA Cassandra cluster that you've gained based on production
experience beyond what is described in the 2 links above?

Thanks,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Chris Burroughs <ch...@gmail.com>.
On 06/22/2011 05:33 PM, Les Hazlewood wrote:
> Just to be clear:
> 
> I understand that resources like [1] and [2] exist, and I've read them.  I'm
> just wondering if there are any 'gotchas' that might be missing from that
> documentation that should be considered and if there are any recommendations
> in addition to these documents.
> 
> Thanks,
> 
> Les
> 
> [1] http://www.datastax.com/docs/0.8/operations/index
> [2] http://wiki.apache.org/cassandra/Operations
> 

Well, if they knew some secret gotcha, the dutiful Cassandra operators of
the world would update the wiki.

The closest thing to a 'gotcha' is that neither Cassandra nor any other
technology is going to get you those nines.  Humans will need to commit
to reading the mailing lists, following JIRA, and understanding what the
code is doing.  And humans will need to commit to combining that
understanding with monitoring and alerting to figure out all of the "it
depends" for your particular case.

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
Just to be clear:

I understand that resources like [1] and [2] exist, and I've read them.  I'm
just wondering if there are any 'gotchas' that might be missing from that
documentation that should be considered and if there are any recommendations
in addition to these documents.

Thanks,

Les

[1] http://www.datastax.com/docs/0.8/operations/index
[2] http://wiki.apache.org/cassandra/Operations

Re: 99.999% uptime - Operations Best Practices?

Posted by Will Oberman <ob...@civicscience.com>.
Sadly, they all went down within minutes of each other.

Sent from my iPhone

On Jun 22, 2011, at 6:16 PM, Sasha Dolgy <sd...@gmail.com> wrote:

> Implement monitoring and be proactive... that will stop you waking up
> to a big surprise.  I'm sure there were symptoms leading up to all 4
> nodes going down.  Willing to wager that each node went down at
> different times and not all went down at once...
>
> On Jun 22, 2011 11:50 PM, "Les Hazlewood" <le...@katasoft.com> wrote:
> > I understand that every environment is different and it always 'depends' :)
> > But recommending settings and techniques based on an existing real
> > production environment (like the user's suggestion to run nodetool repair as
> > a regular cron job) is always a better starting point for a new Cassandra
> > evaluator than having to start from scratch.
> >
> > Ryan, do you have any 'seed' settings that you guys use for nodes at
> > Twitter?
> >
> > Are there any resources/write-ups beyond the two I've listed already that
> > address some of these 'gotchas'? If those two links are in fact the ideal
> > starting point, that's fine - but it appears that this may not be the case
> > however based on the aforementioned user as well as the other who helped him
> > who saw similar warning signs.
> >
> > I'm hoping for someone to dispel these reports based on what people actually
> > do in production today. Any info/settings/recommendations based on real
> > production environments would be appreciated!
> >
> > Thanks again,
> >
> > Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Sasha Dolgy <sd...@gmail.com>.
Implement monitoring and be proactive... that will stop you waking up to a
big surprise.  I'm sure there were symptoms leading up to all 4 nodes going
down.  Willing to wager that each node went down at different times and not
all went down at once...
On Jun 22, 2011 11:50 PM, "Les Hazlewood" <le...@katasoft.com> wrote:
> I understand that every environment is different and it always 'depends' :)
> But recommending settings and techniques based on an existing real
> production environment (like the user's suggestion to run nodetool repair as
> a regular cron job) is always a better starting point for a new Cassandra
> evaluator than having to start from scratch.
>
> Ryan, do you have any 'seed' settings that you guys use for nodes at
> Twitter?
>
> Are there any resources/write-ups beyond the two I've listed already that
> address some of these 'gotchas'? If those two links are in fact the ideal
> starting point, that's fine - but it appears that this may not be the case
> however based on the aforementioned user as well as the other who helped him
> who saw similar warning signs.
>
> I'm hoping for someone to dispel these reports based on what people actually
> do in production today. Any info/settings/recommendations based on real
> production environments would be appreciated!
>
> Thanks again,
>
> Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Les Hazlewood <le...@katasoft.com>.
I understand that every environment is different and it always 'depends' :)
 But recommending settings and techniques based on an existing real
production environment (like the user's suggestion to run nodetool repair as
a regular cron job) is always a better starting point for a new Cassandra
evaluator than having to start from scratch.

Ryan, do you have any 'seed' settings that you guys use for nodes at
Twitter?

Are there any resources/write-ups beyond the two I've listed already that
address some of these 'gotchas'?  If those two links are in fact the ideal
starting point, that's fine - but it appears that this may not be the case
however based on the aforementioned user as well as the other who helped him
who saw similar warning signs.

I'm hoping for someone to dispel these reports based on what people actually
do in production today.  Any info/settings/recommendations based on real
production environments would be appreciated!

Thanks again,

Les

Re: 99.999% uptime - Operations Best Practices?

Posted by Ryan King <ry...@twitter.com>.
On Wed, Jun 22, 2011 at 2:24 PM, Les Hazlewood <le...@katasoft.com> wrote:
> I'm planning on using Cassandra as a product's core data store, and it is
> imperative that it never goes down or loses data, even in the event of a
> data center failure.  This uptime requirement ("five nines": 99.999% uptime)
> w/ WAN capabilities is largely what led me to choose Cassandra over other
> NoSQL products, given its history and 'from the ground up' design for such
> operational benefits.
> However, in a recent thread, a user indicated that all 4 of 4 of his
> Cassandra instances were down because the OS killed the Java processes due
> to memory starvation, and all 4 instances went down in a relatively short
> period of time of each other.  Another user helped out and replied that
> running 0.8 and nodetool repair on each node regularly via a cron job (once
> a day?) seems to work for him.
> Naturally this was disconcerting to read, given our needs for a Highly
> Available product - we'd be royally screwed if this ever happened to us.
>  But given Cassandra's history and it's current production use, I'm aware
> that this HA/uptime is being achieved today, and I believe it is certainly
> achievable.
> So, is there a collective set of guidelines or best practices to ensure this
> problem (or unavailability due to OOM) can be easily managed?
> Things like memory settings, initial GC recommendations, cron
> recommendations, ulimit settings, etc. that can be bundled up as a
> best-practices "Production Kickstart"?

Unfortunately most of these are in the category of "it depends".

-ryan

> Could anyone share their nuggets of wisdom or point me to resources where
> this may already exist?
> Thanks!
> Best regards,
> Les
>