Posted to user@cassandra.apache.org by Gediminas Blazys <Ge...@microsoft.com.INVALID> on 2020/10/28 13:20:25 UTC

RE: [EXTERNAL] Re: Running and Managing Large Cassandra Clusters

Hey,

Thanks for chipping in, Tom. Could you describe what sort of workload the big cluster is receiving in terms of local C* reads, writes, and client requests?

You mention repairs, how do you run them?

Gediminas

From: Tom van der Woerdt <to...@booking.com.INVALID>
Sent: Wednesday, October 28, 2020 14:35
To: user <us...@cassandra.apache.org>
Subject: [EXTERNAL] Re: Running and Managing Large Cassandra Clusters

Heya,

We're running version 3.11.7; we can't use 3.11.8 as it won't even start (CASSANDRA-16091). Our policy is to use LCS for everything unless there's a good argument for a different compaction strategy (I don't think we have *any* STCS at all other than system keyspaces). Since our nodes are mostly on-prem they are generally oversized on CPU count, but when idle the cluster with 360 nodes ends up using less than two cores *peak* for background tasks like (full, weekly) repairs and tombstone compactions. That said, they do get 32 logical threads because that's what the hardware ships with (-:
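
For reference, compaction strategy is a per-table setting, so rolling a fleet onto LCS is just a series of ALTER TABLE statements. A minimal sketch with the DataStax Python driver (the contact point, keyspace and table names are placeholders, and sstable_size_in_mb is just the LCS default):

    # Switch one table to LeveledCompactionStrategy.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])   # any reachable contact point
    session = cluster.connect()

    session.execute(
        "ALTER TABLE my_keyspace.my_table "
        "WITH compaction = {'class': 'LeveledCompactionStrategy', "
        "'sstable_size_in_mb': '160'}"
    )

    cluster.shutdown()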

Haven't had major problems with Gossip over the years. I think we've had to run nodetool assassinate exactly once, a few years ago. Probably the only gossip-related annoyance is that when you decommission all seed nodes Cassandra will happily run a single core at 100% trying to connect until you update the list of seeds, but that's really minor.

There's also one cluster that has 50TB nodes, 60 of them, storing reasonably large cells (using LCS, previously TWCS, both fine). Replacing a node takes a few days, but other than that it's not particularly problematic.
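
As a rough back-of-the-envelope check on that replacement time (the streaming rate here is an assumption, not a number from this thread):

    # Streaming 50 TB at an assumed aggregate of 200 MB/s. Cassandra's
    # stock throttle is 200 Mbit/s (~25 MB/s) per sending node, so this
    # assumes several peers streaming in parallel and/or a raised limit.
    data_tb = 50
    throughput_mb_s = 200
    seconds = data_tb * 1024 * 1024 / throughput_mb_s
    print(f"{seconds / 86400:.1f} days")   # ~3.0 days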

In my experience it's the small clusters that wake you up ;-)

Tom van der Woerdt
Senior Site Reliability Engineer
Booking.com BV
Vijzelstraat Amsterdam Netherlands 1017HL


On Wed, Oct 28, 2020 at 12:32 PM Joshua McKenzie <jm...@apache.org> wrote:

A few questions for you Tom if you have 30 seconds and care to disclose:

  1.  What version of C*?
  2.  What compaction strategy?
  3.  What's core count allocated per C* node?
  4.  Does Gossip give you any headaches / do you have to be delicate there, or does it behave itself?
Context: I'm a PMC member / committer and I manage the OSS C* team at DataStax. We're doing a lot of thinking about how to generally improve the operator experience across the board for folks in the post-4.0 time frame, so data like the above (where things are going well at scale and why) is super useful to help feed into that effort.

Thanks!



On Wed, Oct 28, 2020 at 7:14 AM, Tom van der Woerdt <to...@booking.com.invalid> wrote:
Does 360 count? :-)

num_tokens is 16, works fine (had 256 on a 300-node cluster as well, not too many problems either). Roughly 2.5 TB per node, running on-prem on reasonably stable hardware, so replacements end up happening once a week at most, and there's no particular change needed in the automation. Scaling up or down takes a while, but it doesn't appear to be slower than any other cluster. Configuration-wise it's no different from a 5-node cluster either. Pretty uneventful tbh.
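
To put that per-node figure in perspective (the replication factor is assumed to be 3 here; the thread doesn't state it):

    # Rough totals for a 360-node cluster at ~2.5 TB per node.
    nodes, per_node_tb, rf = 360, 2.5, 3
    raw_tb = nodes * per_node_tb      # data actually sitting on disk
    unique_tb = raw_tb / rf           # logical data before replication
    print(f"~{raw_tb:.0f} TB on disk, ~{unique_tb:.0f} TB unique")
    # -> ~900 TB on disk, ~300 TB unique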

Tom van der Woerdt
Senior Site Reliability Engineer
Booking.com BV
Vijzelstraat Amsterdam Netherlands 1017HL


On Wed, Oct 28, 2020 at 8:58 AM Gediminas Blazys <Ge...@microsoft.com.invalid> wrote:
Hello,

I wanted to seek out your opinion and experience.

Have any of you had a chance to run a Cassandra cluster of more than 350 nodes?
What are the major configuration considerations that you had to focus on? What number of vnodes did you use?
Once the cluster was up and running, what would you have done differently?
Perhaps it would be more manageable to run multiple smaller clusters? Did you try this approach? What were the major challenges?

I don’t know if questions like that are allowed here, but I’m really interested in what other folks ran into while running massive operations.

Gediminas


Re: Running and Managing Large Cassandra Clusters

Posted by Tom van der Woerdt <to...@booking.com.INVALID>.
That particular cluster exists for archival purposes, and as such gets a
very low amount of traffic (maybe 50000 queries per minute). So not
particularly helpful to answer your question :-) With that said, we've seen
in other clusters that scalability issues are much more likely to come from
hot partitions, hardware change rate (so basically any change to the token
ring, which we never do concurrently), repairs (though largely mitigated
now that we've switched to num_tokens=16), and connection count (sometimes
I'd consider it advisable to configure drivers to *not* establish a
connection to every node, but bound this and let the Cassandra coordinator
route requests instead).
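
For illustration, one way to bound that fan-out with the DataStax Python driver (assuming that driver; the thread doesn't say which clients are in use, and the addresses and keyspace below are placeholders) is to pin the load-balancing policy to a short list of coordinators:

    # Open connections only to a fixed set of coordinators; they route
    # requests to the rest of the ring on the client's behalf, at the
    # cost of an extra hop for most queries.
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import WhiteListRoundRobinPolicy

    coordinators = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

    profile = ExecutionProfile(
        load_balancing_policy=WhiteListRoundRobinPolicy(coordinators)
    )
    cluster = Cluster(
        contact_points=coordinators,
        execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    )
    session = cluster.connect("my_keyspace")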

The scalability in terms of client requests/reads/writes tends to be pretty
linear with the node count (and size of course), and on clusters that are
slightly smaller we can see this as well, easily doing hundreds of
thousands to a million queries per second.

As for repairs, we have our own tools for this, but it's fairly similar to
what Reaper does: we take all the ranges in the cluster and then schedule
them to be repaired over the course of a week. No manual `nodetool repair`
invocations, but specific single-range repairs.
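
A minimal sketch of that kind of scheduling (placeholder token ranges and naive even pacing; this is not the actual in-house tooling):

    # Spread single-range full repairs across a week by invoking
    # nodetool repair with explicit start/end tokens. Real tooling would
    # derive the ranges from the ring instead of hard-coding them.
    import subprocess
    import time

    ranges = [
        ("-9223372036854775808", "-3074457345618258603"),
        ("-3074457345618258603", "3074457345618258602"),
    ]

    pause = 7 * 24 * 3600 / len(ranges)   # seconds between ranges

    for start, end in ranges:
        subprocess.run(
            ["nodetool", "repair", "-full", "-st", start, "-et", end,
             "my_keyspace"],
            check=True,
        )
        time.sleep(pause)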

Tom van der Woerdt
Senior Site Reliability Engineer

Booking.com BV
Vijzelstraat Amsterdam Netherlands 1017HL

