Posted to user@cassandra.apache.org by Paul Pollack <pa...@klaviyo.com> on 2017/09/11 23:48:38 UTC

Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

Hi,

We run a 48-node cluster that stores counts in wide rows. Each node uses
roughly 1TB of space on a 2TB EBS gp2 drive for its data directory and runs
LeveledCompactionStrategy. We have been trying to bootstrap new nodes that
use a RAID 0 configuration over two 1TB EBS drives to raise the I/O
throughput cap from 160 MB/s to 250 MB/s (AWS limits). Every time a node
finishes streaming it is bombarded by a large number of compactions. We see
CPU load on the new node spike extremely high and CPU load on all the other
nodes in the cluster drop unreasonably low. Meanwhile our app's write
latency to this cluster averages 10 seconds or greater. We've already tried
throttling compaction throughput to 1 MB/s, and we've always had
concurrent_compactors set to 2, but the disk is still saturated. In every
case we have had to shut down the Cassandra process on the new node to
resume acceptable operations.

We're currently upgrading all of our clients to the 3.11.0 version of the
DataStax Python driver, which will let us put the next newly bootstrapped
node on a blacklist. The hope is that if it doesn't receive client
requests, the rest of the cluster can serve them adequately (as is the case
whenever we turn down the bootstrapping node) while it finishes its
compactions.
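
For concreteness, the blacklisting we have in mind would look roughly like
the sketch below, using the driver's HostFilterPolicy (assuming that policy
is available in the driver version we land on; the addresses and keyspace
name are placeholders):

    from cassandra.cluster import Cluster
    from cassandra.policies import HostFilterPolicy, RoundRobinPolicy

    # Hypothetical address of the still-compacting node we want clients to avoid.
    BLACKLIST = {"10.0.0.42"}

    # Wrap the usual load balancing policy so the driver never picks the
    # blacklisted host as a coordinator. Note this only affects client
    # routing; the node still receives replica writes internally.
    policy = HostFilterPolicy(
        RoundRobinPolicy(),
        lambda host: host.address not in BLACKLIST,
    )

    cluster = Cluster(contact_points=["10.0.0.1"], load_balancing_policy=policy)
    session = cluster.connect("counts")  # placeholder keyspace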

We were also interested in hearing if anyone has had much luck using the
sstableofflinerelevel tool, and if this is a reasonable approach for our
issue.

One of my colleagues found a post describing a similar issue in which the
bloom filters had an extremely high false positive ratio. I didn't check
that during any of these bootstrap attempts, but with that many pending
compactions it seems likely we would observe the same thing.
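
If we do attempt another bootstrap, something like the sketch below is how
we'd spot-check that (assumes nodetool is on the PATH; the keyspace/table
name is a placeholder):

    import subprocess

    # Placeholder table; in practice we'd loop over our counter tables.
    out = subprocess.check_output(
        ["nodetool", "tablestats", "counts.daily_counts"],
        universal_newlines=True,
    )

    # tablestats prints a "Bloom filter false ratio" line for each table.
    for line in out.splitlines():
        if "Bloom filter false" in line:
            print(line.strip())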

Would appreciate any guidance anyone can offer.

Thanks,
Paul

Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

Posted by Paul Pollack <pa...@klaviyo.com>.
Thanks again guys. This has been a major blocker for us, and I think we've
made real progress with your advice.

We have gone ahead with Lerh's suggestion, and the cluster is operating
much more smoothly while the new node compacts. We read at quorum, so even
if we don't finish within the hinted handoff window, at least reads won't
return inconsistent data.

Kurt - What we've been observing is that after the node finishes getting
data streamed to it from the other nodes, it goes into state UN and only
then starts its compactions; in this case it has about 130 pending. While
it's still joining we don't see an I/O bottleneck. I think the reason this
is an issue for us is that our nodes generally are not OK: they're
constantly maxing out their disk throughput and have long queues, which is
why we're trying to increase capacity by both adding nodes and switching to
RAIDed disks. Under normal operating circumstances they're pushed to their
limits, so when a node gets backed up on compactions it really is enough to
tip over the cluster.

That's helpful to know regarding sstableofflinerelevel; in my dry run it
did appear that it would shuffle even more SSTables into L0.

On Mon, Sep 11, 2017 at 11:50 PM, kurt greaves <ku...@instaclustr.com> wrote:

>
>> Kurt - We're on 3.7, and our approach was to try throttling compaction
>> throughput as much as possible rather than the opposite. I had found some
>> resources that suggested unthrottling to let it get it over with, but
>> wasn't sure if this would really help in our situation since the I/O pipe
>> was already fully saturated.
>>
>
> You should unthrottle during bootstrap as the node won't receive read
> queries until it finishes streaming and joins the cluster. It seems
> unlikely that you'd be bottlenecked on I/O during the bootstrapping
> process. If you were, you'd certainly have bigger problems. The aim is to
> clear out the majority of compactions *before* the node joins and starts
> servicing reads. You might also want to increase concurrent_compactors;
> typical advice is to match the number of CPU cores, but you might want to
> go higher just for the bootstrapping period.
>
> sstableofflinerelevel could help but I wouldn't count on it. Usage is
> pretty straightforward but you may find that a lot of the existing SSTables
> in L0 just get put back in L0 anyways, which is where the main compaction
> backlog comes from. Plus you have to take the node offline which may not be
> ideal. In this case I would suggest the strategy Lerh suggested as being
> more viable.
>
> Regardless, if the rest of your nodes are OK (and you're not running RF=1
> or querying at CL=ALL), Cassandra should route around the slow node pretty
> effectively, so a single node backed up on compactions shouldn't be a big
> deal.
>

Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

Posted by kurt greaves <ku...@instaclustr.com>.
>
>
> Kurt - We're on 3.7, and our approach was to try throttling compaction
> throughput as much as possible rather than the opposite. I had found some
> resources that suggested unthrottling to let it get it over with, but
> wasn't sure if this would really help in our situation since the I/O pipe
> was already fully saturated.
>

You should unthrottle during bootstrap as the node won't receive read
queries until it finishes streaming and joins the cluster. It seems
unlikely that you'd be bottlenecked on I/O during the bootstrapping
process. If you were, you'd certainly have bigger problems. The aim is to
clear out the majority of compactions *before* the node joins and starts
servicing reads. You might also want to increase concurrent_compactors;
typical advice is to match the number of CPU cores, but you might want to
go higher just for the bootstrapping period.
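
Concretely, on the bootstrapping node that amounts to something like this
(just a sketch, assuming nodetool is on the PATH; concurrent_compactors
itself is a cassandra.yaml setting):

    import subprocess

    def nodetool(*args):
        # Run a nodetool command against the local (bootstrapping) node.
        subprocess.check_call(["nodetool", *args])

    # Remove the compaction throughput cap while the node is still streaming
    # and not yet serving reads (0 means unthrottled).
    nodetool("setcompactionthroughput", "0")

    # After it has joined and the backlog is gone, restore the normal cap
    # (16 MB/s is the cassandra.yaml default).
    nodetool("setcompactionthroughput", "16")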

sstableofflinerelevel could help but I wouldn't count on it. Usage is
pretty straightforward but you may find that a lot of the existing SSTables
in L0 just get put back in L0 anyways, which is where the main compaction
backlog comes from. Plus you have to take the node offline which may not be
ideal. In this case I would suggest the strategy Lerh suggested as being
more viable.

Regardless, if the rest of your nodes are OK (and you're not running RF=1
or querying at CL=ALL), Cassandra should route around the slow node pretty
effectively, so a single node backed up on compactions shouldn't be a big
deal.

Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

Posted by Paul Pollack <pa...@klaviyo.com>.
Thanks for the responses Lerh and Kurt!

Lerh - We had been considering those particular nodetool commands but were
hesitant to perform them on a production node without either testing
adequately in a dev environment or getting some feedback from someone who
knew what they were doing (such as yourself), so thank you for that! Your
point about the blacklist makes complete sense. So I think we'll probably
end up running those after the node finishes streaming and we confirm that
the blacklist is not improving latency. Just out of curiosity, do you have
any experience with sstableofflinerelevel? Is this something that would be
helpful to run with any kind of regularity?

Kurt - We're on 3.7, and our approach was to try throttling compaction
throughput as much as possible rather than the opposite. I had found some
resources that suggested unthrottling to let it get it over with, but
wasn't sure if this would really help in our situation since the I/O pipe
was already fully saturated.

Best,
Paul

On Mon, Sep 11, 2017 at 9:16 PM, kurt greaves <ku...@instaclustr.com> wrote:

> What version are you using? There are improvements to streaming with LCS
> in 2.2.
> Also, are you unthrottling compaction throughput while the node is
> bootstrapping?
>

Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

Posted by kurt greaves <ku...@instaclustr.com>.
What version are you using? There are improvements to streaming with LCS in
2.2.
Also, are you unthrottling compaction throughput while the node is
bootstrapping?

Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

Posted by Lerh Chuan Low <le...@instaclustr.com>.
Hi Paul,

The new node will certainly have a lot of compactions to deal with since
it is on LCS. Have you tried performing the following on the new node once
it has joined?

*nodetool disablebinary && nodetool disablethrift && nodetool disablegossip*

This will disconnect Cassandra from the cluster, but not stop Cassandra
itself. At this point you can unthrottle compactions and let it compact
away. When it is done compacting, you can re-add it to the cluster and run
a repair if it has been more than 3 hours (the default hint window). I
don't think adding a blacklist will help much, because as long as the data
you insert still replicates to the (slow) node, it will slow down the whole
cluster.

As long as you have that node in the cluster, it will slow down everything.
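
Put together, the disconnect / unthrottle / rejoin sequence on the new node
is roughly the sketch below (the threshold and sleep are arbitrary; adjust
to taste):

    import subprocess
    import time

    def nodetool(*args):
        return subprocess.check_output(["nodetool", *args], universal_newlines=True)

    # Take the node out of client traffic and gossip, but leave Cassandra up.
    for cmd in ("disablebinary", "disablethrift", "disablegossip"):
        nodetool(cmd)

    # Unthrottle and let it chew through the compaction backlog.
    nodetool("setcompactionthroughput", "0")

    # Wait for the pending compaction count to drop to something reasonable.
    while True:
        stats = nodetool("compactionstats")
        pending = int(stats.split("pending tasks:")[1].split()[0])
        if pending <= 5:  # arbitrary threshold
            break
        time.sleep(60)

    # Rejoin the cluster, re-throttle, then repair if past the hint window.
    for cmd in ("enablegossip", "enablethrift", "enablebinary"):
        nodetool(cmd)
    nodetool("setcompactionthroughput", "16")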

Hope this helps you in some way :)

On 12 September 2017 at 09:48, Paul Pollack <pa...@klaviyo.com>
wrote:

> Hi,
>
> We run a 48-node cluster that stores counts in wide rows. Each node uses
> roughly 1TB of space on a 2TB EBS gp2 drive for its data directory and runs
> LeveledCompactionStrategy. We have been trying to bootstrap new nodes that
> use a RAID 0 configuration over two 1TB EBS drives to raise the I/O
> throughput cap from 160 MB/s to 250 MB/s (AWS limits). Every time a node
> finishes streaming it is bombarded by a large number of compactions. We see
> CPU load on the new node spike extremely high and CPU load on all the other
> nodes in the cluster drop unreasonably low. Meanwhile our app's write
> latency to this cluster averages 10 seconds or greater. We've already tried
> throttling compaction throughput to 1 MB/s, and we've always had
> concurrent_compactors set to 2, but the disk is still saturated. In every
> case we have had to shut down the Cassandra process on the new node to
> resume acceptable operations.
>
> We're currently upgrading all of our clients to use the 3.11.0 version of
> the DataStax Python driver, which will allow us to add our next newly
> bootstrapped node to a blacklist, hoping that if it doesn't accept writes
> the rest of the cluster can serve them adequately (as is the case whenever
> we turn down the bootstrapping node), and allow it to finish its
> compactions.
>
> We were also interested in hearing if anyone has had much luck using the
> sstableofflinerelevel tool, and if this is a reasonable approach for our
> issue.
>
> One of my colleagues found a post where a user had a similar issue and
> found that bloom filters had an extremely high false positive ratio, and
> although I didn't check that during any of these attempts to bootstrap it
> seems to me like if we have that many compactions to do we're likely to
> observe that same thing.
>
> Would appreciate any guidance anyone can offer.
>
> Thanks,
> Paul
>

Re: Bootstrapping node on Cassandra 3.7 causes cluster-wide performance issues

Posted by Lerh Chuan Low <le...@instaclustr.com>.
Hi Paul,

Agh, I don't have any experience with sstableofflinerelevel. Maybe Kurt
does, sorry.

Also, in case it wasn't obvious, to add the node back to the cluster once
it is done, run the same 3 commands with enable substituted for disable. It
feels like it will take some time to get through all the compactions,
likely longer than the hinted handoff window, so do make sure you are
querying Cassandra with strong consistency after you rejoin the node. Good
luck!
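
With the Python driver you are already upgrading to, enforcing that is just
something like this (a sketch; the contact point and keyspace are
placeholders):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])      # placeholder contact point
    session = cluster.connect("counts")  # placeholder keyspace

    # Read and write at QUORUM so a rejoined replica that missed expired
    # hints cannot satisfy a read on its own.
    session.default_consistency_level = ConsistencyLevel.QUORUM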

Lerh

On 12 September 2017 at 11:53, Aaron Wykoff <in...@gmail.com>
wrote:

> Unsubscribe
>
> On Mon, Sep 11, 2017 at 4:48 PM, Paul Pollack <pa...@klaviyo.com>
> wrote:
>
>> Hi,
>>
>> We run a 48-node cluster that stores counts in wide rows. Each node uses
>> roughly 1TB of space on a 2TB EBS gp2 drive for its data directory and
>> runs LeveledCompactionStrategy. We have been trying to bootstrap new nodes
>> that use a RAID 0 configuration over two 1TB EBS drives to raise the I/O
>> throughput cap from 160 MB/s to 250 MB/s (AWS limits). Every time a node
>> finishes streaming it is bombarded by a large number of compactions. We
>> see CPU load on the new node spike extremely high and CPU load on all the
>> other nodes in the cluster drop unreasonably low. Meanwhile our app's
>> write latency to this cluster averages 10 seconds or greater. We've
>> already tried throttling compaction throughput to 1 MB/s, and we've always
>> had concurrent_compactors set to 2, but the disk is still saturated. In
>> every case we have had to shut down the Cassandra process on the new node
>> to resume acceptable operations.
>>
>> We're currently upgrading all of our clients to use the 3.11.0 version of
>> the DataStax Python driver, which will allow us to add our next newly
>> bootstrapped node to a blacklist, hoping that if it doesn't accept writes
>> the rest of the cluster can serve them adequately (as is the case whenever
>> we turn down the bootstrapping node), and allow it to finish its
>> compactions.
>>
>> We were also interested in hearing if anyone has had much luck using the
>> sstableofflinerelevel tool, and if this is a reasonable approach for our
>> issue.
>>
>> One of my colleagues found a post where a user had a similar issue and
>> found that bloom filters had an extremely high false positive ratio, and
>> although I didn't check that during any of these attempts to bootstrap it
>> seems to me like if we have that many compactions to do we're likely to
>> observe that same thing.
>>
>> Would appreciate any guidance anyone can offer.
>>
>> Thanks,
>> Paul
>>
>
>
