Posted to user@cassandra.apache.org by Dmitry Simonov <di...@gmail.com> on 2019/06/27 08:51:40 UTC

Bursts of Thrift threads make cluster unresponsive

Hello!

We have run into the following problem several times.

Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
- all CPUs are at 100% load (normally load average is about 5 on a 16-core machine)
- Cassandra's thread count rises from 300 to 1300 - 2000; most of the new
threads are Thrift threads in java.net.SocketInputStream.socketRead0(Native
Method), while the count of other threads doesn't increase
- some Read messages are dropped
- read latency (p99.9) increases to 20 - 30 seconds
- there are up to 32 active Read Tasks and up to 3k - 6k pending Read Tasks

The problem starts simultaneously on all nodes of the cluster.
I cannot tie it to increased load from clients (the read rate doesn't
increase during the problem).
It also doesn't look like a disk problem (I/O latencies are OK).

Could anybody please give some advice on further troubleshooting?
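(For illustration only, a minimal sketch, not from the original report: one way to quantify the observation above is to summarise a thread dump taken with `jstack <pid>` during an incident and count how many Thrift threads are parked in socketRead0. The match on "Thrift" in the thread name is an assumption; check how your Cassandra version names its RPC threads.)

#!/usr/bin/env python3
"""Summarise a HotSpot thread dump (taken with `jstack <pid>`): total
threads, Thrift threads, and Thrift threads sitting in
SocketInputStream.socketRead0."""
import sys
from collections import Counter

def summarise(dump_path):
    counts = Counter()
    current = None                       # name of the thread being read
    with open(dump_path) as f:
        for line in f:
            if line.startswith('"'):     # start of a new thread entry
                current = line.split('"')[1]
                counts["total"] += 1
                if "Thrift" in current:
                    counts["thrift"] += 1
            elif current and "Thrift" in current and "socketRead0" in line:
                counts["thrift_in_socketRead0"] += 1
                current = None           # count each thread at most once
    return counts

if __name__ == "__main__":
    # usage: python summarise_dump.py thread-dump.txt
    print(summarise(sys.argv[1]))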

-- 
Best Regards,
Dmitry Simonov

RE: [EXTERNAL] Re: Bursts of Thrift threads make cluster unresponsive

Posted by "Durity, Sean R" <SE...@homedepot.com>.
This sounds like a bad query or a large partition. If a large partition is requested on multiple nodes (because of the consistency level), it puts pressure on all of those replica nodes. Then, as the cluster tries to handle the rest of the load, the other nodes can get overwhelmed, too.

Look at cfstats to see if you have any large partitions. You may also see them as warnings in the system.log when they are getting compacted.

Also check for any ALLOW FILTERING queries in the code (or slow query stats, if you have them).
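(As a rough illustration of the cfstats check, a sketch only: it assumes the 2.x/3.x `nodetool cfstats` output with "Table:" and "Compacted partition maximum bytes:" fields, and the 100 MB threshold is an arbitrary example.)

#!/usr/bin/env python3
"""Flag tables whose largest compacted partition exceeds a threshold,
based on `nodetool cfstats` output. Adjust the field markers if your
Cassandra version prints them differently."""
import re
import subprocess

THRESHOLD_BYTES = 100 * 1024 * 1024  # flag partitions larger than 100 MB

def large_partitions(threshold=THRESHOLD_BYTES):
    out = subprocess.run(["nodetool", "cfstats"],
                         capture_output=True, text=True, check=True).stdout
    table = None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Table:"):
            table = line.split(":", 1)[1].strip()
        elif line.startswith("Compacted partition maximum bytes:") and table:
            max_bytes = int(re.sub(r"[^\d]", "", line.split(":", 1)[1]))
            if max_bytes > threshold:
                yield table, max_bytes

if __name__ == "__main__":
    for table, size in large_partitions():
        print(f"{table}: max compacted partition {size / 1024 / 1024:.0f} MB")

The related compaction warnings in system.log are governed by compaction_large_partition_warning_threshold_mb in cassandra.yaml (the exact log wording varies by version, but it mentions a large partition).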

Sean


From: Dmitry Simonov <di...@gmail.com>
Sent: Thursday, June 27, 2019 5:22 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Bursts of Thrift threads make cluster unresponsive

> Is there an order in which the events you described happened, or is the order in which you presented them the order in which you noticed things going wrong?

At first, the Thrift thread count starts increasing.
After 2 or 3 minutes those threads consume all CPU cores.
After that, simultaneously: messages are dropped, read latency increases, and active read tasks show up.

On Fri, Jun 28, 2019 at 01:40 Avinash Mandava <av...@vorstella.com> wrote:
Yeah, I skimmed too fast. Don't add more work if the CPU is pegged, and if you're using the Thrift protocol, NTR would not have values.

Is there an order in which the events you described happened, or is the order in which you presented them the order in which you noticed things going wrong?

On Thu, Jun 27, 2019 at 1:29 PM Dmitry Simonov <di...@gmail.com> wrote:
Thanks for your reply!

> Have you tried increasing concurrent reads until you see more activity in disk?
When the problem occurs, the freshly created 1.2k - 2k Thrift threads consume all CPU on all cores.
Would increasing concurrent reads help in this situation?

> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
This metric is 0 on all cluster nodes.

On Fri, Jun 28, 2019 at 00:34 Avinash Mandava <av...@vorstella.com> wrote:
Have you tried increasing concurrent reads until you see more disk activity? If you've always got 32 active reads and a high pending read count, it could just be dropping reads because the queues are saturated. It could be artificially bottlenecking at the C* process level.

Also what does this metric show over time:

org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count



On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <di...@gmail.com> wrote:
Hello!

We have run into the following problem several times.

Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
- all CPUs are at 100% load (normally load average is about 5 on a 16-core machine)
- Cassandra's thread count rises from 300 to 1300 - 2000; most of the new threads are Thrift threads in java.net.SocketInputStream.socketRead0(Native Method), while the count of other threads doesn't increase
- some Read messages are dropped
- read latency (p99.9) increases to 20 - 30 seconds
- there are up to 32 active Read Tasks and up to 3k - 6k pending Read Tasks

The problem starts simultaneously on all nodes of the cluster.
I cannot tie it to increased load from clients (the read rate doesn't increase during the problem).
It also doesn't look like a disk problem (I/O latencies are OK).

Could anybody please give some advice on further troubleshooting?

--
Best Regards,
Dmitry Simonov


--
www.vorstella.com
408 691 8402


--
Best Regards,
Dmitry Simonov


--
www.vorstella.com
408 691 8402


--
Best Regards,
Dmitry Simonov


Re: Bursts of Thrift threads make cluster unresponsive

Posted by Dmitry Simonov <di...@gmail.com>.
> Is there an order in which the events you described happened, or is the
order in which you presented them the order in which you noticed things
going wrong?

At first, the Thrift thread count starts increasing.
After 2 or 3 minutes those threads consume all CPU cores.
After that, simultaneously: messages are dropped, read latency increases,
and active read tasks show up.
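
(For illustration only, a minimal sketch, assuming Linux and a known Cassandra pid: logging the process thread count catches that first phase before the CPUs are pegged, and a threshold here could also be used to trigger a jstack capture automatically.)

#!/usr/bin/env python3
"""Sample the Cassandra process thread count from /proc/<pid>/status every
few seconds so the onset of a Thrift thread burst is recorded."""
import sys
import time

def thread_count(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])

if __name__ == "__main__":
    pid = sys.argv[1]            # usage: python watch_threads.py <cassandra-pid>
    while True:
        print(time.strftime("%H:%M:%S"), thread_count(pid), flush=True)
        time.sleep(5)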

On Fri, Jun 28, 2019 at 01:40 Avinash Mandava <av...@vorstella.com> wrote:

> Yeah, I skimmed too fast. Don't add more work if the CPU is pegged, and if
> you're using the Thrift protocol, NTR would not have values.
>
> Is there an order in which the events you described happened, or is the
> order in which you presented them the order in which you noticed things
> going wrong?
>
> On Thu, Jun 27, 2019 at 1:29 PM Dmitry Simonov <di...@gmail.com>
> wrote:
>
>> Thanks for your reply!
>>
>> > Have you tried increasing concurrent reads until you see more activity
>> in disk?
>> When the problem occurs, the freshly created 1.2k - 2k Thrift threads
>> consume all CPU on all cores.
>> Would increasing concurrent reads help in this situation?
>>
>> >
>> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
>> This metric is 0 on all cluster nodes.
>>
>> On Fri, Jun 28, 2019 at 00:34 Avinash Mandava <av...@vorstella.com> wrote:
>>
>>> Have you tried increasing concurrent reads until you see more disk
>>> activity? If you've always got 32 active reads and a high pending read
>>> count, it could just be dropping reads because the queues are saturated.
>>> It could be artificially bottlenecking at the C* process level.
>>>
>>> Also what does this metric show over time:
>>>
>>>
>>> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
>>>
>>>
>>>
>>> On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <di...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> We have run into the following problem several times.
>>>>
>>>> Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
>>>> - all CPUs are at 100% load (normally load average is about 5 on a
>>>> 16-core machine)
>>>> - Cassandra's thread count rises from 300 to 1300 - 2000; most of the
>>>> new threads are Thrift threads in
>>>> java.net.SocketInputStream.socketRead0(Native Method), while the count
>>>> of other threads doesn't increase
>>>> - some Read messages are dropped
>>>> - read latency (p99.9) increases to 20 - 30 seconds
>>>> - there are up to 32 active Read Tasks and up to 3k - 6k pending Read
>>>> Tasks
>>>>
>>>> The problem starts simultaneously on all nodes of the cluster.
>>>> I cannot tie it to increased load from clients (the read rate doesn't
>>>> increase during the problem).
>>>> It also doesn't look like a disk problem (I/O latencies are OK).
>>>>
>>>> Could anybody please give some advice on further troubleshooting?
>>>>
>>>> --
>>>> Best Regards,
>>>> Dmitry Simonov
>>>>
>>>
>>>
>>> --
>>> www.vorstella.com
>>> 408 691 8402
>>>
>>
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
>
>
> --
> www.vorstella.com
> 408 691 8402
>


-- 
Best Regards,
Dmitry Simonov

Re: Bursts of Thrift threads make cluster unresponsive

Posted by Avinash Mandava <av...@vorstella.com>.
Yeah, I skimmed too fast. Don't add more work if the CPU is pegged, and if
you're using the Thrift protocol, NTR would not have values.

Is there an order in which the events you described happened, or is the
order in which you presented them the order in which you noticed things
going wrong?
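
(For illustration only, a sketch, not from the thread: since the native-transport pool stays empty under Thrift, one rough proxy is to count established client connections on the Thrift rpc_port, 9160 by default and an assumption here. With the default sync rpc_server_type, each connection is served by its own thread, so a burst of new connections should line up with the burst of Thrift threads.)

#!/usr/bin/env python3
"""Count established TCP connections on the Thrift rpc_port by parsing
`ss -tn` output (Linux)."""
import subprocess

RPC_PORT = 9160  # default Thrift rpc_port; adjust to your cassandra.yaml

def thrift_connections(port=RPC_PORT):
    out = subprocess.run(["ss", "-tn"], capture_output=True,
                         text=True, check=True).stdout
    count = 0
    for line in out.splitlines():
        parts = line.split()
        # columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        if len(parts) >= 5 and parts[0] == "ESTAB" and parts[3].endswith(f":{port}"):
            count += 1
    return count

if __name__ == "__main__":
    print(f"established connections on port {RPC_PORT}: {thrift_connections()}")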

On Thu, Jun 27, 2019 at 1:29 PM Dmitry Simonov <di...@gmail.com>
wrote:

> Thanks for your reply!
>
> > Have you tried increasing concurrent reads until you see more activity
> in disk?
> When the problem occurs, the freshly created 1.2k - 2k Thrift threads
> consume all CPU on all cores.
> Would increasing concurrent reads help in this situation?
>
> >
> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
> This metric is 0 on all cluster nodes.
>
> On Fri, Jun 28, 2019 at 00:34 Avinash Mandava <av...@vorstella.com> wrote:
>
>> Have you tried increasing concurrent reads until you see more disk
>> activity? If you've always got 32 active reads and a high pending read
>> count, it could just be dropping reads because the queues are saturated.
>> It could be artificially bottlenecking at the C* process level.
>>
>> Also what does this metric show over time:
>>
>>
>> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
>>
>>
>>
>> On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <di...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> We have run into the following problem several times.
>>>
>>> Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
>>> - all CPUs are at 100% load (normally load average is about 5 on a
>>> 16-core machine)
>>> - Cassandra's thread count rises from 300 to 1300 - 2000; most of the
>>> new threads are Thrift threads in
>>> java.net.SocketInputStream.socketRead0(Native Method), while the count
>>> of other threads doesn't increase
>>> - some Read messages are dropped
>>> - read latency (p99.9) increases to 20 - 30 seconds
>>> - there are up to 32 active Read Tasks and up to 3k - 6k pending Read
>>> Tasks
>>>
>>> The problem starts simultaneously on all nodes of the cluster.
>>> I cannot tie it to increased load from clients (the read rate doesn't
>>> increase during the problem).
>>> It also doesn't look like a disk problem (I/O latencies are OK).
>>>
>>> Could anybody please give some advice on further troubleshooting?
>>>
>>> --
>>> Best Regards,
>>> Dmitry Simonov
>>>
>>
>>
>> --
>> www.vorstella.com
>> 408 691 8402
>>
>
>
> --
> Best Regards,
> Dmitry Simonov
>


-- 
www.vorstella.com
408 691 8402

Re: Bursts of Thrift threads make cluster unresponsive

Posted by Dmitry Simonov <di...@gmail.com>.
Thanks for your reply!

> Have you tried increasing concurrent reads until you see more activity in
disk?
When the problem occurs, the freshly created 1.2k - 2k Thrift threads
consume all CPU on all cores.
Would increasing concurrent reads help in this situation?

>
org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
This metric is 0 on all cluster nodes.

On Fri, Jun 28, 2019 at 00:34 Avinash Mandava <av...@vorstella.com> wrote:

> Have you tried increasing concurrent reads until you see more disk
> activity? If you've always got 32 active reads and a high pending read
> count, it could just be dropping reads because the queues are saturated.
> It could be artificially bottlenecking at the C* process level.
>
> Also what does this metric show over time:
>
>
> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
>
>
>
> On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <di...@gmail.com>
> wrote:
>
>> Hello!
>>
>> We have run into the following problem several times.
>>
>> Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
>> - all CPUs are at 100% load (normally load average is about 5 on a
>> 16-core machine)
>> - Cassandra's thread count rises from 300 to 1300 - 2000; most of the
>> new threads are Thrift threads in
>> java.net.SocketInputStream.socketRead0(Native Method), while the count
>> of other threads doesn't increase
>> - some Read messages are dropped
>> - read latency (p99.9) increases to 20 - 30 seconds
>> - there are up to 32 active Read Tasks and up to 3k - 6k pending Read
>> Tasks
>>
>> The problem starts simultaneously on all nodes of the cluster.
>> I cannot tie it to increased load from clients (the read rate doesn't
>> increase during the problem).
>> It also doesn't look like a disk problem (I/O latencies are OK).
>>
>> Could anybody please give some advice on further troubleshooting?
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
>
>
> --
> www.vorstella.com
> 408 691 8402
>


-- 
Best Regards,
Dmitry Simonov

Re: Bursts of Thrift threads make cluster unresponsive

Posted by Avinash Mandava <av...@vorstella.com>.
Have you tried increasing concurrent reads until you see more disk
activity? If you've always got 32 active reads and a high pending read
count, it could just be dropping reads because the queues are saturated.
It could be artificially bottlenecking at the C* process level.

Also what does this metric show over time:

org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
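
(For illustration only, a minimal sketch of sampling that pool from `nodetool tpstats` rather than JMX; it assumes the 2.x/3.x column layout "Pool Name / Active / Pending / Completed / Blocked / All time blocked", so verify against your version.)

#!/usr/bin/env python3
"""Periodically print the Native-Transport-Requests row from
`nodetool tpstats` to watch whether blocked counts grow over time."""
import subprocess
import time

def ntr_row():
    out = subprocess.run(["nodetool", "tpstats"], capture_output=True,
                         text=True, check=True).stdout
    for line in out.splitlines():
        parts = line.split()
        if parts and parts[0] == "Native-Transport-Requests":
            return {"active": int(parts[1]), "pending": int(parts[2]),
                    "blocked": int(parts[4]), "all_time_blocked": int(parts[5])}
    return None

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), ntr_row(), flush=True)
        time.sleep(10)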



On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <di...@gmail.com>
wrote:

> Hello!
>
> We have run into the following problem several times.
>
> Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
> - all CPUs are at 100% load (normally load average is about 5 on a
> 16-core machine)
> - Cassandra's thread count rises from 300 to 1300 - 2000; most of the
> new threads are Thrift threads in
> java.net.SocketInputStream.socketRead0(Native Method), while the count
> of other threads doesn't increase
> - some Read messages are dropped
> - read latency (p99.9) increases to 20 - 30 seconds
> - there are up to 32 active Read Tasks and up to 3k - 6k pending Read
> Tasks
>
> The problem starts simultaneously on all nodes of the cluster.
> I cannot tie it to increased load from clients (the read rate doesn't
> increase during the problem).
> It also doesn't look like a disk problem (I/O latencies are OK).
>
> Could anybody please give some advice on further troubleshooting?
>
> --
> Best Regards,
> Dmitry Simonov
>


-- 
www.vorstella.com
408 691 8402