Posted to user@cassandra.apache.org by Nimi Wariboko Jr <ni...@channelmeter.com> on 2016/07/07 01:22:27 UTC

Debugging high tail read latencies (internal timeout)

Hi,

I've begun experiencing very high tail latencies across my clusters. While
Cassandra's internal metrics report <1ms read latencies, measurements from
within the driver in my applications (round trips of query/execute frames)
show 90th-percentile round-trip times of up to a second for very basic
queries (SELECT a,b FROM table WHERE pk=x).

I've been studying the logs to try to get a handle on what could be going
wrong. I don't think there are GC issues, but the logs mention dropped
messages due to timeouts even while the thread pools are nearly empty -

https://gist.github.com/nemothekid/28b2a8e8353b3e60d7bbf390ed17987c

Relevant line:
REQUEST_RESPONSE messages were dropped in last 5000 ms: 1 for internal
timeout and 0 for cross node timeout. Mean internal dropped latency: 54930
ms and Mean cross-node dropped latency: 0 ms

Are there any tools I can use to start to understand what is causing these
issues?

Nimi

Re: Debugging high tail read latencies (internal timeout)

Posted by daemeon reiydelle <da...@gmail.com>.
Hmm. Would you mind looking at your network interfaces (with the appropriate
netstat commands)? If I am right, you will be seeing packet errors, drops,
retries, out-of-window packet receives, etc.

What you may be missing is that you reported zero cross-node DROPPED latency,
not mean LATENCY. Check your netstats. ANY VALUE CHANGE IS BAD (except the
total read/write byte counts). If your network guys say otherwise, escalate
to someone who understands TCP retries and sliding windows.
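
For example, something along these lines on each node (a rough sketch; exact
flags and counter names vary by OS):

    netstat -s | egrep -i 'retrans|drop|error|fail'  # protocol-level counters
    netstat -i                                       # per-interface RX/TX errors and drops
    ss -ti                                           # per-socket TCP state, including retransmits

Sample the counters a few minutes apart; it is the ones that keep climbing
between samples that matter.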



Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872


Re: Debugging high tail read latencies (internal timeout)

Posted by Bryan Cheng <br...@blockcypher.com>.
Hi Nimi,

My suspicions would probably lie somewhere between GC and large partitions.

The first tool would probably be a trace, but if you experience full client
timeouts from dropped messages you may find it hard to pin down the issue. You
can try running the trace with cqlsh's timeouts cranked all the way up,
against the local node with CL=ONE, to try to force the local machine to
answer.
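
For example, something like this run directly on one of the nodes (a sketch
only; the timeout knob is --request-timeout on newer cqlsh, or client_timeout
under [connection] in ~/.cassandra/cqlshrc on older ones, and the keyspace,
table, and pk value below are placeholders for your own):

    $ cqlsh <local_node_ip> --request-timeout=3600
    cqlsh> CONSISTENCY ONE;
    cqlsh> TRACING ON;
    cqlsh> SELECT a, b FROM <keyspace>.<table> WHERE pk = <x>;

The trace output breaks the query down step by step and node by node, which
should show where the time is actually going.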

What does nodetool tpstats report for dropped message counts? Are they very
high? Primarily restricted to READ, or including MUTATION, etc.?
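
For reference, the dropped-message counts appear at the bottom of the
output, after the thread pool table:

    $ nodetool tpstats

i.e. the "Message type / Dropped" section listing READ, MUTATION,
REQUEST_RESPONSE, and so on (the exact set of message types varies by
version).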

Are there specific PKs that trigger this behavior, either all the time or
more consistently? That would point to either very large partition sizes or
potentially bad hardware on a node. cfhistograms will show you various
percentile partition sizes as well as your max.
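
For example (keyspace and table names below are placeholders; on newer
versions the command is nodetool tablehistograms):

    $ nodetool cfhistograms <keyspace> <table>

Look at the Partition Size column: if the 99% or Max values are orders of
magnitude above the median, that is worth chasing.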

GC should be accessible via JMX, and you should also have GCInspector lines
in cassandra/system.log that give you per-collection breakdowns.
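
For a quick look without a JMX client, something like this (the log path is
an assumption; adjust to wherever your system.log lives, and nodetool gcstats
is only available on reasonably recent versions):

    $ nodetool gcstats
    $ grep GCInspector /var/log/cassandra/system.log | tail -20

Long ParNew/CMS or G1 pauses showing up there would point back at GC after
all.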

--Bryan

