Posted to user@cassandra.apache.org by Mike Heffner <mi...@librato.com> on 2016/06/23 14:38:11 UTC

Ring connection timeouts with 2.2.6

Hi,

We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is
sitting at <25% CPU, doing mostly writes, and not showing any particularly
long GC times/pauses. By all observed metrics the ring is healthy and
performing well.

However, we are noticing a pretty consistent number of connection timeouts
coming from the messaging service between various pairs of nodes in the
ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts
per minute, usually between two pairs of nodes. This typically lasts for
several hours at a time, then may stop or move to other pairs of nodes in
the ring. The metric "Connection.SmallMessageDroppedTasks.<ip>" will also
grow for one pair of the nodes in the TotalTimeouts metric.

Looking at the debug log typically shows a large number of messages like
the following on one of the nodes:

StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
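
To see which peers these skipped hints concentrate on, the debug log can be tallied per endpoint. The following is a minimal sketch, not part of the original report: the function name is hypothetical, and the regex assumes the log line format shown above, which may differ between Cassandra versions.

```python
import re
from collections import Counter

# Assumption: the message format matches the line quoted above, e.g.
# "... Skipped writing hint for /172.26.33.177 (ttl 0)".
SKIPPED_HINT = re.compile(r"Skipped writing hint for /(\S+) \(ttl (\d+)\)")


def count_skipped_hints(lines):
    """Return a Counter mapping peer address -> number of skipped-hint lines."""
    counts = Counter()
    for line in lines:
        match = SKIPPED_HINT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_skipped_hints() and calling most_common() on the result lists the peers with the most skipped hints, which can then be compared against the node pairs showing up in the TotalTimeouts metric.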

We have cross node timeouts enabled, but ntp is running on all nodes and no
node appears to have time drift.

The network appears to be fine between nodes, with iperf tests showing that
we have a lot of headroom.

Any thoughts on what to look for? Can we increase thread count/pool sizes
for the messaging service?

Thanks,

Mike

-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: Ring connection timeouts with 2.2.6

Posted by Mike Heffner <mi...@librato.com>.

Garo,

No, we didn't notice any change in system load, just the expected spike in
packet counts.

Mike

On Wed, Jul 20, 2016 at 3:49 PM, Juho Mäkinen <ju...@gmail.com>
wrote:

> Just to pick this up: Did you see any system load spikes? I'm tracing a
> problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the
> normal average load is around 3-4. So far I haven't found any good reason,
> but I'm going to try otc_coalescing_strategy: disabled tomorrow.
>
>  - Garo


-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: Ring connection timeouts with 2.2.6

Posted by Juho Mäkinen <ju...@gmail.com>.

Just to pick this up: Did you see any system load spikes? I'm tracing a
problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the
normal average load is around 3-4. So far I haven't found any good reason,
but I'm going to try otc_coalescing_strategy: disabled tomorrow.

 - Garo

On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner <mi...@librato.com> wrote:

> Just to follow up on this post with a couple more data points:
>
> 1)
>
> We upgraded to 2.2.7 and did not see any change in behavior.
>
> 2)
>
> However, what *has* fixed this issue for us was disabling msg coalescing
> by setting:
>
> otc_coalescing_strategy: DISABLED
>
> We were using the default setting before (time horizon I believe).
>
> We still see periodic timeouts on the ring (once every few hours), but they
> are brief and don't impact latency. With msg coalescing turned on we would
> see these timeouts persist consistently after an initial spike. My guess is
> that something in the coalescing logic is disturbed by the initial timeout
> spike, which leads to dropping all, or a high percentage, of subsequent
> traffic.
>
> We are planning to continue production use with msg coalescing disabled
> for now and may run tests in our staging environments to identify where the
> coalescing is breaking this.
>
> Mike

Re: Ring connection timeouts with 2.2.6

Posted by Mike Heffner <mi...@librato.com>.

Just to follow up on this post with a couple more data points:

1)

We upgraded to 2.2.7 and did not see any change in behavior.

2)

However, what *has* fixed this issue for us was disabling msg coalescing by
setting:

otc_coalescing_strategy: DISABLED

We were using the default setting before (time horizon I believe).
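
When rolling this change out, it can help to confirm what each node's cassandra.yaml actually sets. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, and it assumes the setting appears as an uncommented top-level key, exactly as in the line above. A result of None means the key is absent or commented out, in which case the node falls back to its built-in default (time horizon, per the recollection above).

```python
import re

# Matches an uncommented, top-level "otc_coalescing_strategy: VALUE" line.
# Lines starting with "#" are ignored because the key must begin the line.
_STRATEGY = re.compile(r"^otc_coalescing_strategy:\s*(\S+)", re.MULTILINE)


def coalescing_strategy(yaml_text):
    """Return the configured strategy in upper case, or None if it is not set."""
    match = _STRATEGY.search(yaml_text)
    return match.group(1).upper() if match else None
```

Reading the file's contents on each node and passing them to coalescing_strategy() should return "DISABLED" once the change is in place; the setting is read at startup, so a restart is assumed to be needed for it to take effect.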

We still see periodic timeouts on the ring (once every few hours), but they
are brief and don't impact latency. With msg coalescing turned on we would
see these timeouts persist consistently after an initial spike. My guess is
that something in the coalescing logic is disturbed by the initial timeout
spike, which leads to dropping all, or a high percentage, of subsequent
traffic.

We are planning to continue production use with msg coalescing disabled for
now and may run tests in our staging environments to identify where the
coalescing is breaking this.

Mike

On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <mi...@librato.com> wrote:

> Jeff,
>
> Thanks, yeah we updated to the 2.16.4 driver version from source. I don't
> believe we've hit the bugs mentioned in earlier driver versions.
>
> Mike


-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: Ring connection timeouts with 2.2.6

Posted by Mike Heffner <mi...@librato.com>.

Jeff,

Thanks, yeah we updated to the 2.16.4 driver version from source. I don't
believe we've hit the bugs mentioned in earlier driver versions.

Mike

On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <je...@crowdstrike.com>
wrote:

> The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking driver –
> depending on your instance types / hypervisor choice, you may want to
> ensure you’re not seeing that bug.



-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: Ring connection timeouts with 2.2.6

Posted by Jeff Jirsa <je...@crowdstrike.com>.

The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking driver – depending on your instance types / hypervisor choice, you may want to ensure you’re not seeing that bug.

From: Mike Heffner <mi...@librato.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, July 1, 2016 at 1:10 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Cc: Peter Norton <pc...@librato.com>
Subject: Re: Ring connection timeouts with 2.2.6

Jens, 

We haven't noticed any particularly large GC operations or even persistently high GC times.

Mike

Re: Ring connection timeouts with 2.2.6

Posted by Mike Heffner <mi...@librato.com>.

Jens,

We haven't noticed any particularly large GC operations or even persistently
high GC times.

Mike

On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <je...@tink.se> wrote:

> Hi,
>
> Could it be garbage collection occurring on nodes that are more heavily
> loaded?
>
> Cheers,
> Jens



-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: Ring connection timeouts with 2.2.6

Posted by Jens Rantil <je...@tink.se>.

Hi,

Could it be garbage collection occurring on nodes that are more heavily
loaded?

Cheers,
Jens

On Sun, 26 June 2016 at 05:22, Mike Heffner <mi...@librato.com> wrote:

> One thing to add: if we do a rolling restart of the ring, the timeouts
> disappear entirely for several hours and performance returns to normal.
> It's as if something is leaking over time, but we haven't seen any
> noticeable change in heap.
>
--

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.

Re: Ring connection timeouts with 2.2.6

Posted by Mike Heffner <mi...@librato.com>.

One thing to add: if we do a rolling restart of the ring, the timeouts
disappear entirely for several hours and performance returns to normal.
It's as if something is leaking over time, but we haven't seen any
noticeable change in heap.


-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.