Posted to user@cassandra.apache.org by Peter Haggerty <pe...@librato.com> on 2014/10/29 01:08:12 UTC

2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

On a 3 node test cluster we recently upgraded one node from 2.0.10 to
2.0.11. This is a cluster that had been happily running 2.0.10 for
weeks and that has very little load and very capable hardware. The
upgrade was just your typical package upgrade:

$ dpkg -s cassandra | egrep '^Ver|^Main'
Maintainer: Eric Evans <ee...@apache.org>
Version: 2.0.11

Immediately after starting, it ran a couple of ParNews and then started
executing CMS runs. In 10 minutes the node had become unreachable and
was marked as down by the two other nodes in the ring, which are still
2.0.10.

We have jstack output and the server logs but nothing seems to be
jumping out. Has anyone else run into this? What should we be looking
for?


Peter
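For anyone trying to quantify the same symptom, counting CMS events in the GC log is a quick first check. The log lines below are illustrative HotSpot-style samples, not real output from this node:

```shell
# Write a small sample GC log (format is illustrative of HotSpot's
# -XX:+PrintGCDetails output: ParNew = young-gen, CMS = old-gen cycles).
cat > /tmp/gc-sample.log <<'EOF'
2014-10-29T01:09:01.123+0000: 12.345: [GC [ParNew: 839680K->20000K(943744K), 0.0412 secs]
2014-10-29T01:09:05.456+0000: 16.678: [GC [1 CMS-initial-mark: 512000K(7340032K)] 0.0123 secs]
2014-10-29T01:09:09.789+0000: 20.999: [CMS-concurrent-mark: 1.234/1.234 secs]
EOF
# Count CMS-related lines; a count that climbs rapidly right after startup
# matches the "GC storm" behaviour described above.
grep -c 'CMS' /tmp/gc-sample.log
```

On a healthy node this count should stay near zero between startups; back-to-back CMS cycles minutes after boot point at heap exhaustion rather than normal tenuring.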

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Dec 29, 2014 at 3:24 PM, mck <mi...@apache.org> wrote:

>
> Especially in CASSANDRA-6285 i see some scary stuff went down.
>
> But there are no outstanding bugs that we know of, are there?
>

Right, the question is whether you believe that 6285 has actually been
fully resolved.

It's relatively plausible that it finally was, which is why I describe my
feelings about the HSHA "corrupter" implementation as FUD. The huge
mistake was rewriting "hsha" in place, despite it being one of the rare
pluggable interfaces, thereby breaking existing users. If it had been
called "hsha2" or something, I'd have a lot less FUD about it... because
people would not have hit corruption on upgrade, which I view as Super Bad.

IMO, probably the only people who should use HSHA are people who have a
real need for it, specifically people with huge numbers of client threads
they can't reduce.

=Rob

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by mck <mi...@apache.org>.
> Perf is better, correctness seems less so. I value latter more than
> former.


Yeah no doubt.
Especially in CASSANDRA-6285 i see some scary stuff went down.

But there are no outstanding bugs that we know of, are there?
 (CASSANDRA-6815 remains just a wrap-up of how the options are to be
 presented in cassandra.yaml?)

~mck

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Dec 29, 2014 at 2:03 PM, mck <mi...@apache.org> wrote:

> We saw an improvement when we switched to HSHA, particularly for our
> offline (hadoop/spark) nodes.
> Sorry i don't have the data anymore to support that statement, although
> i can say that improvement paled in comparison to cross_node_timeout
> which we enabled shortly afterwards.
>

Perf is better, correctness seems less so. I value the latter more than the former.

=Rob

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by mck <mi...@apache.org>.
> > Should I stick to 2048 or try
> > with something closer to 128 or even something else ?


2048 worked fine for us.


> > About HSHA,
> 
> I anti-recommend hsha, serious apparently unresolved problems exist with
> it.


We saw an improvement when we switched to HSHA, particularly for our
offline (hadoop/spark) nodes.
Sorry i don't have the data anymore to support that statement, although
i can say that improvement paled in comparison to cross_node_timeout
which we enabled shortly afterwards.

~mck
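For reference, the cross_node_timeout change mck mentions is a one-line cassandra.yaml toggle; note that it relies on the coordinator's timestamps, so it assumes clocks are synchronized (e.g. via NTP) across the cluster:

```yaml
# cassandra.yaml: drop requests that have already timed out by the time a
# replica receives them, judged from the coordinator's timestamp.
# Requires synchronized clocks on all nodes.
cross_node_timeout: true
```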

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Dec 29, 2014 at 2:29 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> Sorry about the gravedigging, but what would be a good start value to tune
> "rpc_max_threads" ?
>

Depends on whether you prefer that clients get a slow thread or none.


> I mean, default is unlimited, the value commented is 2048. Native protocol
> seems to only allow 128 simultaneous threads. Should I stick to 2048 or try
> with something closer to 128 or even something else ?
>

Probably closer to 2048 than unlimited.


> About HSHA, I have tried this mode from time to time since C* 0.8 and
> always faced the "ERROR 12:02:18,971 Read an invalid frame size of 0. Are
> you using TFramedTransport on the client side?" error. I haven't tried for
> a while (1 year maybe), has this been fixed, or is this due to my
> configuration somehow?
>

I anti-recommend hsha, serious apparently unresolved problems exist with
it. I understand this is FUD, but fool me once shame on you/fool me twice
shame on me.

=Rob
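Concretely, "closer to 2048 than unlimited" means uncommenting the capped value rather than leaving it unset. A minimal sketch against a toy copy of the config (the file path and the sed-based edit are illustrative only, not a recommended workflow):

```shell
# Toy fragment mimicking the stock cassandra.yaml, where rpc_max_threads
# ships commented out (i.e. unlimited).
cat > /tmp/cassandra-sample.yaml <<'EOF'
rpc_server_type: hsha
# rpc_max_threads: 2048
EOF
# Uncomment the line so the Thrift thread pool is capped at 2048 instead
# of unlimited, then show the effective setting.
sed 's/^# rpc_max_threads: 2048/rpc_max_threads: 2048/' \
  /tmp/cassandra-sample.yaml > /tmp/cassandra-fixed.yaml
grep '^rpc_max_threads' /tmp/cassandra-fixed.yaml
```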

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi,

Sorry about the gravedigging, but what would be a good start value to tune
"rpc_max_threads"?

I mean, the default is unlimited and the commented value is 2048. The native
protocol seems to only allow 128 simultaneous threads. Should I stick to 2048,
or try something closer to 128, or even something else?

About HSHA, I have tried this mode from time to time since C* 0.8 and
always faced the "ERROR 12:02:18,971 Read an invalid frame size of 0. Are
you using TFramedTransport on the client side?" error. I haven't tried for
a while (1 year maybe), has this been fixed, or is this due to my
configuration somehow?

C*heers

Alain

2014-10-29 16:07 GMT+01:00 Peter Haggerty <pe...@librato.com>:

> That definitely appears to be the issue. Thanks for pointing that out!
>
> https://issues.apache.org/jira/browse/CASSANDRA-8116
> It looks like 2.0.12 will check for the default and throw an exception
> (thanks Mike Adamson) and also includes a bit more text in the config
> file but I'm thinking that 2.0.12 should be pushed out sooner rather
> than later as anyone using hsha and the default settings will simply
> have their cluster stop working a few minutes after the upgrade and
> without any indication of the actual problem.
>
>
> Peter
>
>
> On Wed, Oct 29, 2014 at 5:23 AM, Duncan Sands <du...@gmail.com>
> wrote:
> > Hi Peter, are you using the hsha RPC server type on this node?  If you
> are,
> > then it looks like rpc_max_threads threads will be allocated on startup
> in
> > 2.0.11 while this wasn't the case before.  This can exhaust your heap if
> the
> > value of rpc_max_threads is too large (eg if you use the default).
> >
> > Ciao, Duncan.
> >
> >
> > On 29/10/14 01:08, Peter Haggerty wrote:
> >>
> >> On a 3 node test cluster we recently upgraded one node from 2.0.10 to
> >> 2.0.11. This is a cluster that had been happily running 2.0.10 for
> >> weeks and that has very little load and very capable hardware. The
> >> upgrade was just your typical package upgrade:
> >>
> >> $ dpkg -s cassandra | egrep '^Ver|^Main'
> >> Maintainer: Eric Evans <ee...@apache.org>
> >> Version: 2.0.11
> >>
> >> Immediately after started it ran a couple of ParNews and then started
> >> executing CMS runs. In 10 minutes the node had become unreachable and
> >> was marked as down by the two other nodes in the ring, which are still
> >> 2.0.10.
> >>
> >> We have jstack output and the server logs but nothing seems to be
> >> jumping out. Has anyone else run into this? What should we be looking
> >> for?
> >>
> >>
> >> Peter
> >>
> >
>

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by Peter Haggerty <pe...@librato.com>.
That definitely appears to be the issue. Thanks for pointing that out!

https://issues.apache.org/jira/browse/CASSANDRA-8116
It looks like 2.0.12 will check for the default and throw an exception
(thanks Mike Adamson), and also includes a bit more text in the config
file, but I'm thinking that 2.0.12 should be pushed out sooner rather
than later: anyone using hsha with the default settings will simply
have their cluster stop working a few minutes after the upgrade,
without any indication of the actual problem.


Peter


On Wed, Oct 29, 2014 at 5:23 AM, Duncan Sands <du...@gmail.com> wrote:
> Hi Peter, are you using the hsha RPC server type on this node?  If you are,
> then it looks like rpc_max_threads threads will be allocated on startup in
> 2.0.11 while this wasn't the case before.  This can exhaust your heap if the
> value of rpc_max_threads is too large (eg if you use the default).
>
> Ciao, Duncan.
>
>
> On 29/10/14 01:08, Peter Haggerty wrote:
>>
>> On a 3 node test cluster we recently upgraded one node from 2.0.10 to
>> 2.0.11. This is a cluster that had been happily running 2.0.10 for
>> weeks and that has very little load and very capable hardware. The
>> upgrade was just your typical package upgrade:
>>
>> $ dpkg -s cassandra | egrep '^Ver|^Main'
>> Maintainer: Eric Evans <ee...@apache.org>
>> Version: 2.0.11
>>
>> Immediately after started it ran a couple of ParNews and then started
>> executing CMS runs. In 10 minutes the node had become unreachable and
>> was marked as down by the two other nodes in the ring, which are still
>> 2.0.10.
>>
>> We have jstack output and the server logs but nothing seems to be
>> jumping out. Has anyone else run into this? What should we be looking
>> for?
>>
>>
>> Peter
>>
>
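A hypothetical pre-upgrade sanity check in the same spirit as the 2.0.12 validation Peter describes: flag configs that combine hsha with an unset (unlimited) rpc_max_threads. The check_yaml helper and file paths are made up for illustration:

```shell
# Flag the dangerous combination Duncan diagnosed: rpc_server_type hsha
# together with rpc_max_threads left commented out (unlimited).
check_yaml() {
  local conf="$1"
  if grep -q '^rpc_server_type: hsha' "$conf" && \
     ! grep -q '^rpc_max_threads:' "$conf"; then
    echo "UNSAFE: hsha with unlimited rpc_max_threads"
  else
    echo "OK"
  fi
}

# Example config with the default (commented-out) setting.
cat > /tmp/conf-a.yaml <<'EOF'
rpc_server_type: hsha
# rpc_max_threads: 2048
EOF
check_yaml /tmp/conf-a.yaml
```

Run before upgrading each node; anything reported UNSAFE should get an explicit rpc_max_threads cap first.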

Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm

Posted by Duncan Sands <du...@gmail.com>.
Hi Peter, are you using the hsha RPC server type on this node?  If you are, then
it looks like rpc_max_threads threads will be allocated on startup in 2.0.11,
while this wasn't the case before.  This can exhaust your heap if the value of
rpc_max_threads is too large (e.g. if you use the default).

Ciao, Duncan.

On 29/10/14 01:08, Peter Haggerty wrote:
> On a 3 node test cluster we recently upgraded one node from 2.0.10 to
> 2.0.11. This is a cluster that had been happily running 2.0.10 for
> weeks and that has very little load and very capable hardware. The
> upgrade was just your typical package upgrade:
>
> $ dpkg -s cassandra | egrep '^Ver|^Main'
> Maintainer: Eric Evans <ee...@apache.org>
> Version: 2.0.11
>
> Immediately after started it ran a couple of ParNews and then started
> executing CMS runs. In 10 minutes the node had become unreachable and
> was marked as down by the two other nodes in the ring, which are still
> 2.0.10.
>
> We have jstack output and the server logs but nothing seems to be
> jumping out. Has anyone else run into this? What should we be looking
> for?
>
>
> Peter
>