Posted to user@cassandra.apache.org by Sotirios Delimanolis <so...@yahoo.com> on 2016/02/19 03:07:28 UTC

Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

We have a Cassandra cluster with 24 nodes. These nodes were running 2.0.16. 
While the nodes are in the ring and handling queries, we perform the upgrade to 2.1.12 (more or less) as follows, one node at a time (see the sketch after the list):
   
   - Stop the Cassandra process
   - Deploy jars, scripts, binaries, etc.
   - Start the Cassandra process
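
Roughly, the per-node loop looks like the sketch below. This is a minimal sketch only: it assumes passwordless SSH to each node and a systemd-managed "cassandra" service, and the host list and deploy script are placeholders, not our real tooling.

    import subprocess
    import time

    NODES = ["cass-01.example.com", "cass-02.example.com"]  # placeholder host list

    def ssh(host, command):
        # Run a command on the node; assumes passwordless SSH is already set up.
        subprocess.run(["ssh", host, command], check=True)

    def wait_until_up(host, timeout=600):
        # Rough readiness check: poll "nodetool status" on the node until its own
        # view of the ring shows Up/Normal (UN) entries again.
        deadline = time.time() + timeout
        while time.time() < deadline:
            result = subprocess.run(["ssh", host, "nodetool status"],
                                    stdout=subprocess.PIPE, universal_newlines=True)
            if result.returncode == 0 and any(
                    line.startswith("UN") for line in result.stdout.splitlines()):
                return
            time.sleep(10)
        raise RuntimeError(host + " did not come back up in time")

    for node in NODES:
        ssh(node, "nodetool drain")                          # flush memtables, stop accepting traffic
        ssh(node, "sudo systemctl stop cassandra")           # assumption: systemd-managed service
        ssh(node, "/opt/deploy/deploy_artifacts.sh 2.1.12")  # placeholder for our deploy step
        ssh(node, "sudo systemctl start cassandra")
        wait_until_up(node)                                  # only then move to the next node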

A few nodes into the upgrade, we start noticing that the majority of queries (mostly through Thrift) time out or report unavailable. Looking at system metrics, Cassandra GC time goes through the roof, which is what we assume causes the timeouts.
Once all nodes are upgraded, the cluster stabilizes and barely any timeouts occur.
What could explain this? Does it have anything to do with how a 2.0 node communicates with a 2.1 node?
Our Cassandra clients haven't changed.
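
For reference, this is roughly how we spot the long pauses (a minimal sketch; the log path is the default package location and the GCInspector line format differs a bit between 2.0 and 2.1, so the pattern is an approximation):

    import re

    LOG_PATH = "/var/log/cassandra/system.log"   # assumption: default package log location
    THRESHOLD_MS = 1000                          # flag any pause over a second

    # GCInspector lines contain the collector name and the pause duration in ms,
    # e.g. "ConcurrentMarkSweep GC in 2387ms" (2.1) or "GC for ParNew: 1020 ms ..." (2.0).
    PAUSE_RE = re.compile(r"(ParNew|ConcurrentMarkSweep).*?(\d+)\s*ms")

    with open(LOG_PATH) as log:
        for line in log:
            match = PAUSE_RE.search(line)
            if match and int(match.group(2)) >= THRESHOLD_MS:
                print(line.rstrip())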





Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

Posted by Sotirios Delimanolis <so...@yahoo.com>.
We're not all the way there yet with native. But the increased GC time is temporary, only during the deployment. After all nodes are on 2.1, everything is smooth. 


Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

Posted by daemeon reiydelle <da...@gmail.com>.
FYI, my observations were with native, not thrift.


.......

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872


Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

Posted by Sotirios Delimanolis <so...@yahoo.com>.
Does your cluster have 24 or more nodes, or fewer?
We did the same upgrade on a smaller cluster of 5 nodes and didn't see this behavior. On the 24-node cluster, the timeouts only started once roughly 5-7 nodes had been upgraded.
We're doing some more upgrades next week, trying different deployment plans. I'll report back with the results.
Thanks for the reply (we absolutely want to move to CQL).


Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

Posted by daemeon reiydelle <da...@gmail.com>.
May be unrelated, but I found highly variable latency (max latency) on the 2.1 code tree when loading new data (and reading). Others found that G1 vs. CMS does not make a difference, and there is some evidence that 8/12/16 GB heaps make no difference either. These were latencies in the 10-30 SECOND range, and they did cause timeouts. You may not be seeing a 2.0 vs. 2.1 issue, but rather a 2.1 issue proper. While others did not find this associated with stop-the-world GC, I saw some evidence of the same (using cassandra-stress, and I recently reproduced the issue with YCSB!).
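
If anyone wants to try reproducing it, this is the shape of what I ran (a rough sketch driving the 2.1-style cassandra-stress tool from Python; the contact point, operation count and thread count are placeholders, not my exact workload, and cassandra-stress from the Cassandra tools directory is assumed to be on the PATH):

    import subprocess

    NODE = "10.0.0.1"    # placeholder contact point
    OPS = "n=1000000"    # placeholder operation count

    # Write a data set first, then read it back, using the 2.1-style stress syntax.
    for workload in ("write", "read"):
        subprocess.run(
            ["cassandra-stress", workload, OPS,
             "-rate", "threads=50",
             "-node", NODE],
            check=True,
        )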


.......

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872


Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
I performed this exact upgrade a few days ago, except that the clients were using the native protocol, and it went smoothly. So I think this might be Thrift related. No idea what is producing it though, just wanted to give the info FWIW.

As a side note, unrelated to the issue, performance using the native protocol is a lot better than Thrift starting in C* 2.1. Drivers using the native protocol are also more modern, allowing you to do very interesting things. Now that you are on 2.1, updating to the native protocol is something you might want to do soon enough :-).
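
For illustration, a minimal sketch of a native protocol client using the DataStax Python driver (cassandra-driver); the contact point, keyspace and table names here are placeholders:

    from cassandra.cluster import Cluster   # pip install cassandra-driver

    # Placeholder contact point and keyspace; the driver speaks the native (CQL) protocol.
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("my_keyspace")

    # Prepared statements are worth using for anything you run repeatedly.
    select = session.prepare("SELECT id, value FROM my_table WHERE id = ?")
    for row in session.execute(select, [42]):
        print(row.id, row.value)

    cluster.shutdown()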

C*heers,
-----------------
Alain Rodriguez
France

The Last Pickle
http://www.thelastpickle.com
