Posted to user@cassandra.apache.org by Marcin Pietraszek <mp...@opera.com> on 2015/04/02 12:05:50 UTC

Cluster status instability

Hi!

We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
installed. Assume we have nodes A, B, C, D, E. On some irregular basis
one of those nodes starts to report that a subset of the other nodes is
in DN state, although the C* daemon on all nodes is running:

A$ nodetool status
UN B
DN C
DN D
UN E

B$ nodetool status
UN A
UN C
UN D
UN E

C$ nodetool status
DN A
UN B
UN D
UN E

After a restart of node A, C and D report that A is in UN, and A also
claims that the whole cluster is in UN state. Right now I don't have any
clear steps to reproduce that situation. Do you guys have any idea
what could be causing such behaviour? How could this be prevented?

It seems like when node A is the coordinator and gets a request for data
replicated on C and D, it responds with an UnavailableException; after
restarting A that problem disappears.
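
Since there are no clear repro steps, something along these lines (only a
sketch; the hostnames and ssh setup are placeholders, not our real ones)
is the kind of check that would show each node's own view and flag the
asymmetry as soon as it appears:

#!/bin/bash
# Sketch: ask every node for its *own* view of the ring and flag any DN
# entries. Assumes passwordless ssh to each node and nodetool on its PATH;
# the hostnames below are placeholders.
NODES="nodeA nodeB nodeC nodeD nodeE"

for n in $NODES; do
  down=$(ssh "$n" nodetool status 2>/dev/null | awk '/^DN/ {print $2}')
  if [ -n "$down" ]; then
    echo "$n reports the following peers as DOWN:"
    printf '  %s\n' $down
  fi
done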

-- 
mp

Re: Cluster status instability

Posted by Erik Forsberg <fo...@opera.com>.
To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems
to stay that way for a very long time (hours). I'm not even sure it will
recover without a restart.
* I've tried to stop and then start gossip with nodetool on the node that
thinks several other nodes are down. Did not help.
* nodetool gossipinfo when run on an affected node claims STATUS:NORMAL for
all nodes (including the ones marked as down in status output)
* It is quite possible that the problem starts at the time of day when we
have a lot of bulkloading going on. But why does it then stay for several
hours after the load goes down?
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12
about a month ago, but I have no hard data to back that up.

Regarding region/snitch - this is not an AWS deployment; we run in our own
datacenter with GossipingPropertyFileSnitch.
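
(For reference, with GossipingPropertyFileSnitch each node just reads its DC
and rack from cassandra-rackdc.properties; ours look along these lines - the
path shown is the packaged default, adjust for your layout:)

$ grep -v '^#' /etc/cassandra/cassandra-rackdc.properties
dc=iceland
rack=rack2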

Right now I have this situation with one node (04-05) thinking that there
are 4 nodes down. The rest of the cluster (56 nodes in total) thinks all
nodes are up. Load on the cluster right now is minimal and there's no GC
going on. Heap usage is approximately 3.5/6 GB.

root@cssa04-05:~# nodetool status|grep DN
DN  2001:4c28:1:413:0:1:2:5   1.07 TB    256     1.8%
114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6   1.06 TB    256     1.8%
b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13  896.82 GB  256     1.6%
4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7   1.04 TB    256     1.8%
95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3

Excerpt from nodetool gossipinfo showing one node that status thinks is
down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
  generation:1427712750
  heartbeat:2310212
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack2
  LOAD:1.172524771195E12
  INTERNAL_IP:2001:4c28:1:413:0:1:2:5
  HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100493381707736523347375230104768602825
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
  generation:1427714889
  heartbeat:2305710
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack3
  LOAD:1.047542503234E12
  INTERNAL_IP:2001:4c28:1:413:0:1:3:12
  HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100163259989151698942931348962560111256
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that made
04-05 mark it as up, with no success.
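
In case it's useful, this is roughly the cross-check I've been doing by hand
on the affected node, as a small sketch (it only assumes nodetool is on the
PATH):

#!/bin/bash
# Sketch: for every node this host marks DN in 'nodetool status', print what
# the same host's 'nodetool gossipinfo' claims about it.
nodetool status | awk '/^DN/ {print $2}' | while read -r addr; do
  echo "== $addr (DN according to status) =="
  nodetool gossipinfo \
    | awk -v n="/$addr" 'index($0, n) == 1 {f=1; next} /^\// {f=0} f' \
    | grep -E 'generation|heartbeat|STATUS'
done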

Please let me know what other debug information I can provide.

Regards,
\EF

On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle <da...@gmail.com>
wrote:

> Do you happen to be using a tool like Nagios or Ganglia that are able to
> report utilization (CPU, Load, disk io, network)? There are plugins for
> both that will also notify you of (depending on whether you enabled the
> intermediate GC logging) about what is happening.
>
>
>
> On Thu, Apr 2, 2015 at 8:35 AM, Jan <cn...@yahoo.com> wrote:
>
>> Marcin  ;
>>
>> are all your nodes within the same Region   ?
>> If not in the same region,   what is the Snitch type that you are using
>> ?
>>
>> Jan/
>>
>>
>>
>>   On Thursday, April 2, 2015 3:28 AM, Michal Michalski <
>> michal.michalski@boxever.com> wrote:
>>
>>
>> Hey Marcin,
>>
>> Are they actually going up and down repeatedly (flapping) or just down
>> and they never come back?
>> There might be different reasons for flapping nodes, but to list what I
>> have at the top of my head right now:
>>
>> 1. Network issues. I don't think it's your case, but you can read about
>> the issues some people are having when deploying C* on AWS EC2 (keyword to
>> look for: phi_convict_threshold)
>>
>> 2. Heavy load. Node is under heavy load because of massive number of
>> reads / writes / bulkloads or e.g. unthrottled compaction etc., which may
>> result in extensive GC.
>>
>> Could any of these be a problem in your case? I'd start from
>> investigating GC logs e.g. to see how long does the "stop the world" full
>> GC take (GC logs should be on by default from what I can see [1])
>>
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-5319
>>
>> Michał
>>
>>
>> Kind regards,
>> Michał Michalski,
>> michal.michalski@boxever.com
>>
>> On 2 April 2015 at 11:05, Marcin Pietraszek <mp...@opera.com>
>> wrote:
>>
>> Hi!
>>
>> We have 56 node cluster with C* 2.0.13 + CASSANDRA-9036 patch
>> installed. Assume we have nodes A, B, C, D, E. On some irregular basis
>> one of those nodes starts to report that subset of other nodes is in
>> DN state although C* deamon on all nodes is running:
>>
>> A$ nodetool status
>> UN B
>> DN C
>> DN D
>> UN E
>>
>> B$ nodetool status
>> UN A
>> UN C
>> UN D
>> UN E
>>
>> C$ nodetool status
>> DN A
>> UN B
>> UN D
>> UN E
>>
>> After restart of A node, C and D report that A it's in UN and also A
>> claims that whole cluster is in UN state. Right now I don't have any
>> clear steps to reproduce that situation, do you guys have any idea
>> what could be causing such behaviour? How this could be prevented?
>>
>> It seems like when A node is a coordinator and gets request for some
>> data being replicated on C and D it respond with Unavailable
>> exception, after restarting A that problem disapears.
>>
>> --
>> mp
>>
>>
>>
>>
>>
>

Re: Cluster status instability

Posted by daemeon reiydelle <da...@gmail.com>.
Do you happen to be using a tool like Nagios or Ganglia that is able to
report utilization (CPU, load, disk I/O, network)? There are plugins for
both that will also notify you (depending on whether you enabled the
intermediate GC logging) about what is happening.
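
Even a minimal Nagios-style wrapper around nodetool would flag the condition
(a sketch only; the threshold is a placeholder and the exit codes follow the
usual Nagios convention):

#!/bin/bash
# Sketch of a Nagios-style check: WARN if this host sees any peer as DN,
# CRIT at or above $CRIT, UNKNOWN if nodetool itself fails.
CRIT=3   # placeholder threshold

out=$(nodetool status 2>/dev/null) || { echo "UNKNOWN - nodetool failed"; exit 3; }
down=$(printf '%s\n' "$out" | grep -c '^DN')

if [ "$down" -ge "$CRIT" ]; then
  echo "CRITICAL - $down node(s) reported DN"; exit 2
elif [ "$down" -gt 0 ]; then
  echo "WARNING - $down node(s) reported DN"; exit 1
else
  echo "OK - no DN nodes"; exit 0
fi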



On Thu, Apr 2, 2015 at 8:35 AM, Jan <cn...@yahoo.com> wrote:

> Marcin  ;
>
> are all your nodes within the same Region   ?
> If not in the same region,   what is the Snitch type that you are using
> ?
>
> Jan/
>
>
>
>   On Thursday, April 2, 2015 3:28 AM, Michal Michalski <
> michal.michalski@boxever.com> wrote:
>
>
> Hey Marcin,
>
> Are they actually going up and down repeatedly (flapping) or just down and
> they never come back?
> There might be different reasons for flapping nodes, but to list what I
> have at the top of my head right now:
>
> 1. Network issues. I don't think it's your case, but you can read about
> the issues some people are having when deploying C* on AWS EC2 (keyword to
> look for: phi_convict_threshold)
>
> 2. Heavy load. Node is under heavy load because of massive number of reads
> / writes / bulkloads or e.g. unthrottled compaction etc., which may result
> in extensive GC.
>
> Could any of these be a problem in your case? I'd start from investigating
> GC logs e.g. to see how long does the "stop the world" full GC take (GC
> logs should be on by default from what I can see [1])
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-5319
>
> Michał
>
>
> Kind regards,
> Michał Michalski,
> michal.michalski@boxever.com
>
> On 2 April 2015 at 11:05, Marcin Pietraszek <mp...@opera.com> wrote:
>
> Hi!
>
> We have 56 node cluster with C* 2.0.13 + CASSANDRA-9036 patch
> installed. Assume we have nodes A, B, C, D, E. On some irregular basis
> one of those nodes starts to report that subset of other nodes is in
> DN state although C* deamon on all nodes is running:
>
> A$ nodetool status
> UN B
> DN C
> DN D
> UN E
>
> B$ nodetool status
> UN A
> UN C
> UN D
> UN E
>
> C$ nodetool status
> DN A
> UN B
> UN D
> UN E
>
> After restart of A node, C and D report that A it's in UN and also A
> claims that whole cluster is in UN state. Right now I don't have any
> clear steps to reproduce that situation, do you guys have any idea
> what could be causing such behaviour? How this could be prevented?
>
> It seems like when A node is a coordinator and gets request for some
> data being replicated on C and D it respond with Unavailable
> exception, after restarting A that problem disapears.
>
> --
> mp
>
>
>
>
>

Re: Cluster status instability

Posted by Jan <cn...@yahoo.com>.
Marcin;
are all your nodes within the same Region? If not in the same region, what is the Snitch type that you are using?
Jan/


On Thursday, April 2, 2015 3:28 AM, Michal Michalski <mi...@boxever.com> wrote:

Hey Marcin,

Are they actually going up and down repeatedly (flapping) or just down and they never come back?
There might be different reasons for flapping nodes, but to list what I have at the top of my head right now:

1. Network issues. I don't think it's your case, but you can read about the issues some people are having when deploying C* on AWS EC2 (keyword to look for: phi_convict_threshold)

2. Heavy load. Node is under heavy load because of massive number of reads / writes / bulkloads or e.g. unthrottled compaction etc., which may result in extensive GC.

Could any of these be a problem in your case? I'd start from investigating GC logs e.g. to see how long does the "stop the world" full GC take (GC logs should be on by default from what I can see [1])

[1] https://issues.apache.org/jira/browse/CASSANDRA-5319

Michał

Kind regards,
Michał Michalski,
michal.michalski@boxever.com
On 2 April 2015 at 11:05, Marcin Pietraszek <mp...@opera.com> wrote:

Hi!

We have 56 node cluster with C* 2.0.13 + CASSANDRA-9036 patch
installed. Assume we have nodes A, B, C, D, E. On some irregular basis
one of those nodes starts to report that subset of other nodes is in
DN state although C* deamon on all nodes is running:

A$ nodetool status
UN B
DN C
DN D
UN E

B$ nodetool status
UN A
UN C
UN D
UN E

C$ nodetool status
DN A
UN B
UN D
UN E

After restart of A node, C and D report that A it's in UN and also A
claims that whole cluster is in UN state. Right now I don't have any
clear steps to reproduce that situation, do you guys have any idea
what could be causing such behaviour? How this could be prevented?

It seems like when A node is a coordinator and gets request for some
data being replicated on C and D it respond with Unavailable
exception, after restarting A that problem disapears.

--
mp




  

Re: Cluster status instability

Posted by Michal Michalski <mi...@boxever.com>.
Hey Marcin,

Are they actually going up and down repeatedly (flapping), or do they go down
and never come back?
There might be different reasons for flapping nodes, but to list what I have
off the top of my head right now:

1. Network issues. I don't think it's your case, but you can read about the
issues some people are having when deploying C* on AWS EC2 (keyword to look
for: phi_convict_threshold; see the cassandra.yaml sketch after point 2)

2. Heavy load. The node is under heavy load because of a massive number of
reads / writes / bulkloads, or e.g. unthrottled compaction etc., which may
result in extensive GC.
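
Regarding point 1: phi_convict_threshold is a single knob in cassandra.yaml.
A quick way to check it, assuming the packaged default path:

$ grep phi_convict_threshold /etc/cassandra/cassandra.yaml
# phi_convict_threshold: 8

If it's still commented out, the default of 8 applies; uncommenting and
raising it (10-12 is what people usually try on unreliable networks) makes
the failure detector slower to convict peers, and since it's a cassandra.yaml
setting it needs a rolling restart to take effect.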

Could any of these be a problem in your case? I'd start by investigating the
GC logs, e.g. to see how long the "stop the world" full GC takes (GC logs
should be on by default from what I can see [1])

[1] https://issues.apache.org/jira/browse/CASSANDRA-5319
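
(If digging through the raw JVM GC logs is a pain, the GCInspector lines that
Cassandra writes to system.log are usually enough to spot long pauses; a rough
sketch, with the path being the packaged default:)

$ grep -h GCInspector /var/log/cassandra/system.log* | grep -oE '[0-9]+ ms' | sort -n | tail -5

From what I remember GCInspector only logs the longer collections, so anything
showing up there in the seconds range lines up with the kind of pause that
makes the failure detector convict a node.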

Michał


Kind regards,
Michał Michalski,
michal.michalski@boxever.com

On 2 April 2015 at 11:05, Marcin Pietraszek <mp...@opera.com> wrote:

> Hi!
>
> We have 56 node cluster with C* 2.0.13 + CASSANDRA-9036 patch
> installed. Assume we have nodes A, B, C, D, E. On some irregular basis
> one of those nodes starts to report that subset of other nodes is in
> DN state although C* deamon on all nodes is running:
>
> A$ nodetool status
> UN B
> DN C
> DN D
> UN E
>
> B$ nodetool status
> UN A
> UN C
> UN D
> UN E
>
> C$ nodetool status
> DN A
> UN B
> UN D
> UN E
>
> After restart of A node, C and D report that A it's in UN and also A
> claims that whole cluster is in UN state. Right now I don't have any
> clear steps to reproduce that situation, do you guys have any idea
> what could be causing such behaviour? How this could be prevented?
>
> It seems like when A node is a coordinator and gets request for some
> data being replicated on C and D it respond with Unavailable
> exception, after restarting A that problem disapears.
>
> --
> mp
>