Posted to user@cassandra.apache.org by Michael Fong <mi...@ruckuswireless.com> on 2016/03/23 09:11:42 UTC

Gossip heartbeat and packet capture

Hi, all,


We are trying to reason about why the connection in a C* (v1.x) cluster keeps flapping in production. (It is a two-node cluster; each node keeps marking the other node DOWN, and the other node comes back UP within seconds. This happens multiple times a day.) We have checked the load on the cluster: it is very light, and GC activity is low as well. We have also verified that the network interfaces and devices on the nodes were working fine during the incidents. We are now shifting our investigation to the network topology and settings, so we plan to capture gossip heartbeat packets to verify that they are received as expected on the other end.

Has anyone tried capturing gossip internode traffic? What filter or criteria would pick out only the heartbeat-related packets?

Thanks in advance!


Michael
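
A possible starting point for the capture filter, sketched in Python. Internode traffic, gossip included, travels over TCP on the storage_port set in cassandra.yaml (7000 by default, or ssl_storage_port 7001 when internode encryption is enabled). The helper name, the interface eth0, the peer address, and the output file name below are placeholders for illustration, not values from this thread. Because reads, writes, and streaming share the same port, this filter narrows the capture to internode traffic with one peer, not to heartbeats alone.

```python
def gossip_capture_command(peer_ip, storage_port=7000,
                           interface="eth0", out_file="gossip.pcap"):
    """Build a tcpdump argv that records internode traffic with one peer.

    storage_port=7000 is the cassandra.yaml default; adjust it if the
    cluster overrides it. interface and out_file are placeholders.
    """
    # BPF expression: TCP on the storage port, to or from the peer node.
    bpf = f"tcp port {storage_port} and host {peer_ip}"
    # -s 0 keeps full packets so the payload can be inspected afterwards.
    return ["tcpdump", "-i", interface, "-s", "0", "-w", out_file, bpf]


if __name__ == "__main__":
    # Hypothetical peer address; print the command instead of running it.
    print(" ".join(gossip_capture_command("10.0.0.2")))
```

Running the resulting command on both nodes at the same time would allow the two captures to be compared by timestamp, which is what checking "received as expected on the other end" requires.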

RE: Gossip heartbeat and packet capture

Posted by Michael Fong <mi...@ruckuswireless.com>.
Hi,

Thanks for your comment.

The Cassandra version is 1.2.15, and we have already raised phi_convict_threshold to 12 in production. This setting works well in most of our production deployments, except for this particular one...
Also, adding more nodes is not a feasible option for now. :(

After going through the source code of C* 1.2.15, the serialized gossip heartbeat messages (SYN/ACK/ACK2) seem to contain 1. the cluster name and 2. the partitioner name in the payload. Perhaps I could pick out the gossip heartbeat packets by filtering on these two strings?
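
A minimal sketch of that filtering idea in Python, under several assumptions that are not verified here: the two strings are written with Java's DataOutput.writeUTF (a 2-byte big-endian length followed by the bytes), the names are plain ASCII, internode compression and encryption are off, and a message fits in a single TCP payload. The function names are hypothetical. It only flags payloads that actually carry both strings, so any message type that omits them would be missed.

```python
import struct


def java_utf(text: str) -> bytes:
    """Encode ASCII text the way DataOutput.writeUTF lays it out:
    a 2-byte big-endian length followed by the bytes themselves."""
    encoded = text.encode("ascii")
    return struct.pack(">H", len(encoded)) + encoded


def mentions_cluster_and_partitioner(payload: bytes, cluster_name: str,
                                     partitioner: str) -> bool:
    """Return True if the payload holds the length-prefixed cluster name
    followed, somewhere later, by the length-prefixed partitioner name."""
    cluster_bytes = java_utf(cluster_name)
    start = payload.find(cluster_bytes)
    if start < 0:
        return False
    # Search for the partitioner only after the end of the cluster name.
    return payload.find(java_utf(partitioner), start + len(cluster_bytes)) >= 0
```

Applied to each TCP payload pulled from a capture file, this would separate candidate gossip messages from the rest of the internode traffic on the same port.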

Sincerely,

Michael Fong


RE: Gossip heartbeat and packet capture

Posted by SE...@homedepot.com.
Is this from the 1.1 line, perhaps? In my experience, 1.1 could be very flappy for no particular reason we could discover, and it is a pretty dusty version. Upgrading to 2.1 or later would be a good idea. If you have to upgrade in place without downtime, you will need to step through several intermediate upgrades to reach the latest versions. (The late versions of 1.2 are pretty stable, though.)

Also, you can look at the phi_convict_threshold parameter in cassandra.yaml to help with network flakiness. The default is 8, I believe, in most versions. Increasing it to 10 or 12 might help.
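
To give a feel for what the threshold means, here is a simplified Python sketch of a phi accrual calculation. It assumes heartbeat inter-arrival times follow an exponential distribution, so phi grows linearly with the time since the last heartbeat. It is an illustration of the technique, not Cassandra's own code; the function names and the 1-second mean interval are assumptions.

```python
import math


def phi(seconds_since_last_heartbeat: float, mean_interval: float) -> float:
    """Suspicion level under an exponential inter-arrival model:
    phi = -log10(P(silence >= t)) = t / (mean_interval * ln 10)."""
    return seconds_since_last_heartbeat / (mean_interval * math.log(10))


def silence_needed(threshold: float, mean_interval: float) -> float:
    """Seconds without a heartbeat before phi reaches the threshold."""
    return threshold * mean_interval * math.log(10)


if __name__ == "__main__":
    # Assumed mean interval of 1 second between observed heartbeats.
    for threshold in (8, 10, 12):
        print(threshold, round(silence_needed(threshold, 1.0), 1))
```

Under these assumptions, a threshold of 8 corresponds to roughly 18 seconds of silence and a threshold of 12 to roughly 28 seconds, which is why raising the value makes the detector more tolerant of short network hiccups.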

Finally, a 2-node cluster doesn't give you many of the benefits of a distributed system. Consider adding some nodes, if you can.

Sean Durity - Lead Cassandra Admin
Big DATA Team
For support, create a JIRA<https://portal.homedepot.com/sites/bigdata/Shared%20Documents/Jira%20Hadoop%20Support%20Workflow.pdf>
