Posted to user@cassandra.apache.org by Faraaz Sareshwala <fs...@quantcast.com> on 2013/08/06 21:52:25 UTC

Large number of pending gossip stage tasks in nodetool tpstats

I'm running cassandra-1.2.8 in a cluster with 45 nodes across three racks. All
nodes are well behaved except one. Whenever I start this node, it starts
churning CPU. Running nodetool tpstats, I notice that the number of pending
gossip stage tasks is constantly increasing [1]. When looking at nodetool
gossipinfo, I notice that this node has updated to the latest schema hash, but
that it thinks other nodes in the cluster are on the older version. I've tried
to drain, decommission, wipe node data, bootstrap, and repair the node. However,
the node just started doing the same thing again.
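The backlog itself is easy to watch; a crude loop over tpstats is enough, something like the following (same nodetool path as in [1], and grep just pulls out the gossip pool line):

$ while true; do /cassandra/bin/nodetool tpstats | grep GossipStage; sleep 10; done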

Has anyone run into this issue before? Can anyone provide any insight into why
this node is the only one in the cluster having problems? Are there any easy
fixes?

Thank you,
Faraaz

[1] $ /cassandra/bin/nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0              8         0                 0
RequestResponseStage              0         0          49198         0                 0
MutationStage                     0         0         224286         0                 0
ReadRepairStage                   0         0              0         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       1      2213             18         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0             72         0                 0
MemtablePostFlusher               0         0            102         0                 0
FlushWriter                       0         0             99         0                 0
MiscStage                         0         0              0         0                 0
commitlog_archiver                0         0              0         0                 0
InternalResponseStage             0         0             19         0                 0
HintedHandoff                     0         0              2         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0

Re: Large number of pending gossip stage tasks in nodetool tpstats

Posted by Faraaz Sareshwala <fs...@quantcast.com>.
And by that last statement, I mean: are there any further things I should look for, given the information in my response? I'll definitely look at implementing your suggestions and see what I can find.

On Aug 7, 2013, at 7:31 PM, "Faraaz Sareshwala" <fs...@quantcast.com> wrote:

> Thanks Aaron. The node that was behaving this way was a production node, so I had to take some drastic measures to get it back to doing the right thing. It's no longer behaving this way after wiping the system tables and having Cassandra resync the schema from the other nodes. In hindsight, maybe I could have gotten away with a nodetool resetlocalschema. Since the node has been restored to a working state, I sadly can't run commands on it to investigate any longer.
> 
> When the node was in this hosed state, I did check nodetool gossipinfo. The bad node had the correct schema hash; the same as the rest of the nodes in the cluster. However, it thought every other node in the cluster had another schema hash, most likely the older one everyone migrated from.
> 
> This issue occurred again today on three machines, so I expect it to happen again. Typically I see it when our entire datacenter updates its configuration and restarts over the course of an hour. All nodes point to the same list of seeds, but the restart order is random across that hour. I'm not sure if this information helps at all.
> 
> Are there any specific things I should look for when it does occur again?
> 
> Thank you,
> Faraaz

Re: Large number of pending gossip stage tasks in nodetool tpstats

Posted by Faraaz Sareshwala <fs...@quantcast.com>.
Thanks Aaron. The node that was behaving this way was a production node, so I had to take some drastic measures to get it back to doing the right thing. It's no longer behaving this way after wiping the system tables and having Cassandra resync the schema from the other nodes. In hindsight, maybe I could have gotten away with a nodetool resetlocalschema. Since the node has been restored to a working state, I sadly can't run commands on it to investigate any longer.
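For anyone who hits the same thing, one way to force a single node to resync its schema is roughly the sequence below (the paths assume a default /var/lib/cassandra data directory and a service-style install, so adjust for your layout; only the schema tables are removed, not user data):

$ nodetool drain                                  # flush memtables and stop accepting writes
$ sudo service cassandra stop
$ rm -rf /var/lib/cassandra/data/system/schema_*  # schema_keyspaces, schema_columnfamilies, schema_columns
$ sudo service cassandra start                    # on startup the node pulls the schema back from its peers

The gentler first attempt would simply be:

$ nodetool resetlocalschema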

When the node was in this hosed state, I did check nodetool gossipinfo. The bad node had the correct schema hash; the same as the rest of the nodes in the cluster. However, it thought every other node in the cluster had another schema hash, most likely the older one everyone migrated from.

This issue occurred again today on three machines, so I expect it to happen again. Typically I see it when our entire datacenter updates its configuration and restarts over the course of an hour. All nodes point to the same list of seeds, but the restart order is random across that hour. I'm not sure if this information helps at all.

Are there any specific things I should look for when it does occur again?

Thank you,
Faraaz

On Aug 7, 2013, at 7:23 PM, "Aaron Morton" <aa...@thelastpickle.com> wrote:

>> When looking at nodetool
>> gossipinfo, I notice that this node has updated to the latest schema hash, but
>> that it thinks other nodes in the cluster are on the older version.
> What does describe cluster in cassandra-cli say? It will let you know if there are multiple schema versions in the cluster.
> 
> Can you include the output from nodetool gossipinfo?
> 
> You may also get some value from increasing the log level for org.apache.cassandra.gms.Gossiper to DEBUG so you can see what's going on. It's unusual for only the gossip pool to back up. If there were issues with GC taking CPU we would expect to see it across the board.
> 
> Cheers
> 
> 
> 
> -----------------
> Aaron Morton
> Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com

Re: Large number of pending gossip stage tasks in nodetool tpstats

Posted by Aaron Morton <aa...@thelastpickle.com>.
>  When looking at nodetool
> gossipinfo, I notice that this node has updated to the latest schema hash, but
> that it thinks other nodes in the cluster are on the older version.
What does describe cluster in cassandra-cli say? It will let you know if there are multiple schema versions in the cluster.
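For example, with the stock bin/cassandra-cli pointed at any node (the prompt just means no keyspace is selected yet):

$ cassandra-cli -h localhost
[default@unknown] describe cluster;

Each schema version UUID gets listed together with the endpoints that are on it, so seeing more than one UUID means disagreement.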

Can you include the output from nodetool gossipinfo?
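If the full output is too long to post, the schema lines are the interesting part, e.g.:

$ nodetool gossipinfo | grep -E '^/|SCHEMA'   # endpoint addresses plus the SCHEMA state each node is advertising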

You may also get some value from increasing the log level for org.apache.cassandra.gms.Gossiper to DEBUG so you can see what's going on. It's unusual for only the gossip pool to back up. If there were issues with GC taking CPU we would expect to see it across the board.
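With the stock 1.2 logging setup that is a one-line addition to conf/log4j-server.properties (picked up live if log4j is watching the file, otherwise on the next restart):

log4j.logger.org.apache.cassandra.gms.Gossiper=DEBUG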

Cheers



-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
