Posted to commits@cassandra.apache.org by "BugFinder (Jira)" <ji...@apache.org> on 2022/06/10 15:34:00 UTC

[jira] [Created] (CASSANDRA-17691) Gossip/Decommission tasklock contention on large clusters

BugFinder created CASSANDRA-17691:
-------------------------------------

             Summary: Gossip/Decommission tasklock contention on large clusters
                 Key: CASSANDRA-17691
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17691
             Project: Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip, Cluster/Membership
            Reporter: BugFinder


Hi,

I am a researcher working on finding scale issues in distributed systems. I have been analyzing Cassandra 4.0.0 and found a potential issue on the Gossip path. The method 'org.apache.cassandra.gms.Gossiper.addLocalApplicationStates' (line 1958) acquires the Gossiper's task lock and, while still holding it, can reach getAddressReplicas through the following call chain (format is [method][lineNumber]):

{{[org.apache.cassandra.gms.Gossiper.addLocalApplicationStates]}}
{{*Type=EXPLICIT_LOCK, start=1960, end=1970*}}
{{  [org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal][1965]}}
{{    [org.apache.cassandra.gms.Gossiper.doOnChangeNotifications][1950]}}
{{      [org.apache.cassandra.gms.IEndpointStateChangeSubscriber.onChange][1551]}}
{{        [org.apache.cassandra.service.StorageService.onChange][1551]}}
{{          [org.apache.cassandra.service.StorageService.handleStateRemoving][2308]}}
{{            [org.apache.cassandra.service.StorageService.restoreReplicaCount][2921]}}
{{              [org.apache.cassandra.service.StorageService.getChangedReplicasForLeaving][3128]}}
{{                [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][3203]}}
{{                  [org.apache.cassandra.locator.AbstractReplicationStrategy.{*}getAddressReplicas{*}][284]}}
{{                  *[line=243, dimensions=[Peers * Tokens]]*}}

 

This appears to affect the decommission path. The cost of the call depends on at least the number of peers and the number of tokens in the cluster (Peers * Tokens in the trace above), so decommissioning a node in a cluster with many peers and tokens can hold the Gossiper's task lock for a long time.

This likely affects other 4.x versions as well.
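
To illustrate the shape of the problem, here is a minimal Java sketch. It is not Cassandra code: the class TaskLockSketch, the methods replicaWork and applyStateUnderLock, and the peer and token counts are all hypothetical stand-ins. It only shows how work proportional to Peers * Tokens, when it runs inside the locked region, directly determines how long the lock is held.

```java
import java.util.concurrent.locks.ReentrantLock;

class TaskLockSketch {
    // Hypothetical stand-in for the Gossiper task lock (not the real field).
    static final ReentrantLock taskLock = new ReentrantLock();

    // Hypothetical stand-in for the replica computation:
    // one unit of work per (peer, token) pair.
    static long replicaWork(int peers, int tokensPerPeer) {
        long steps = 0;
        for (int p = 0; p < peers; p++) {
            for (int t = 0; t < tokensPerPeer; t++) {
                steps++;
            }
        }
        return steps;
    }

    // Mirrors the reported shape: the expensive work runs
    // before the lock is released.
    static long applyStateUnderLock(int peers, int tokensPerPeer) {
        taskLock.lock();
        try {
            return replicaWork(peers, tokensPerPeer);
        } finally {
            taskLock.unlock();
        }
    }

    public static void main(String[] args) {
        int tokensPerPeer = 256; // illustrative value only
        for (int peers : new int[] {10, 100, 1000}) {
            long steps = applyStateUnderLock(peers, tokensPerPeer);
            System.out.println(peers + " peers: " + steps + " steps under lock");
        }
    }
}
```

In this sketch, any other thread that needs taskLock has to wait for the whole nested loop to finish, and that wait grows with both dimensions.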



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
