You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Brandon Williams (Jira)" <ji...@apache.org> on 2021/04/26 20:57:00 UTC
[jira] [Commented] (CASSANDRA-16238) Fix flaky jvm-dtests that fail with Unable to contact any seeds

    [ https://issues.apache.org/jira/browse/CASSANDRA-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332737#comment-17332737 ] 

Brandon Williams commented on CASSANDRA-16238:
----------------------------------------------

Branch [here|https://github.com/driftx/cassandra/tree/CASSANDRA-16238].

[!https://ci-cassandra.apache.org/job/Cassandra-devbranch/714/badge/icon!|https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/714/pipeline]

{noformat}
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,381 Gossiper.java:1296 - Node /127.0.0.1:7012 is now part of the cluster
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,381 StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token [-3074457345618258603]
[junit-timeout] INFO  [node4_GossipTasks:1] node4 2021-04-23 16:23:05,393 Gossiper.java:997 - FatClient /127.0.0.1:7012 has been silent for 0ms, removing from gossip
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,398 StorageService.java:2677 - New node /127.0.0.1:7012 at token -3074457345618258603
{noformat}

We can see here that .1 is detected for the first time via gossip, and as it is going through StorageService but before it is added to TokenMetatadata, the gossiper's status check has begun running.  Since the quarantine delay is overridden to zero, without a presence in TMD the node is not a member yet and thus deemed a fat client, and removed.

{noformat}
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,407 TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,414 TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,422 StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token [-3074457345618258603]
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,422 StorageService.java:2733 - Node /127.0.0.1:7012 state jump to NORMAL
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 Gossiper.java:1243 - removing expire time for endpoint : /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 Gossiper.java:1244 - InetAddress /127.0.0.1:7012 is now UP
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 Gossiper.java:579 - removed /127.0.0.1:7012 from seeds, updated seeds list = []
[junit-timeout] WARN  16:23:05 Seeds list is now empty!
[junit-timeout] WARN  [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 Gossiper.java:581 - Seeds list is now empty!
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,452 Gossiper.java:590 - removing endpoint /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,452 Gossiper.java:561 - evicting /127.0.0.1:7012 from gossip
[junit-timeout] DEBUG [node4_GossipTasks:1] node4 2021-04-23 16:23:06,453 Gossiper.java:1025 - 0 elapsed, /127.0.0.1:7012 gossip quarantine over
{noformat}

Crucially, as part of this removal the node is also removed from the seeds list, since it is listed there. The warning about the empty seed list is added from my branch.

{noformat}
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,176 Gossiper.java:1296 - Node /127.0.0.1:7012 is now part of the cluster
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,177 StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token [-3074457345618258603]
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,189 StorageService.java:2677 - New node /127.0.0.1:7012 at token -3074457345618258603
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,198 TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,201 TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,208 StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token [-3074457345618258603]
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,208 StorageService.java:2733 - Node /127.0.0.1:7012 state jump to NORMAL
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,220 Gossiper.java:1243 - removing expire time for endpoint : /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,220 Gossiper.java:1244 - InetAddress /127.0.0.1:7012 is now UP
[junit-timeout] DEBUG [node4_BatchlogTasks:1] node4 2021-04-23 16:23:07,383 BatchlogManager.java:246 - Updating batchlog replay throttle to 1024 KB/s, 256 KB/s per endpoint
[junit-timeout] DEBUG [node4_isolatedExecutor:1] node4 2021-04-23 16:23:12,457 Gossiper.java:2142 - Gossip looks settled.
{noformat}

Another round of gossip mostly papers over this problem, except:

{noformat}
[junit-timeout] Unable to contact any seeds: []
[junit-timeout] java.lang.IllegalStateException: Unable to contact any seeds: []
[junit-timeout]         at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1752)
[junit-timeout]         at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1054)
[junit-timeout]         at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1015)
[junit-timeout]         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:799)
[junit-timeout]         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
[junit-timeout]         at org.apache.cassandra.distributed.impl.Instance.lambda$startup$10(Instance.java:541)
[junit-timeout]         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[junit-timeout]         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[junit-timeout]         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[junit-timeout]         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[junit-timeout]         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[junit-timeout]         at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

There's (correctly) no logic to add a removed node back as a seed.  This branch also prints the seeds list when this happens to disambiguate between problems contacting seeds, and problems like this where the problem is not having any seeds left to contact.

Ultimately, quarantine delay can't be set this low.  I tried low non-zero values, but it needs to be set to at least one gossiper interval to avoid not accidentally reaping nodes as fat clients prematurely, and since ring delay was already being overridden and quarantine delay derived from that, I just removed the quarantine delay override.

> Fix flaky jvm-dtests that fail with Unable to contact any seeds
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-16238
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16238
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: David Capwell
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 4.0-rc
>
>         Attachments: 16238-archived-failures.txt
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/745/workflows/1c7e589e-b5af-4a56-b40a-43da424602c7/jobs/4231
> {code}
> test teardown failure
> Unexpected error found in node logs (see stdout for full details). Errors: [ERROR [main] 2020-10-29 17:38:13,808 CassandraDaemon.java:817 - Exception encountered during startup
> java.lang.IllegalStateException: Unable to contact any seeds!
> 	at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1601)
> 	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:931)
> 	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:892)
> 	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:699)
> 	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:635)
> 	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:407)
> 	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:671)
> 	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:795), ERROR [main] 2020-10-29 17:38:13,808 CassandraDaemon.java:817 - Exception encountered during startup
> java.lang.IllegalStateException: Unable to contact any seeds!
> 	at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1601)
> 	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:931)
> 	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:892)
> 	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:699)
> 	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:635)
> 	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:407)
> 	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:671)
> 	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:795)]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org