You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Berenguer Blasi (Jira)" <ji...@apache.org> on 2020/10/21 09:52:00 UTC

[jira] [Comment Edited] (CASSANDRA-15865) Flaky dtest hintedhandoff_test.py::TestHintedHandoffConfig::test_hintedhandoff_setmaxwindow

    [ https://issues.apache.org/jira/browse/CASSANDRA-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218202#comment-17218202 ] 

Berenguer Blasi edited comment on CASSANDRA-15865 at 10/21/20, 9:51 AM:
------------------------------------------------------------------------

Some progress. The only failure's logs look like 

{noformat}
grep -ri "is now" node1_debug.log
INFO  [GossipStage:1] 2020-10-19 15:34:07,204 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:34:07,264 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:34:19,054 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:34:31,968 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:04,653 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:35:24,762 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP

grep -ri "is now" node2_debug.log
INFO  [GossipStage:1] 2020-10-19 15:34:07,930 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:34:07,950 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:34:31,988 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:34:45,688 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:35:04,633 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:24,791 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:25,316 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
{noformat}

whereas all the other million non-failing runs diplay
{noformat}
grep -ri "is now" node1_debug.log
INFO  [GossipStage:1] 2020-10-19 15:35:40,341 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:35:40,391 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:52,169 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:36:05,123 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:36:57,732 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP

grep -ri "is now" node2_debug.log
INFO  [GossipStage:1] 2020-10-19 15:35:41,061 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:35:41,094 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:05,150 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:18,807 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:36:37,710 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,728 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:57,756 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:58,660 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
{noformat}

We can see node2 behaves exactly the same whereas node1 displays an extra 'DOWN'. Matching timestamps this is what's going on in the logs for a bad run
{noformat}
DEBUG [GossipStage:1] 2020-10-19 15:35:11,626 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60)
INFO  [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
DEBUG [GossipStage:1] 2020-10-19 15:35:11,716 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000
DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1192 - removing expire time for endpoint : /127.0.0.2:7000
INFO  [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60)

DEBUG [GossipStage:1] 2020-10-19 15:35:23,368 Gossiper.java:1480 - Shadow request received, adding all states
DEBUG [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:471 - Convicting /127.0.0.2:7000 with status shutdown - alive true
INFO  [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
DEBUG [GossipStage:1] 2020-10-19 15:35:24,728 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000
DEBUG [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change
INFO  [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP
{noformat}

and this is what happens for a good run:
{noformat}
DEBUG [GossipStage:1] 2020-10-19 15:36:44,713 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (4f3d62af-22c5-3886-8c0b-c5f1b95fc403)
INFO  [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
DEBUG [GossipStage:1] 2020-10-19 15:36:44,960 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000

DEBUG [GossipStage:1] 2020-10-19 15:36:56,647 Gossiper.java:1480 - Shadow request received, adding all states
DEBUG [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change
INFO  [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP
{noformat}

Looking at {{Gossiper}} code and trying infer what's going on looks like node2 is reported UP/DOWN an extra time but I can't infer what's happening from only looking at the logs. I have added tracing level to the test and hope to repro overnight.

I am attaching a  zip [^logsGoodNBad.zip]  for the good and bad runs for the record.


was (Author: bereng):
Some progress. The only failure's logs look like 

{noformat}
grep -ri "is now" node1_debug.log
INFO  [GossipStage:1] 2020-10-19 15:34:07,204 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:34:07,264 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:34:19,054 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:34:31,968 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:04,653 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:35:24,762 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP

grep -ri "is now" node2_debug.log
INFO  [GossipStage:1] 2020-10-19 15:34:07,930 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:34:07,950 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:34:31,988 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:34:45,688 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:35:04,633 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:24,791 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:25,316 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
{noformat}

whereas all the other million non-failing runs diplay
{noformat}
grep -ri "is now" node1_debug.log
INFO  [GossipStage:1] 2020-10-19 15:35:40,341 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:35:40,391 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:35:52,169 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:36:05,123 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:36:57,732 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP

grep -ri "is now" node2_debug.log
INFO  [GossipStage:1] 2020-10-19 15:35:41,061 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster
INFO  [GossipStage:1] 2020-10-19 15:35:41,094 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:05,150 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:18,807 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
INFO  [GossipStage:1] 2020-10-19 15:36:37,710 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:37,728 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:57,756 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP
INFO  [GossipStage:1] 2020-10-19 15:36:58,660 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN
{noformat}

We can see node2 behaves exactly the same whereas node1 displays an extra 'DOWN'. Matching timestamps this is what's going on in the logs for a bad run
{noformat}
DEBUG [GossipStage:1] 2020-10-19 15:35:11,626 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60)
INFO  [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
DEBUG [GossipStage:1] 2020-10-19 15:35:11,716 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000
DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1192 - removing expire time for endpoint : /127.0.0.2:7000
INFO  [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP
DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60)

DEBUG [GossipStage:1] 2020-10-19 15:35:23,368 Gossiper.java:1480 - Shadow request received, adding all states
DEBUG [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:471 - Convicting /127.0.0.2:7000 with status shutdown - alive true
INFO  [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
DEBUG [GossipStage:1] 2020-10-19 15:35:24,728 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000
DEBUG [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change
INFO  [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP
{noformat}

and this is what happens for a good run:
{noformat}
DEBUG [GossipStage:1] 2020-10-19 15:36:44,713 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (4f3d62af-22c5-3886-8c0b-c5f1b95fc403)
INFO  [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN
DEBUG [GossipStage:1] 2020-10-19 15:36:44,960 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000

DEBUG [GossipStage:1] 2020-10-19 15:36:56,647 Gossiper.java:1480 - Shadow request received, adding all states
DEBUG [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change
INFO  [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP
{noformat}

Looking at {{Gossiper}} code and trying infer what's going on looks like node2 is reported UP/DOWN an extra time but I can't infer what's happening from only looking at the logs. I have added tracing level to the test and hope to repro overnight.

I am attaching a zip for the good and bad runs for the record.

> Flaky dtest hintedhandoff_test.py::TestHintedHandoffConfig::test_hintedhandoff_setmaxwindow
> -------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15865
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15865
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Sam Tunnicliffe
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 4.0-beta
>
>         Attachments: logsGoodNBad.zip
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> I've seen this fail a couple of times under JDK11, when it doesn't appear to be related to the changes under test.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org