You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Yifan Cai (Jira)" <ji...@apache.org> on 2020/03/10 02:56:00 UTC

[jira] [Comment Edited] (CASSANDRA-15338) Fix flakey testMessagePurging - org.apache.cassandra.net.ConnectionTest

    [ https://issues.apache.org/jira/browse/CASSANDRA-15338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055533#comment-17055533 ] 

Yifan Cai edited comment on CASSANDRA-15338 at 3/10/20, 2:55 AM:
-----------------------------------------------------------------

The test failure was not able to be reproduced when simply running it from my laptop. 
 
However, it can be easily reproduced when running in a docker container with limited CPUs (i.e., 2). 
 
After multiple runs, the observation was that the test runs only failed when testing with LargeMessage. It indicated that the failures were probably related with {{LargeMessageDelivery}}. 
 
The following is what I think have happened. 
# When the {{inbound}} just opened and the first message gets queued into the {{outbound}}, handshake happens and the execution was deferred once the connection was established (executeAgain). 
# Since enqueue is not blocking, the next line, {{unsafeRunOnDelivery}} runs immediately. The effect is that the runnable gets registered, but not run yet. 
# Connection is established, so we {{executeAgain()}}. Because the runnable {{stopAndRun}} is present, and at this point, the {{inProgress}} flag is still false. The test runs the runnable, which counts down {{deliveryDone}} unexpectedly. 
# Delivery proceeds to flush the message. In {{LargeMessageDelivery}}, the flush is async and race condition can happen.
   ## when the inbound has received message (and countdown receiveDone)
   ## {{LargeMessageDelivery}} is still polling for the completion of flush, so not yet release capacity. 

Therefore, the assertion on the pendingCount failed. 
 
There are 2 places in the test flow are (or can go) wrong. See step 3 and step 4. 

Regarding step 3, the runnable {{stopAndRun}} should not be registered when establishing the connection. In production, is there a case that a {{stopAndRun}} being registered this early? Probably not.

Regarding step 4, the {{outbound}} has no knowledge about whether the {{inbound}} has received any message. Test should register the runnable {{stopAndRun}} at the message handler to count down the {{deliveryDone}}. Therefore, the runnable can correctly wait for the current delivery to complete. Then it runs. 
 
PR is here: https://github.com/apache/cassandra/pull/466

As mentioned, I reproduced using the docker. Here is the bundle that one can simply download and run.  [^CASS-15338-Docker.zip] It runs {{ConnectionTest}} repeatedly until failures.
I have included the patch within the image too. 

To reproduce, run
{code:bash}
bash build_and_run.sh
{code}

To see the runs with the patch, run
{code:bash}
bash build_and_run.sh patched
{code}


was (Author: yifanc):
The test failure was not able to reproduce when simply running it from my laptop. 
 
However, it can be easily reproduced when running in a docker container with limited CPUs (i.e., 2). 
 
After multiple runs, the observation was that the test runs only failed when testing with LargeMessage. It indicated that the failures were probably related with {{LargeMessageDelivery}}. 
 
The following is what I think have happened. 
# When the {{inbound}} just opened and the first message gets queued into the {{outbound}}, handshake happens and the execution was deferred once the connection was established (executeAgain). 
# Since enqueue is not blocking, the next line, {{unsafeRunOnDelivery}} runs immediately. The effect is that the runnable gets registered, but not run yet. 
# Connection is established, so we {{executeAgain()}}. Because the runnable {{stopAndRun}} is present, and at this point, the {{inProgress}} flag is still false. The test runs the runnable, which counts down {{deliveryDone}} unexpectedly. 
# Delivery proceeds to flush the message. In {{LargeMessageDelivery}}, the flush is async and race condition can happen.
   ## when the inbound has received message (and countdown receiveDone)
   ## {{LargeMessageDelivery}} is still polling for the completion of flush, so not yet release capacity. 

Therefore, the assertion on the pendingCount failed. 
 
There are 2 places in the test flow are (or can go) wrong. See step 3 and step 4. 

Regarding step 3, the runnable {{stopAndRun}} should not be registered when establishing the connection. In production, is there a case that a {{stopAndRun}} being registered this early? Probably not.

Regarding step 4, the {{outbound}} has no knowledge about whether the {{inbound}} has received any message. Test should register the runnable {{stopAndRun}} at the message handler to count down the {{deliveryDone}}. Therefore, the runnable can correctly wait for the current delivery to complete. Then it runs. 
 
PR is here: https://github.com/apache/cassandra/pull/466

As mentioned, I reproduced using the docker. Here is the bundle that one can simply download and run.  [^CASS-15338-Docker.zip] It runs {{ConnectionTest}} repeatedly until failures.
I have included the patch within the image too. 

To reproduce, run
{code:bash}
bash build_and_run.sh
{code}

To see the runs with the patch, run
{code:bash}
bash build_and_run.sh patched
{code}

> Fix flakey testMessagePurging - org.apache.cassandra.net.ConnectionTest
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-15338
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15338
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: David Capwell
>            Assignee: Yifan Cai
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0-alpha
>
>         Attachments: CASS-15338-Docker.zip
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example failure: [https://circleci.com/gh/dcapwell/cassandra/11#artifacts/containers/1]
>   
> {code:java}
> Testcase: testMessagePurging(org.apache.cassandra.net.ConnectionTest):  FAILED
>  expected:<0> but was:<1>
>  junit.framework.AssertionFailedError: expected:<0> but was:<1>
>    at org.apache.cassandra.net.ConnectionTest.lambda$testMessagePurging$38(ConnectionTest.java:625)
>    at org.apache.cassandra.net.ConnectionTest.doTestManual(ConnectionTest.java:258)
>    at org.apache.cassandra.net.ConnectionTest.testManual(ConnectionTest.java:231)
>    at org.apache.cassandra.net.ConnectionTest.testMessagePurging(ConnectionTest.java:584){code}
>   
>  Looking closer at org.apache.cassandra.net.OutboundConnection.Delivery#stopAndRun it seems that the run method is called before org.apache.cassandra.net.OutboundConnection.Delivery#doRun which may lead to a test race condition where the CountDownLatch completes before executing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org