You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Brandon Williams (Jira)" <ji...@apache.org> on 2021/05/26 19:24:00 UTC

[jira] [Comment Edited] (CASSANDRA-16677) Fix flaky ConnectionTest.testMessageDeliveryOnReconnect

    [ https://issues.apache.org/jira/browse/CASSANDRA-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352030#comment-17352030 ] 

Brandon Williams edited comment on CASSANDRA-16677 at 5/26/21, 7:23 PM:
------------------------------------------------------------------------

I'm giving up on this for now, as it can only damage my calm further to continue.

What I've found is that sometimes you will see a stack that seems to indicate the channel was already closed:

{noformat}
[junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-05-26 16:38:31,164 OutboundConnection.java:492 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-[no-channel] dropping message of type _TEST_1 due to error
[junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: This output stream is in an unsafe state after an asynchronous flush failed
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:209)
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:166)
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:238)
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:259)
[junit-timeout]         at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:117)
[junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:1002)
[junit-timeout]         at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:691)
[junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:960)
[junit-timeout]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[junit-timeout]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[junit-timeout]         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[junit-timeout]         at java.lang.Thread.run(Thread.java:748)
{noformat}

if you add a test for the channel being open before this, it passes.  Other times the test will fail in what appears to be the same manner, but there will be no stacktrace at all.  Most of the failures occur with LARGE_MESSAGES, but given enough attempts there is occasionally one in SMALL_MESSAGES.

Then there is the unique stack that Berenguer linked, which I have yet to encounter myself.  It is odd too, in that according to it both small and large messages are expecting 30 bytes and only getting 29, but all my debugging shows 29 bytes working just fine for large messages, and 11 for small (where the stack wants 14, receiving 13.)


was (Author: brandon.williams):
I'm giving up on this for now, as it can only damage my calm further to continue.

What I've found is that sometimes you will see a stack that seems to indicate the channel was already closed:

{noformat}
[junit-timeout] WARN  [Messaging-OUT-/127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES] 2021-05-26 16:38:31,164 OutboundConnection.java:492 - /127.0.0.1:7012->/127.0.0.1:7012-LARGE_MESSAGES-[no-channel] dropping message of type _TEST_1 due to error
[junit-timeout] org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: This output stream is in an unsafe state after an asynchronous flush failed
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:209)
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:166)
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:238)
[junit-timeout]         at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:259)
[junit-timeout]         at org.apache.cassandra.net.AsyncMessageOutputPlus.close(AsyncMessageOutputPlus.java:117)
[junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:1002)
[junit-timeout]         at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:691)
[junit-timeout]         at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:960)
[junit-timeout]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[junit-timeout]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[junit-timeout]         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[junit-timeout]         at java.lang.Thread.run(Thread.java:748)
{noformat}

if you add a test for the channel being open before this, it passes.  Other times the test will fail in what appears to be the same manner, but there will be no stacktrace at all.  Most of the failures occur with LARGE_MESSAGES, but given enough attemps there is occasionally one in SMALL_MESSAGES.

Then there is the unique stack that Berenguer linked, which I have yet to encounter myself.  It is odd too, in that according to it both small and large messages are expecting 30 bytes and only getting 29, but all my debugging shows 29 bytes working just fine for large messages, and 11 for small (where the stack wants 14, receiving 13.)

> Fix flaky ConnectionTest.testMessageDeliveryOnReconnect
> -------------------------------------------------------
>
>                 Key: CASSANDRA-16677
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16677
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Internode
>            Reporter: Brandon Williams
>            Priority: Normal
>
> https://ci-cassandra.apache.org/job/Cassandra-devbranch/785/testReport/junit/org.apache.cassandra.net/ConnectionTest/testMessageDeliveryOnReconnect_compression/
> https://app.circleci.com/pipelines/github/adelapena/cassandra/460/workflows/cf7dcec6-612c-45d1-8471-623bde481dca/jobs/4069
> https://app.circleci.com/pipelines/github/adelapena/cassandra/460/workflows/b750cd38-0263-4b5e-9bb8-a8be98214378/jobs/4065



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org