Posted to commits@cassandra.apache.org by "Paulo Motta (JIRA)" <ji...@apache.org> on 2016/03/02 02:07:18 UTC

[jira] [Commented] (CASSANDRA-11286) streaming socket never times out

    [ https://issues.apache.org/jira/browse/CASSANDRA-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174751#comment-15174751 ] 

Paulo Motta commented on CASSANDRA-11286:
-----------------------------------------

To verify that the socket timeout was indeed not being respected, I added a new property {{cassandra.dtest.sleep_during_stream_write}} to [this branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:11286-unpatched], set it to a value much longer than {{streaming_socket_timeout_in_ms}} in [this dtest|https://github.com/pauloricardomg/cassandra-dtest/tree/11286], and verified that the bootstrap streaming session hung forever.

Two changes are necessary for the streaming socket timeout to be enforced during a stream session:
* Creating the read channel via {{Channels.newChannel(socket.getInputStream())}} in {{ConnectionHandler.getReadChannel(socket)}}, as suggested in the [blog post|https://technfun.wordpress.com/2009/01/29/networking-in-java-non-blocking-nio-blocking-nio-and-io/].
* Setting the socket timeout on the follower side, in {{IncomingStreamingConnection}}.
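
As a minimal sketch of the first change (not the actual patch), the snippet below opens a loopback connection on which the peer never writes, sets {{SO_TIMEOUT}} on the socket, and reads through a channel wrapped around the socket's input stream. The class and method names ({{StreamTimeoutSketch}}, {{readTimesOut}}) are hypothetical and only illustrate the behavior described above; reading through {{socket.getChannel()}} directly would instead block regardless of the timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SocketChannel;

public class StreamTimeoutSketch {

    /**
     * Returns true if a read on a silent connection fails with
     * SocketTimeoutException when going through the stream-backed channel.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             SocketChannel client = SocketChannel.open(
                     new InetSocketAddress(loopback, server.getLocalPort()));
             Socket silentPeer = server.accept()) { // the peer never writes anything

            Socket socket = client.socket();
            socket.setSoTimeout(timeoutMillis);

            // Wrap the *stream*, not the SocketChannel: stream reads honor SO_TIMEOUT.
            ReadableByteChannel in = Channels.newChannel(socket.getInputStream());
            try {
                in.read(ByteBuffer.allocate(16));
                return false; // data or EOF arrived, no timeout
            } catch (SocketTimeoutException e) {
                return true;  // the timeout was enforced
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The timeout has to be set on both ends of the connection for this to help, which is why the second change sets it on the follower side as well.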

After these changes, I re-executed the previous dtest on [a fixed branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:11286-testing] and verified that the bootstrap stream session did not hang, but instead failed. The stream fails rather than being retried because {{SocketTimeoutException}} is an {{IOException}}, so it's caught by [this block|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/streaming/messages/IncomingFileMessage.java#L52] (and not [this|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/streaming/messages/IncomingFileMessage.java#L58]) on {{IncomingFileMessage}}.
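
The catch-ordering effect can be illustrated with a small standalone sketch. It is not the code from {{IncomingFileMessage}}: the names {{CatchOrderSketch}}, {{StreamRead}} and {{Outcome}} are hypothetical, and it assumes the first block fails the session while the second, broader block is the one that leads to a retry.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class CatchOrderSketch {

    enum Outcome { COMPLETED, FAILED, RETRIED }

    /** Hypothetical stand-in for reading one incoming file from the stream. */
    interface StreamRead {
        void read() throws Exception;
    }

    static Outcome receive(StreamRead read) {
        try {
            read.read();
            return Outcome.COMPLETED;
        } catch (IOException e) {
            // SocketTimeoutException extends InterruptedIOException, which
            // extends IOException, so a socket timeout always lands here.
            return Outcome.FAILED;
        } catch (Throwable t) {
            // Assumed retry path: only reached by non-IOException errors.
            return Outcome.RETRIED;
        }
    }

    public static void main(String[] args) {
        System.out.println(receive(() -> { throw new SocketTimeoutException("read timed out"); }));
        System.out.println(receive(() -> { throw new IllegalStateException("corrupted data"); }));
    }
}
```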

This behavior of failing on socket timeout differs from what is documented in {{cassandra.yaml}}:
{noformat}
# Enable socket timeout for streaming operation.
# When a timeout occurs during streaming, streaming is retried from the start
# of the current file. This _can_ involve re-streaming an important amount of
# data, so you should avoid setting the value too low.
# Default value is 3600000, which means streams timeout after an hour.
# streaming_socket_timeout_in_ms: 3600000
{noformat}

So I updated it to:
{noformat}
# Set socket timeout for streaming operation.
# The stream session is failed if no data is received by any of the
# participants within that period.
# Default value is 3600000, which means streams timeout after an hour.
# streaming_socket_timeout_in_ms: 3600000
{noformat}

While retrying when receiving corrupted data is probably the right approach, I'm not sure retrying on a socket timeout is desirable here. I can see two main reasons for the socket to time out:
* The connection was broken/reset on only one side of the socket (a rare but possible situation)
* Deadlock or protocol error on the sender side

In both scenarios, I think failing the stream is the correct approach, rather than retrying and dealing with unexpected error conditions. WDYT [~yukim]?

Below are branches with the suggested changes and tests.

||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-11286]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-11286]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-dtest/lastCompletedBuild/testReport/]|

commit info: minor conflict on 2.2, but other than that it merges cleanly upwards.

> streaming socket never times out
> --------------------------------
>
>                 Key: CASSANDRA-11286
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11286
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>
> While trying to reproduce CASSANDRA-8343 I was not able to trigger a {{SocketTimeoutException}} by adding an artificial sleep longer than {{streaming_socket_timeout_in_ms}}.
> After investigation, I detected two problems:
> * {{ReadableByteChannel}} creation via {{socket.getChannel()}}, as done in {{ConnectionHandler.getReadChannel(socket)}}, does not respect {{socket.setSoTimeout()}}, as explained in this [blog post|https://technfun.wordpress.com/2009/01/29/networking-in-java-non-blocking-nio-blocking-nio-and-io/]
> ** bq. The only difference between “blocking NIO” and “NIO wrapped around IO” is that you can’t use socket timeout with SocketChannels. Why ? Read a javadoc for setSocketTimeout(). It says that this timeout is used only by streams.
> * {{socketSoTimeout}} is never set on "follower" side, only on initiator side via {{DefaultConnectionFactory.createConnection(peer)}}.
> This may cause streaming to hang indefinitely, as exemplified by CASSANDRA-8621:
> bq. For the scenario that prompted this ticket, it appeared that the streaming process was completely stalled. One side of the stream (the sender side) had an exception that appeared to be a connection reset. The receiving side appeared to think that the connection was still active, at least in terms of the netstats reported by nodetool. We were unable to verify whether this was specifically the case in terms of connected sockets due to the fact that there were multiple streams for those peers, and there is no simple way to correlate a specific stream to a tcp session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)