Posted to commits@cassandra.apache.org by "Paulo Motta (JIRA)" <ji...@apache.org> on 2016/11/14 21:35:59 UTC
[jira] [Commented] (CASSANDRA-12901) Repair may hang if node dies during sync

    [ https://issues.apache.org/jira/browse/CASSANDRA-12901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15665105#comment-15665105 ] 

Paulo Motta commented on CASSANDRA-12901:
-----------------------------------------

tl;dr of CASSANDRA-3569: Before CASSANDRA-3569, stream sessions were killed when the FD reported that a node participating in streaming was dead. This sometimes caused false positives due to poorly configured FD thresholds, which was the reasoning behind relying exclusively on the socket timeout to detect streaming failures. This way, even if the FD marked a node as down, the session could still complete successfully as long as the streaming connection was not killed.

After CASSANDRA-3569, repair still relies on the FD to detect node failures during validation, but unregisters from the FD when starting the sync phase in order to rely exclusively on TCP to detect a node failure during streaming. This works for local sync tasks, where the coordinator detects the failure of the other stream participant via TCP. For remote sync tasks, however, if the stream initiator dies, the coordinator never receives its response, and repair hangs forever.
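
To make the hang concrete, here is a minimal toy model of the situation described above. It is only a sketch: ToyFailureDetector, ToyRepairSession and failsFastOnInitiatorDeath are hypothetical names invented for illustration and are not Cassandra's real classes or the actual patch. The flag models whether the coordinator unregisters from the FD after validation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArraySet;

// Toy model only: every class and method here is hypothetical, not Cassandra code.
final class RepairSketch {

    /** Stand-in for a failure detector that notifies registered listeners. */
    static final class ToyFailureDetector {
        private final Set<ToyRepairSession> listeners = new CopyOnWriteArraySet<>();

        void register(ToyRepairSession session)   { listeners.add(session); }
        void unregister(ToyRepairSession session) { listeners.remove(session); }

        /** Tell every registered listener that the endpoint is considered dead. */
        void convict(String endpoint) {
            for (ToyRepairSession session : listeners)
                session.onNodeDown(endpoint);
        }
    }

    /** Stand-in for the coordinator's view of a session with one remote sync task. */
    static final class ToyRepairSession {
        private final Set<String> participants;
        // Completed only by the remote initiator's response or by a failure notice.
        final CompletableFuture<String> remoteSyncResult = new CompletableFuture<>();

        ToyRepairSession(Set<String> participants) { this.participants = participants; }

        void onNodeDown(String endpoint) {
            if (participants.contains(endpoint))
                remoteSyncResult.completeExceptionally(
                    new RuntimeException(endpoint + " died during repair"));
        }
    }

    /**
     * Simulates the initiator of a remote sync task dying during the sync phase.
     * Returns true if the coordinator's pending result fails, false if it stays
     * pending (which in a real session would mean waiting forever).
     */
    static boolean failsFastOnInitiatorDeath(boolean unregisterAfterValidation) {
        ToyFailureDetector fd = new ToyFailureDetector();
        ToyRepairSession session =
            new ToyRepairSession(new HashSet<>(Arrays.asList("node2", "node3")));

        fd.register(session);            // validation phase: FD is in use
        if (unregisterAfterValidation)
            fd.unregister(session);      // sync phase: coordinator stops listening

        fd.convict("node2");             // the remote stream initiator dies
        return session.remoteSyncResult.isCompletedExceptionally();
    }
}
```

In this model, failsFastOnInitiatorDeath(true) leaves the result pending because nobody ever completes it, while failsFastOnInitiatorDeath(false) fails the session as soon as the FD convicts the node.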

One approach to fix this is to fail a repair session only on FD failure notices from initiators of remote sync tasks. But since we would still be using the FD for validation and for part of the sync tasks, a misbehaving FD would get repair sessions killed on false positives anyway, so I'm not sure the extra complexity of skipping the failure detector only for local sync tasks is worth it.

With this said, I propose restoring the old behavior of relying on the FD to detect node failures for the whole duration of the repair session, while retaining the improvements of CASSANDRA-3569 for long-running streaming sessions (rebuild, bootstrap, etc.). Since the idea with incremental repair is to have smaller sessions anyway, the impact of false positives on repair syncs should be limited. Right now we already use a higher phi threshold to increase the FD confidence on repair sessions, and if necessary we can further tune this or other FD parameters to reduce the likelihood of FD false positives during repair.
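
As a rough illustration of why a higher phi threshold reduces false positives, the sketch below computes a phi value under the simplifying assumption that heartbeat gaps are exponentially distributed around a mean interval. PhiSketch and its methods are hypothetical names, and the intervals and thresholds are illustrative numbers only, not the values Cassandra uses for repair.

```java
// Simplified phi-accrual model; names and numbers are illustrative assumptions.
final class PhiSketch {

    /**
     * Phi for a given silence period, assuming exponentially distributed
     * heartbeat gaps: phi = -log10(P(gap > silence)) = silence / (mean * ln 10).
     */
    static double phi(double millisSinceLastHeartbeat, double meanIntervalMillis) {
        return millisSinceLastHeartbeat / (meanIntervalMillis * Math.log(10.0));
    }

    /** A node is convicted once phi exceeds the configured threshold. */
    static boolean convicts(double millisSinceLastHeartbeat,
                            double meanIntervalMillis,
                            double threshold) {
        return phi(millisSinceLastHeartbeat, meanIntervalMillis) > threshold;
    }
}
```

With a mean heartbeat interval of 1000 ms, 20 seconds of silence gives a phi of roughly 8.7 in this model, so convicts(20000, 1000, 8) is true while convicts(20000, 1000, 16) is false: a higher threshold tolerates longer pauses before declaring a participant dead, at the cost of slower detection of real failures.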

I added a dtest that reproduces the issue. The patch with this proposal is very simple: it basically removes the section that unregisters from the FD after validation, making repair fail fast if any participant dies (as reported by the FD) during a repair session:
||2.2||3.0||3.X||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.X...pauloricardomg:3.X-12901]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-12901]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:12901]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-dtest/lastCompletedBuild/testReport/]|


> Repair may hang if node dies during sync
> ----------------------------------------
>
>                 Key: CASSANDRA-12901
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12901
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>
> Since the repair coordinator unregisters from the FD after validation (CASSANDRA-3569), if the initiator of a RemoteSyncTask fails, the coordinator will never know the sync task failed and will hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)