Posted to commits@cassandra.apache.org by "Matt Byrd (JIRA)" <ji...@apache.org> on 2017/06/19 23:03:00 UTC
[jira] [Comment Edited] (CASSANDRA-13480) nodetool repair can hang
forever if we lose the notification for the repair completing/failing
[ https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054909#comment-16054909 ]
Matt Byrd edited comment on CASSANDRA-13480 at 6/19/17 11:02 PM:
-----------------------------------------------------------------
||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|
|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|
|[dtests|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|
was (Author: mbyrd):
||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|[dtests|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|
> nodetool repair can hang forever if we lose the notification for the repair completing/failing
> ----------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Matt Byrd
> Assignee: Matt Byrd
> Priority: Minor
> Fix For: 4.x
>
>
> When a JMX lost-notification event occurs, the notification that was lost is sometimes the one that lets RepairRunner know the repair has finished (ProgressEventType.COMPLETE, or ERROR for that matter).
> This results in the nodetool process running the repair hanging forever.
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this, on receiving a notification that notifications have been lost (JMXConnectionNotification.NOTIFS_LOST), we query a new endpoint via JMX to retrieve all the notifications we're interested in, so that we can replay those we missed and avoid this scenario.
> It's also possible that the JMXConnectionNotification.NOTIFS_LOST notification itself is lost, so for good measure I have made RepairRunner poll periodically to check for notifications that were sent but that we didn't receive (scoped to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you have any questions or can think of a different approach. I also tried setting:
> JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help in certain scenarios, but in this test we don't even send that many notifications, so I'm not surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with JMX, as far as I can tell.
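
For readers who go via JMX directly, the replay-plus-poll approach described above could look roughly like the sketch below. It is not the code from the linked branch: the class name ReplayingRepairListener, the storedEvents supplier (a stand-in for the new server-side endpoint), and the convention that live events carry their type name as user data are all assumptions made for illustration. Only JMXConnectionNotification.NOTIFS_LOST and the NotificationListener interface come from the JMX API itself.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

/** Sketch of a client-side listener that cannot hang on a lost terminal event. */
class ReplayingRepairListener implements NotificationListener {
    // Hypothetical stand-in for the new server-side endpoint: returns the
    // event type names the server has retained for this repair's tag.
    private final Supplier<List<String>> storedEvents;
    private final CountDownLatch finished = new CountDownLatch(1);
    private volatile String outcome; // "COMPLETE" or "ERROR" once known

    ReplayingRepairListener(Supplier<List<String>> storedEvents) {
        this.storedEvents = storedEvents;
    }

    @Override
    public void handleNotification(Notification n, Object handback) {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(n.getType())) {
            replayStoredEvents(); // some events were dropped: ask the server
        } else if (n.getUserData() instanceof String) {
            // Assumption: live events carry their type name as user data.
            onEvent((String) n.getUserData());
        }
    }

    /** Re-reads the retained events; safe to call repeatedly from a poller. */
    void replayStoredEvents() {
        for (String type : storedEvents.get()) {
            onEvent(type);
        }
    }

    private void onEvent(String type) {
        // Idempotent, so replaying an already-seen event is harmless.
        if ("COMPLETE".equals(type) || "ERROR".equals(type)) {
            outcome = type;
            finished.countDown();
        }
    }

    /** Returns the terminal event type, or null if the repair is still running. */
    String outcome() {
        return outcome;
    }

    /** Waits for a terminal event, re-querying the server every pollMillis. */
    String awaitOutcome(long pollMillis) throws InterruptedException {
        while (!finished.await(pollMillis, TimeUnit.MILLISECONDS)) {
            replayStoredEvents(); // covers NOTIFS_LOST itself being lost
        }
        return outcome;
    }
}
```

Because onEvent is idempotent, the replay path does not need to track which events were already delivered live. A real client would more likely de-duplicate by sequence number, but for deciding whether the repair has finished, seeing the terminal event twice does no harm.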
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)