Posted to commits@cassandra.apache.org by "Carl Yeksigian (JIRA)" <ji...@apache.org> on 2013/01/01 20:24:12 UTC

[jira] [Commented] (CASSANDRA-4554) Log when a node is down longer than the hint window and we stop saving hints

    [ https://issues.apache.org/jira/browse/CASSANDRA-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541900#comment-13541900 ] 

Carl Yeksigian commented on CASSANDRA-4554:
-------------------------------------------

I've started working on this issue. Recording that a node needs repair is easy, but tracking the repair is difficult, since only the nodes participating in the repair know its state.

I'll outline the case that has me stumped. For simplicity, I assume that Node 1 overlaps only with Node 2.
- Node 1 goes down and stays down longer than the hint window
- Node 3 stops saving hints for Node 1 and marks Node 1 as needing repair
- Node 1 comes back online
- Node 4 starts a repair between Node 2 and Node 1 by forwarding the streaming repair task
- Node 1 is now up to date and no longer needs repair; Node 2 and Node 4 know this from tracking the repair task
- Node 3 never learns this and continues to see Node 1 as needing repair; its state can only be updated if Node 3 itself initiates a repair
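
To make the stale-state problem concrete, here is a minimal simulation of the steps above. It is only a sketch and not Cassandra code: the Node class, hint_window_expired and run_repair are hypothetical names, and each node's view is modeled as a plain local set with no gossip.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Local view only: names of peers this node believes need repair.
    needs_repair: set = field(default_factory=set)


def hint_window_expired(observer: Node, down_node: Node) -> None:
    # The observer stops saving hints and flags the down node locally.
    observer.needs_repair.add(down_node.name)


def run_repair(coordinator: Node, participants: list, repaired: Node) -> None:
    # Only the coordinator and the participants learn that the repair finished.
    for node in [coordinator, *participants]:
        node.needs_repair.discard(repaired.name)


nodes = {i: Node(f"node{i}") for i in range(1, 5)}

hint_window_expired(nodes[3], nodes[1])                # Node 3 flags Node 1
run_repair(nodes[4], [nodes[1], nodes[2]], nodes[1])   # Node 4 coordinates the repair

# Nodes that still believe Node 1 needs repair.
stale = [n.name for n in nodes.values() if "node1" in n.needs_repair]
print(stale)  # ['node3']
```

Node 3 is the only node left with the flag set, because nothing in this model carries the repair result to non-participants.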

Because of this, I think the hint state would need to be gossiped. Also, because repairs run per column family (cf), the gossiped object needs to be tracked per cf rather than per node, and the existing application state isn't granular enough to capture this additional state.
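
As a rough illustration of what per-cf state could look like, the sketch below keys the flag by (endpoint, keyspace, cf) and merges two views by keeping the entry with the higher version. Everything here is an assumption made for illustration: the key layout, the integer versions, and the merge function are hypothetical and do not correspond to Cassandra's gossip API, and the sketch does not address who assigns versions when several nodes mark the same cf.

```python
from typing import Dict, Tuple

# Hypothetical layout: (endpoint, keyspace, cf) -> (version, needs_repair)
RepairState = Dict[Tuple[str, str, str], Tuple[int, bool]]


def merge(local: RepairState, remote: RepairState) -> RepairState:
    """Return a new view that keeps, per key, the entry with the higher version."""
    merged = dict(local)
    for key, (version, flag) in remote.items():
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, flag)
    return merged


# Node 3 flagged both cfs on Node 1 when the hint window expired.
node3_view: RepairState = {
    ("node1", "ks", "users"): (1, True),
    ("node1", "ks", "events"): (1, True),
}
# Node 2 took part in a repair of the "users" cf only.
node2_view: RepairState = {
    ("node1", "ks", "users"): (2, False),
}

node3_view = merge(node3_view, node2_view)
```

After the merge, Node 3 sees "users" as repaired while "events" is still flagged, which is the per-cf granularity a single per-node value could not express.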

I think the possibilities are:
# My example is wrong and I'm missing a component
# The new value needs to be gossiped
# The new value can be incorporated into application state somehow
# Coordinator tells all nodes about the state of the repair. In this case, down nodes would not receive these updates
# A node can only clear its needs-repair state for a peer by executing the repair itself. Since the state is only informative, this may make sense, but it seems misleading
                
> Log when a node is down longer than the hint window and we stop saving hints
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4554
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4554
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>            Assignee: Carl Yeksigian
>            Priority: Minor
>             Fix For: 1.2.1
>
>
> We know that we need to repair whenever we lose a node or disk permanently (since it may have had undelivered hints on it), but without exposing this we don't know when nodes stop saving hints for a temporarily dead node, unless we're paying very close attention to external monitoring.
