Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2011/06/01 15:53:47 UTC

[jira] [Commented] (CASSANDRA-2405) should expose 'time since last successful repair' for easier aes monitoring

    [ https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042175#comment-13042175 ] 

Sylvain Lebresne commented on CASSANDRA-2405:
---------------------------------------------

This needs rebasing. First, a few small remarks:
  * It seems we store the time in microseconds, but when computing the time since last repair we use System.currentTimeMillis() - stored_time, which mixes milliseconds and microseconds.
  * I would be in favor of calling the system table REPAIR_INFO, because I think it would make sense to record a number of other statistics on repair, and it doesn't hurt to make the system table less specific. That also means we should probably not force any type for the value (though that can easily be changed later, so it's not a big deal for this patch).
  * I think we usually put the code to query the system table in SystemTable, so I would move it from AntiEntropy to there.
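
To illustrate the unit mismatch from the first remark, here is a minimal sketch. The class and method names (RepairAge, ageMillis, naiveAgeMillis) are hypothetical and not taken from the patch; the only assumption is that the stored timestamp is in microseconds while the clock read is in milliseconds.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not code from the patch.
final class RepairAge {
    private RepairAge() {}

    // Mixed units: subtracts a microsecond timestamp from a millisecond clock.
    static long naiveAgeMillis(long nowMillis, long storedMicros) {
        return nowMillis - storedMicros;
    }

    // Converts the stored value to milliseconds before subtracting.
    static long ageMillis(long nowMillis, long storedMicros) {
        return nowMillis - TimeUnit.MICROSECONDS.toMillis(storedMicros);
    }

    public static void main(String[] args) {
        long nowMillis = System.currentTimeMillis();
        // Pretend the last repair was recorded one minute ago, in microseconds.
        long storedMicros = TimeUnit.MILLISECONDS.toMicros(nowMillis - 60_000L);
        System.out.println("correct age (ms): " + ageMillis(nowMillis, storedMicros));
        System.out.println("mixed-unit age (ms): " + naiveAgeMillis(nowMillis, storedMicros));
    }
}
```

With mixed units the result is a large negative number rather than an age, so any monitoring threshold built on it would be meaningless.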

Then more generally, a given repair involves multiple states and multiple nodes, so I don't think keeping only one timestamp is enough. Right now, we save the time of the last scheduled validation compaction on each node. With only that, we're missing information people need to make a reasonably informed decision:
    * First, this does not correspond to the last repair session started on that node, since the validation can be a request from another node. People may be interested in that information.
    * Second, given that repair concerns a given range, keeping only one general number is wrong (it would suggest the node has been repaired recently even when only one range out of 3 or 5 has actually been repaired).
    * Third, though recording the start of the validation compaction is important, this says nothing about the success of the repair (and we all know failures during repair do happen, if only because it's a fairly long operation during which nodes can die). So we need to record some info on the success of the operation if we don't want to return misleading information. It turns out this is easy to record on the node coordinating the repair, maybe not so much on the other nodes participating in the repair.

Truth is, I'm not sure what the simplest way to handle this is. One option could be to record the start and end time of a repair session only on the coordinator of the repair (along with which range was repaired).
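
A rough sketch of what that coordinator-side bookkeeping might look like follows. Everything here is an assumption made for illustration: the class RangeRepairTracker, its methods, and the use of a plain string to identify a range do not come from the patch or from Cassandra's code. The point is only that "time since last successful repair" is tracked per range and updated only when a session actually finishes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical in-memory bookkeeping on the repair coordinator.
final class RangeRepairTracker {
    // Range -> start time of the session currently in flight.
    private final Map<String, Long> startedAt = new HashMap<>();
    // Range -> end time of the last session that completed successfully.
    private final Map<String, Long> lastSuccessAt = new HashMap<>();

    void sessionStarted(String range, long nowMillis) {
        startedAt.put(range, nowMillis);
    }

    // Called only once the whole session for this range has completed.
    void sessionSucceeded(String range, long nowMillis) {
        if (startedAt.remove(range) != null) {
            lastSuccessAt.put(range, nowMillis);
        }
    }

    // A failed session leaves the last successful time untouched.
    void sessionFailed(String range) {
        startedAt.remove(range);
    }

    // Empty when the range has never been repaired successfully.
    OptionalLong millisSinceLastSuccess(String range, long nowMillis) {
        Long t = lastSuccessAt.get(range);
        return t == null ? OptionalLong.empty() : OptionalLong.of(nowMillis - t);
    }
}
```

A monitoring script would then ask about every range the node is responsible for and alert on the oldest (or missing) value, rather than on a single node-wide number.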

Also, what do people think of keeping a history (instead of just keeping the last number)? I'm thinking a little bit ahead here, but what about storing one supercolumn per repair, where the supercolumn name would be the repair session id (a TimeUUID really) and the columns would hold the info on that repair? For this patch we would only record the range for that session, the start time and the end time (or maybe one end time for each node). But we would populate this a little bit further with stuff like CASSANDRA-2698. I think having such a history would be fairly interesting.
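
To make the proposed layout concrete, here is a small in-memory sketch of "one supercolumn per repair session". It is not Cassandra code: RepairHistory and the column names "range", "start" and "end" are hypothetical, and UUID.randomUUID() is only a stand-in for the TimeUUID suggested above (java.util.UUID does not generate time-based UUIDs).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical model: session id -> (column name -> value).
final class RepairHistory {
    private final Map<UUID, Map<String, String>> sessions = new LinkedHashMap<>();

    // Records the range and start time; returns the new session id.
    UUID begin(String range, long startMillis) {
        UUID id = UUID.randomUUID(); // stand-in for a TimeUUID
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("range", range);
        columns.put("start", Long.toString(startMillis));
        sessions.put(id, columns);
        return id;
    }

    // Adds the end time to an existing session.
    void end(UUID id, long endMillis) {
        Map<String, String> columns = sessions.get(id);
        if (columns == null) {
            throw new IllegalArgumentException("unknown session " + id);
        }
        columns.put("end", Long.toString(endMillis));
    }

    // A session without an "end" column never completed.
    boolean finished(UUID id) {
        Map<String, String> columns = sessions.get(id);
        return columns != null && columns.containsKey("end");
    }

    Map<String, String> columns(UUID id) {
        return sessions.get(id);
    }

    int size() {
        return sessions.size();
    }
}
```

Because each session keeps its own set of columns, further per-repair statistics could be added later as extra columns without changing the layout.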


> should expose 'time since last successful repair' for easier aes monitoring
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2405
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 0.8.1
>
>         Attachments: CASSANDRA-2405-v2.patch, CASSANDRA-2405.patch
>
>
> The practical implementation issues of actually ensuring repair runs is somewhat of an undocumented/untreated issue.
> One hopefully low hanging fruit would be to at least expose the time since last successful repair for a particular column family, to make it easier to write a correct script to monitor for lack of repair in a non-buggy fashion.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira