You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Benjamin Coverston (JIRA)" <ji...@apache.org> on 2011/04/07 17:42:08 UTC

[jira] [Created] (CASSANDRA-2433) Failed Streams Break Repair

Failed Streams Break Repair
---------------------------

                 Key: CASSANDRA-2433
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.7.4
            Reporter: Benjamin Coverston
            Assignee: Sylvain Lebresne
             Fix For: 0.7.5


Running repair in cases where a stream fails we are seeing multiple problems.

1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
2. The temp files are left behind and multiple failures can end up filling up the data partition.

These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.

This issue is also being worked on w.r.t CASSANDRA-208, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181 ] 

Stu Hood edited comment on CASSANDRA-2433 at 5/23/11 8:10 PM:
--------------------------------------------------------------

0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?
* success might need to be volatile as well

Thanks Sylvain!

      was (Author: stuhood):
    0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!
  
> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair-v2.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 0004-Reports-validation-compaction-errors-back-to-repair.patch
                0003-Report-streaming-errors-back-to-repair.patch
                0002-Register-in-gossip-to-handle-node-failures.patch
                0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0004-Reports-validation-compaction-errors-back-to-repair-v3.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0002-Register-in-gossip-to-handle-node-failures.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093832#comment-13093832 ] 

Jonathan Ellis commented on CASSANDRA-2433:
-------------------------------------------

- Why do we need the new AE_SESSIONS stage?
- I prefer using WrappedRunnable to a Callable when you want to allow exceptions but don't care about a return value
- I think we can avoid a bunch of no-op onConvicts if RepairSession were to subscribe to FD directly instead of going through Gossip (i.e., leave IEndpointStateChangeSubscriber unchanged and expose convict in IFailureDetectionEventListener for when we need to go low-level).  Gossip is about high-level "events" which doesn't really fit here.

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 2433_v2.patch

Attached v2 is rebased and use a higher conviction threshold before deciding to fail the repair, as the goal here is to avoid having a repair getting stuck for hours, but we want to avoid stopping a repair just because a node got into a longer than usual GC pause.

The threshold used is twice the configured phi_convict_threshold. This give 16 by default, which if I trust the original 'phi accrual failure detection' should give an order of magnitude less false positive than 8 (for about an order of magnitude in the detection time though). It feels reasonable to me but if a FD specialist want to voice his opinion, please do.

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 0004-Reports-validation-compaction-errors-back-to-repair-v3.patch
                0003-Report-streaming-errors-back-to-repair-v3.patch
                0002-Register-in-gossip-to-handle-node-failures-v3.patch
                0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v3.patch

Attaching v3 rebased (on 0.8).

bq. Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

The thing is that repair session are very long lived. And MISC is single threaded. So that would block other task that are not supposed to block. We could make MISC multi-threaded but even then it's not a good idea to mix short lived and long lived task on the same stage.

bq. I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it

Done in v3.

bq. Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?

Good idea, it's slightly simpler, done in v3.

bq. The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

Done in v3. That will probably change with CASSANDRA-2610 anyway (which I have to update)

bq. Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

I did not change that, but if it's a problem for retries to not be volatile, I suspect having StreamInSession.current not volatile is also a problem. But really I'd be curious to see that be a problem.

bq. Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

I don't think it is. Or more precisely, if you do send half-built tree, you'll have to be careful that the other doesn't consider what's missing as ranges not being in sync (I don't think people will be happy with tons of data being stream just because we happen to have a bug that make compaction throw an exception during the validation). So I think you cannot do much with a half-built tree, and it will add complication. For a case where people will need to restart a repair anyway once whatever happened is fixed

bq. success might need to be volatile as well

Done in v3.


> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v3.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 0002-Register-in-gossip-to-handle-node-failures-v3.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair-v2.patch, 0003-Report-streaming-errors-back-to-repair-v3.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v3.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch
                0003-Report-streaming-errors-back-to-repair-v2.patch
                0002-Register-in-gossip-to-handle-node-failures-v2.patch
                0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch

Attaching rebased patch (against 0.8.1). It also change the behavior a little bit so as to not fail repair right away if a problem occur (it still throw an exception at the end if any problem had occured). It turns out to be slightly simpler that way. Especially for CASSANDRA-1610.

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair-v2.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0002-Register-in-gossip-to-handle-node-failures-v2.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0003-Report-streaming-errors-back-to-repair-v3.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094701#comment-13094701 ] 

Hudson commented on CASSANDRA-2433:
-----------------------------------

Integrated in Cassandra-0.8 #306 (See [https://builds.apache.org/job/Cassandra-0.8/306/])
    Make repair report failure when a participating node dies
patch by slebresne; reviewed by jbellis for CASSANDRA-2433

slebresne : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163677
Files : 
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/gms/FailureDetector.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/gms/Gossiper.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/gms/IEndpointStateChangeSubscriber.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/gms/IFailureDetectionEventListener.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/AntiEntropyService.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/MigrationManager.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/StorageLoadBalancer.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/StorageService.java
* /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/service/AntiEntropyServiceTestAbstract.java


> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch, 2433_v3.patch, 2433_v4.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 2433_v3.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0003-Report-streaming-errors-back-to-repair.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0003-Report-streaming-errors-back-to-repair.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094631#comment-13094631 ] 

Jonathan Ellis commented on CASSANDRA-2433:
-------------------------------------------

+1

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch, 2433_v3.patch, 2433_v4.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 2433_v3.patch

(Sorry, I had attached the wrong version of v3, corrected now)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch, 2433_v3.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2433:
--------------------------------------

             Reviewer: stuhood
          Component/s: Core
    Affects Version/s:     (was: 0.7.4)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair-v2.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0004-Reports-validation-compaction-errors-back-to-repair.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v3.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181 ] 

Stu Hood edited comment on CASSANDRA-2433 at 5/23/11 8:09 PM:
--------------------------------------------------------------

0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?

0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread

0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)

0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!

      was (Author: stuhood):
    0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?
0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread
0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)
0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!
  
> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair-v2.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066839#comment-13066839 ] 

Stu Hood commented on CASSANDRA-2433:
-------------------------------------

Hey Sylvain: sorry it took me so long to get back to this one. Would you mind rebasing it?

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.2
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0002-Register-in-gossip-to-handle-node-failures.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 2433_v3.patch

bq. Someone's probably going to want the JMX information but let's keep Stages for Verb-associated tasks

Sounds good, updated patch add a new executor directly into AntiEntropy.

bq. How about splitting onDead and onRestart in EndpointStateChange, then?

Done.

bq. I prefer using WrappedRunnable to a Callable

I changed to use WrappedRunnable. However, we still need to have access to both the repair session and the future from the executor so the implementation returns a pair of those two objects. I'm only marginally convinced this is cleaner than the previous solution...


> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch, 2433_v3.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093874#comment-13093874 ] 

Sylvain Lebresne commented on CASSANDRA-2433:
---------------------------------------------

bq. Why do we need the new AE_SESSIONS stage?

If you mean "why AE_SESSIONS when we already have the AE stage?", then it is because repair push stuffs on the AE stage that it wait for, so we would deadlock. If you mean "why a stage?", it felt cleaner that just a Thread now that we want to check for exception at the end of the exception. If you mean "why a stage rather than a simple ThreadExecutor?", it is a good question. I guess it was just some reflex of mine to get a JMXEnabledThreadPool, but it's probably not worth a stage, not even the jmx enabledness maybe.

bq. I prefer using WrappedRunnable to a Callable when you want to allow exceptions but don't care about a return value

Agreed. I'll update the patch.

bq. I think we can avoid a bunch of no-op onConvicts if RepairSession were to subscribe to FD directly instead of going through Gossip

Yeah, I kind of started with that but the problem is that we must deal with the case of a node restarting before it has been convicted (especially if the conviction threshold is higher), which the FD won't see. We could deal of that last situation separately and have Gossip call some trigger into AntiEntropy on a gossip generation change to indicate to stop every started session involving the given endpoint, but creating a dependency of gossip to anti-entropy didn't felt like a good idea a priori.

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 2433.patch

Attaching a rebase of the two previous first patches as '2433.patch'. That is, this patch adds registering in gossip so that repair fails and report it to the user when a node participating to the repair dies. Compared to the previous version, it fails fast because it's the easier thing to do now and a better option imho.

I should mention that while it is lame that repair get stuck when a node dies and we should fix it, this means that if a node is wrongly marked down, we will fail repair for no reason (but I suppose it's a failure detector problem).

Attached patch is against 0.8. This has no upgrade consequence of any sort and is a reasonably simple patch, so I think it could be worth committing in 0.8.
The rest of what was in previous patch 0003 and 0004 cannot go into 0.8 because it changes the wire protocol, so I will rebase against trunk directly, and maybe in another ticket. Having this first patch committed would help with that though :)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.4
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch
                0003-Report-streaming-errors-back-to-repair-v4.patch
                0002-Register-in-gossip-to-handle-node-failures-v4.patch
                0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch

Attaching v4 that is rebased and simply set the reties variable in StreamInSession volatile after all (I've removed old version because it was a mess).

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0004-Reports-validation-compaction-errors-back-to-repair.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094591#comment-13094591 ] 

Jonathan Ellis commented on CASSANDRA-2433:
-------------------------------------------

bq. we still need to have access to both the repair session and the future from the executor so the implementation returns a pair of those two objects

You can still use the RepairFuture approach, just use the FutureTask(Runnable, V) constructor

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch, 2433_v3.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0003-Report-streaming-errors-back-to-repair-v2.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0002-Register-in-gossip-to-handle-node-failures-v3.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment:     (was: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 2433_v4.patch

You're right, don't know why I got carried away like that. v4 "fixes" this.

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch, 2433_v3.patch, 2433_v4.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2433:
------------------------------------------

    Description: 
Running repair in cases where a stream fails we are seeing multiple problems.

1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
2. The temp files are left behind and multiple failures can end up filling up the data partition.

These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.

This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

  was:
Running repair in cases where a stream fails we are seeing multiple problems.

1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
2. The temp files are left behind and multiple failures can end up filling up the data partition.

These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.

This issue is also being worked on w.r.t CASSANDRA-208, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.


> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.7.5
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2433:
--------------------------------------

    Reviewer: yukim  (was: stuhood)

Yuki, can you review this patch?

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.4
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093881#comment-13093881 ] 

Jonathan Ellis commented on CASSANDRA-2433:
-------------------------------------------

bq. it's probably not worth a stage, not even the jmx enabledness maybe

Someone's probably going to want the JMX information but let's keep Stages for Verb-associated tasks.

bq. the problem is that we must deal with the case of a node restarting before it has been convicted (especially if the conviction threshold is higher), which the FD won't see

How about splitting onDead and onRestart in EndpointStateChange, then?  Then RS could implement convict and onRestart (ignoring onDead); other ESCS listeners could implement onRestart == onDead.  That would maintain the "ESCS is about events, FDEL is low-level convict information" separation of roles.

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 0004-Reports-validation-compaction-errors-back-to-repair.patch
                0003-Report-streaming-errors-back-to-repair.patch
                0002-Register-in-gossip-to-handle-node-failures.patch
                0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch

Attached patches are against 0.8.

This tries to catch what can go wrong with repair and reports it back to the user by making the full repair throw an exception. More precisely:
  * patch 0001: add a method to repair for reporting failure and propagate that up to the repair session. This puts repair session on a specific stage (instead of having RepairSession be a Thread) and use a future to allow waiting on completion. This allows a cleaner API to deal with errors (the Future.get() simply throw an ExecutionException) and this add the advantage of stage management to repair sessions.
  * patch 0002: Make repair session register through gossip to be informed of node dying and failing the session when that happens.
  * patch 0003: Reports errors during streaming to the repair session. This actually introduces a generic way to handle streaming failures and after that we should probably update the other user of streaming to deal correctly with failure too.
  * patch 004: Catch errors during validation compaction and push them up to repair (whether those happens on the coordinator of the repair or not).

Note that this includes streaming failures and thus includes stuffs from the patch of Aaron Morton attached on CASSANDRA-2088, but contrarily to that patch, it takes the approach of failing fast. This means that if streaming fails on a file, it fails the streaming altogether (same for repair). I think this is simpler code-wise and more useful from the point of view of the user, since a failure means the use will have to retry anyway.

Last but not least, this makes some modification to messages. So either this goes into 0.8.0 (which I think it should, because this really is a bug fix and fixes something that is a pain for users), or we should had a new messaging version for 0.8.0 and modify this to take it into account (we should probably add a 0.8.0 version to the messaging service anyway).


> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038181#comment-13038181 ] 

Stu Hood commented on CASSANDRA-2433:
-------------------------------------

0001
* Since we're not trying to control throughput or monitor sessions, could we just use Stage.MISC?
0002
* I think RepairSession.exception needs to be volatile to ensure that the awoken thread sees it
* Would it be better if RepairSession implemented IEndpointStateChangeSubscriber directly?
* The endpoint set needs to be threadsafe, since it will be modified by the endpoint state change thread, and the AE_STAGE thread
0003
* Should StreamInSession.retries be volatile/atomic? (likely they won't retry quickly enough for it to be a problem, but...)
0004
* Playing devil's advocate: would sending a half-built tree in case of failure still be useful?

Thanks Sylvain!

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 0002-Register-in-gossip-to-handle-node-failures.patch, 0003-Report-streaming-errors-back-to-repair-v2.patch, 0003-Report-streaming-errors-back-to-repair.patch, 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data partition.
> These issues together are making repair very difficult for nearly everyone running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8 for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira