Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2011/06/23 13:36:47 UTC

[jira] [Created] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Repair doesn't synchronize merkle tree creation properly
--------------------------------------------------------

                 Key: CASSANDRA-2816
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.0, 0.7.0
            Reporter: Sylvain Lebresne
            Assignee: Sylvain Lebresne


Being a little slow, I just realized after having opened CASSANDRA-2811 and CASSANDRA-2815 that there is a more general problem with repair.

When a repair is started, it sends a number of merkle tree requests to its neighbors as well as to itself, and assumes for correctness that the building of those trees starts on every node at roughly the same time (if not, we end up comparing data snapshots taken at different times and will thus mistakenly repair a lot of data needlessly). This is bogus for several reasons:
* Because validation compaction runs on the same executor as other compactions, the start of the validation on the different nodes is subject to other compactions. 0.8 mitigates this somewhat by being multi-threaded (so there is less chance of being blocked a long time by a long-running compaction), but since the compaction executor is bounded, it's still a problem.
* If you run a nodetool repair without arguments, it will repair every CF. As a consequence it will generate lots of merkle tree requests, and all of those requests will be issued at the same time. Because even in 0.8 the compaction executor is bounded, some of those validations will end up queued behind the first ones. Even assuming the different validations are submitted in the same order on each node (which isn't guaranteed either), there is no guarantee that the first validation will take the same time on every node, hence desynchronizing the queued ones.

Overall, it is important for the precision of repair that, for a given CF and range (which is the unit at which trees are computed), we make sure all nodes start the validation at the same time (or, since we can't do magic, as close to it as possible).

One (reasonably simple) proposition to fix this would be to have repair schedule validation compactions across nodes one by one (i.e., one CF/range at a time), waiting for all nodes to return their tree before submitting the next request; a rough sketch of that loop follows the list below. Then, on each node, we should make sure that the node starts the validation compaction as soon as it is requested. For that, we probably want a specific executor for validation compaction, and then either:
* we fail the whole repair whenever one node is not able to execute the validation compaction right away (because no thread is available right away), or
* we simply tell the user that if he starts too many repairs in parallel, he may start seeing some of them repairing more data than they should.
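
A rough, hedged sketch of the coordinator-side loop proposed above (hypothetical names, not the actual AntiEntropyService API): one CF/range is validated on all replicas before the next request is issued.

{noformat}
// Sketch only (hypothetical names, not the actual AntiEntropyService API):
// request trees for one CF/range at a time and wait for every node's tree
// before submitting the next request, so queued validations cannot drift
// apart across nodes.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class RepairSchedulerSketch
{
    static class TreeRequest
    {
        final String columnFamily;
        final String range;
        TreeRequest(String columnFamily, String range)
        {
            this.columnFamily = columnFamily;
            this.range = range;
        }
    }

    interface Replica
    {
        // Completes when this replica has built and sent back its merkle tree.
        CompletableFuture<Object> requestValidation(TreeRequest request);
    }

    void repair(List<TreeRequest> requests, List<Replica> replicas) throws Exception
    {
        for (TreeRequest request : requests)
        {
            // Ask every replica (including ourselves) to validate this CF/range now...
            List<CompletableFuture<Object>> trees =
                replicas.stream()
                        .map(replica -> replica.requestValidation(request))
                        .collect(Collectors.toList());
            // ...and block until every tree is back before moving to the next CF/range.
            CompletableFuture.allOf(trees.toArray(new CompletableFuture[0])).get();
            // (comparing the trees and streaming the differences would happen here)
        }
    }
}
{noformat}

Waiting on all trees for one CF/range at a time keeps the snapshots being compared as close in time as possible, at the cost of serializing tree construction across column families.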



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054289#comment-13054289 ] 

Peter Schuller commented on CASSANDRA-2816:
-------------------------------------------

I've thought about this problem too, and it is really significant for some use cases, again because so few writes are needed to trigger large amounts of data being sent, given the merkle tree granularity.

While I'm all for fixing it by e.g. more immediate snapshotting, I would like to raise the issue that repairs overall have pretty significant side-effects, particularly ones that can self-magnify and cause further problems. Beyond the obvious "it does disk I/O" and "it uses CPU", we have:

* Over-repair due to merkle tree granularity can cause jumps in CF sizes, killing cache locality.
* Combine that with concurrent repairs then repairing the "size-jumped" set of sstables, and you can magnify that effect on other nodes, causing huge size increases.
* Until recently, mixing large and small CFs was a significant problem if you wanted different repair frequencies and different gc grace times, because one repair would block on another. But the fixes here and in the other JIRA about concurrency might undo the "fix" for that, which was concurrent compaction - so back to square one.

Overall, it seems very easy to shoot yourself in the foot with repair.

Any opinions on CASSANDRA-2699 for longer-term changes to repair?




[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059658#comment-13059658 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

bq. I also noticed that in this case, we have RF=3. The node which is going somewhat crazy is number "6"; however, during the repair it does log that it talks to, compares and streams data with nodes 4, 5, 7 and 8.

This may well be correct: node 7's range replicates to nodes 6 and 8, so 6 and 8 share data.

So, to make things safe even with this patch, only every 4th node can run repair at the same time if RF=3, but you still need to run repair on each of those 4 nodes to make sure everything is repaired?
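
As an aside, here is a tiny illustrative sketch (assuming SimpleStrategy-style placement with no rack awareness, which is an assumption about this cluster) of why nodes 4, 5, 7 and 8 are exactly the peers that share ranges with node 6 at RF=3:

{noformat}
// Illustrative only: SimpleStrategy-style placement (no rack awareness assumed)
// on a ring of 8 nodes with RF=3. Each node's primary range is replicated to
// the next RF-1 nodes; we collect every node that holds a range node 6 holds.
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class ReplicaOverlapSketch
{
    public static void main(String[] args)
    {
        int nodes = 8, rf = 3, target = 6;
        Set<Integer> sharesWith = new TreeSet<>();
        for (int primary = 1; primary <= nodes; primary++)
        {
            Set<Integer> replicas = new HashSet<>();
            for (int i = 0; i < rf; i++)
                replicas.add((primary - 1 + i) % nodes + 1);   // primary, primary+1, primary+2 (wrapping)
            if (replicas.contains(target))
                sharesWith.addAll(replicas);
        }
        sharesWith.remove(target);
        System.out.println("Node " + target + " shares ranges with " + sharesWith);
        // Prints: Node 6 shares ranges with [4, 5, 7, 8]
    }
}
{noformat}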


[jira] [Issue Comment Edited] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054841#comment-13054841 ] 

Terje Marthinussen edited comment on CASSANDRA-2816 at 6/25/11 9:04 AM:
------------------------------------------------------------------------

This sounds very interesting.

We have also spotted very noticeable issues with full GCs when the merkle trees are passed around. Hopefully this could fix that too.

I will see if I can get this patch tested somewhere if it is ready for that.

On a side topic, given the importance of getting tombstones properly synchronized within GCGraceSeconds, would it be a potentially interesting idea to separate tombstones into different sstables, to reduce the need to scan the whole dataset very frequently in the first place?

Another thought may be to make compaction deterministic, or synchronized by a master across nodes, so that for older data all we need is to compare pre-stored md5s of whole sstables.

That is, while keeping the masterless design for updates, we could consider a master-based design for how older data is organized by the compactor, so that it would be much easier to verify that "old" data is the same without any large regular scans, and that data is really the same after big compactions etc.



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067011#comment-13067011 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. I'm kinda +1 on the simple version w/o bounds

Me too.  +1 with that change.


[jira] [Issue Comment Edited] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054237#comment-13054237 ] 

Jonathan Ellis edited comment on CASSANDRA-2816 at 6/24/11 4:48 AM:
--------------------------------------------------------------------

Supporting actual live-reading of snapshotted sstables is a little more than "polishing." It would be cool, but I wouldn't want it to block fixing repair.


[jira] [Updated] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2816:
--------------------------------------

    Attachment: 2816-v5.txt

You're right.  v5 attached.


[jira] [Updated] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2816:
--------------------------------------

    Attachment: 2816-v2.txt

Rebased and switched to an unbounded executor for validations.

Tests do not compile, but I believe that was already the case w/ v1 -- not sure what to do with blockUntilRunning, which was removed.


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059637#comment-13059637 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Regardless of any documentation change, however, I don't think it should be possible to actually trigger a scenario like this in the first place.

The system should protect the user from that.

I also noticed that in this case, we have RF=3. The node which is going somewhat crazy is number "6"; however, during the repair it does log that it talks to, compares and streams data with nodes 4, 5, 7 and 8.

Seems like a couple of nodes too many?


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053820#comment-13053820 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

I guess a dedicated validation executor is ok as long as it still obeys the global "compaction" i/o limit.
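
For what it's worth, a minimal sketch of one way a shared limit could look (Guava's RateLimiter is used here only as a stand-in for the real compaction throughput throttle; the names are hypothetical):

{noformat}
// Sketch: one process-wide throughput limiter shared by regular compaction
// and validation compaction, so a dedicated validation executor still
// respects the global compaction I/O cap. Guava's RateLimiter is a stand-in
// for the actual throttle.
import com.google.common.util.concurrent.RateLimiter;

class SharedCompactionThrottleSketch
{
    // e.g. 16 MB/s expressed as bytes per second, shared by all compaction-like work
    static final RateLimiter compactionThroughput = RateLimiter.create(16d * 1024 * 1024);

    // Called from both regular compactions and validation compactions before
    // each chunk of rows is read or written.
    static void throttle(int bytesAboutToProcess)
    {
        compactionThroughput.acquire(bytesAboutToProcess);
    }
}
{noformat}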


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059643#comment-13059643 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. May I change to

Sure.

bq. The system should protect the user from that

I'm not sure that in a p2p design we can posit an omniscient "the system."


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055493#comment-13055493 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. I am not really sure it is a good idea to do this at all, reducing caches during stress rarely improves anything

(This is on by default because the most common cause of OOMing is people configuring their caches too large.)

It sounds odd to me that repair would balloon memory usage dramatically.  Do you have monitoring graphs that show the difference in heap usage between "normal" and "repair in progress?"


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068508#comment-13068508 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

I think that if we don't want the validation executor of v4 to ever queue tasks (which is what we need), then the executor queue needs to be a bounded queue of size 0 (i.e. one that doesn't accept elements). Indeed, as per the documentation of ThreadPoolExecutor:
{noformat}
If corePoolSize or more threads are running, the Executor always prefers queuing a request rather than adding a new thread.
{noformat} 
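
For what it's worth, a minimal sketch of such an executor (not the actual patch): backing the pool with a SynchronousQueue gives exactly a zero-capacity queue, so a submitted validation either gets a thread right away or is rejected.

{noformat}
// Sketch only: an executor that never queues. SynchronousQueue has no
// capacity, so offer() fails unless a worker is already waiting, which forces
// the pool to spawn a new thread (up to maximumPoolSize) or reject the task.
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ValidationExecutorSketch
{
    static ThreadPoolExecutor create(int maxConcurrentValidations)
    {
        return new ThreadPoolExecutor(1,                          // corePoolSize
                                      maxConcurrentValidations,   // maximumPoolSize
                                      60, TimeUnit.SECONDS,       // keep-alive for idle threads
                                      new SynchronousQueue<Runnable>());
        // For the "simple version w/o bounds" discussed elsewhere in the thread,
        // maxConcurrentValidations would be Integer.MAX_VALUE.
    }
}
{noformat}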


[jira] [Updated] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2816:
--------------------------------------

    Attachment: 2816-v4.txt

v4 attached against 0.8 with a corrected uncapped validation executor.


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064898#comment-13064898 ] 

Peter Schuller commented on CASSANDRA-2816:
-------------------------------------------

I'm kinda +1 on the simple version w/o bounds, but not too fussy since I can obviously set it very high for my use case. In any case, the most important part for the mixed small/large situation is that concurrent repair is possible, even if configuration changes are needed.



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059991#comment-13059991 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. Can you point out where this happens in AES?

Mostly in AES.rendezvous and AES.RepairSession. Basically, RepairSession creates a queue of jobs, a job representing the repair of a given column family (for a given range, but that comes from the session itself). AES.rendezvous is then called for each received merkle tree. It waits until it has all the merkle trees for the first job in the queue. When that is done, it dequeues the job (computing the merkle tree differences and scheduling streaming accordingly) and sends the tree requests for the next job in the queue.
Moreover, in StorageService.forceTableRepair(), when scheduling the repair for all the ranges of the node, we actually start the session for the first range and wait for all the "jobs" for this range to be done before starting the next session.
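
To make that flow concrete, here is a minimal, self-contained sketch of the queue-and-rendezvous logic using only plain Java collections; RepairJob and RepairSessionSketch are illustrative stand-ins, not the actual AES/RepairSession classes:

import java.util.*;

// Illustrative stand-in for a per-CF repair job.
class RepairJob {
    final String cfname;
    final Set<String> waitingFor; // endpoints we still expect a merkle tree from
    RepairJob(String cfname, Collection<String> endpoints) {
        this.cfname = cfname;
        this.waitingFor = new HashSet<>(endpoints);
    }
}

class RepairSessionSketch {
    private final Queue<RepairJob> jobs = new ArrayDeque<>();
    private final List<String> endpoints; // this node plus its neighbors

    RepairSessionSketch(List<String> cfnames, List<String> endpoints) {
        this.endpoints = endpoints;
        for (String cf : cfnames)
            jobs.add(new RepairJob(cf, endpoints));
    }

    void start() {
        // Only the first job's tree requests go out; the rest stay queued.
        if (!jobs.isEmpty())
            sendTreeRequests(jobs.peek());
    }

    // Called once per received merkle tree (the role rendezvous plays above).
    synchronized void rendezvous(String cfname, String endpoint) {
        RepairJob head = jobs.peek();
        if (head == null || !head.cfname.equals(cfname))
            return; // not the job we are currently waiting on; ignored in this sketch
        head.waitingFor.remove(endpoint);
        if (head.waitingFor.isEmpty()) {
            jobs.poll();
            computeDifferencesAndStream(head);
            if (!jobs.isEmpty())
                sendTreeRequests(jobs.peek()); // the next CF only starts now
        }
    }

    private void sendTreeRequests(RepairJob job) {
        for (String ep : endpoints)
            System.out.println("requesting tree for " + job.cfname + " from " + ep);
    }

    private void computeDifferencesAndStream(RepairJob job) {
        System.out.println("all trees received for " + job.cfname + "; diffing and streaming");
    }

    public static void main(String[] args) {
        RepairSessionSketch session = new RepairSessionSketch(
                Arrays.asList("cf1", "cf2"), Arrays.asList("self", "n1", "n2"));
        session.start();
        for (String ep : Arrays.asList("self", "n1", "n2"))
            session.rendezvous("cf1", ep); // cf2's requests go out only after the last of these
    }
}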

bq. That feels like the wrong default to me. I think you can make a case for one (minimal interference with the rest of the system) or unlimited (no weird "cliff" to catch the unwary repair operator). But two is weird.

Well, the rationale was the following: if you set it to one, then as soon as you start 2 repairs in parallel, they will start being inaccurate. But as Peter was suggesting (maybe in another ticket, but anyway), if you have huge CFs and tiny ones, it's nice to be able to run repair on the tiny ones while a repair on the huge one(s) is running. Now, making it unlimited feels dangerous, because if you do so, it means that if the user starts a lot of repairs, all the validation compactions will start right away. This will kill the cluster (or at least a few nodes, if all those repairs were started on the same node). It sounded better to have degraded precision for repair in those cases rather than basically killing the nodes. Maybe 3 or 4 would be a better default than 2, but 1 is a bit limited and unlimited is clearly much too dangerous.
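
For illustration, a dedicated validation executor with a small, configurable number of core threads could look like the sketch below, written against plain java.util.concurrent rather than Cassandra's own executor classes; CONCURRENT_VALIDATORS here is just a placeholder for the setting being discussed:

import java.util.concurrent.*;

public class ValidationExecutorSketch {
    // Placeholder for the setting discussed above (default of 2 in the patch).
    private static final int CONCURRENT_VALIDATORS = 2;

    // A separate pool, so validation compactions are never queued behind regular compactions.
    private static final ExecutorService VALIDATION_EXECUTOR =
            Executors.newFixedThreadPool(CONCURRENT_VALIDATORS, r -> {
                Thread t = new Thread(r, "ValidationExecutor");
                t.setDaemon(true);
                return t;
            });

    public static Future<?> submitValidation(Runnable validationCompaction) {
        // With 2 threads, two concurrent repairs touching this node can both start
        // their validation immediately; a third gets queued, which is exactly the
        // desynchronization risk described in this ticket.
        return VALIDATION_EXECUTOR.submit(validationCompaction);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            int id = i;
            submitValidation(() -> System.out.println("validation " + id + " running on "
                    + Thread.currentThread().getName()));
        }
        VALIDATION_EXECUTOR.shutdown();
    }
}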


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059962#comment-13059962 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. The patch implements the idea of scheduling the merkle tree requests one by one, to make sure the trees are started as close as possible to "the same time". 

Can you point out where this happens in AES?

bq. This also puts validation compactions in their own executor (to avoid them being queued up behind standard compactions). That specific executor is created with 2 core threads, to allow for Peter's use case of wanting to do multiple repairs at the same time. That is, by default, you can do 2 repairs involving the same node and be ok

That feels like the wrong default to me.  I think you can make a case for one (minimal interference with the rest of the system) or unlimited (no weird "cliff" to catch the unwary repair operator).  But two is weird. :)


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059445#comment-13059445 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Things definitely seem to be improved overall, but weird things still happen.

So... 12-node cluster; this is maybe ugly, I know, but I start repair on all of them.
Most nodes are fine, but one goes crazy. Disk use is now 3-4 times what it was before the repair started, and it does not seem to be done yet.

I really have no idea if this is the case, but I have a hunch that this node has ended up streaming out some of the data it is getting in. Would this be possible?



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059485#comment-13059485 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Cool!

Then you confirmed what I have sort of believed for a while, though my understanding of the code has been a bit in conflict with:
http://wiki.apache.org/cassandra/Operations
which says:
"It is safe to run repair against multiple machines at the same time, but to minimize the impact on your application workload it is recommended to wait for it to complete on one node before invoking it against the next."

I have always read that as "if you have the HW, go for it!"

May I change it to:
"It is safe to run repair against multiple machines at the same time. However, to minimize the amount of data transferred during a repair, careful synchronization is required between the nodes taking part in the repair.

This is difficult to do if nodes holding replicas of the same data run repair at the same time, and doing so can in extreme cases generate excessive transfers of data.

Improvements are being worked on, but for now, avoid scheduling repair at the same time on several nodes that hold replicas of the same data."




[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059655#comment-13059655 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

bq.I'm not sure that in a p2p design we can posit an omniscient "the system."

Is that a philosophical statement? :)

As Cassandra, at least for now, is a p2p network with fairly clearly defined boundaries, I will continue calling it a "system" for now :)

However, looking at it from the p2p viewpoint, the user potentially has no clue about where replicas are stored, and given this, it may be impossible for the user to issue repair manually on more than one node at a time without getting into trouble. Given a large enough p2p setup, it would also be non-trivial to actually schedule a complete repair without ending up with 2 or more repairs running on the same replica set.
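
As a rough illustration of that scheduling constraint (assuming, hypothetically, that the operator can obtain each node's replica set), a greedy selection of nodes whose replica sets do not overlap could look like this:

import java.util.*;

public class RepairBatchSketch {
    // Pick a batch of nodes whose replica sets do not intersect, so their repairs
    // never run validations over the same replicas at the same time.
    static List<String> nextBatch(Map<String, Set<String>> replicaSetsByNode,
                                  Set<String> alreadyRepaired) {
        List<String> batch = new ArrayList<>();
        Set<String> touched = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : replicaSetsByNode.entrySet()) {
            if (alreadyRepaired.contains(e.getKey()))
                continue;
            if (Collections.disjoint(touched, e.getValue())) {
                batch.add(e.getKey());
                touched.addAll(e.getValue());
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        // Toy 6-node ring with RF=3: node N's data is replicated on N, N+1, N+2.
        Map<String, Set<String>> replicas = new LinkedHashMap<>();
        for (int i = 1; i <= 6; i++)
            replicas.put("node" + i, new HashSet<>(Arrays.asList(
                    "node" + i, "node" + (i % 6 + 1), "node" + ((i + 1) % 6 + 1))));
        System.out.println(nextBatch(replicas, new HashSet<>())); // prints [node1, node4]
    }
}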

Since Cassandra does not checkpoint the synchronization and is thus forced to rescan everything on every repair, repairs easily take so long that you are forced to run them on several nodes at a time if you are going to manage to finish repairing all nodes in 10 days...

Anyway, this is way outside the scope of this jira :)


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057339#comment-13057339 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

This is what the heap looks like when GC starts slowing things down so much that even gossip gets delayed long enough for nodes to appear down for some seconds.

  num     #instances         #bytes  class name
----------------------------------------------
   1:       9453188      453753024  java.nio.HeapByteBuffer
   2:      10081546      392167064  [B
   3:       7616875      243740000  org.apache.cassandra.db.Column
   4:       9739914      233757936  java.util.concurrent.ConcurrentSkipListMap$Node
   5:       4131938       99166512  java.util.concurrent.ConcurrentSkipListMap$Index
   6:       1549230       49575360  org.apache.cassandra.db.DeletedColumn

I guess this really ends up being the mix of everything going on in total and all the reading and writing that may occur when repair runs (validation compactions, streaming, normal compactions and regular traffic all at the same time, and maybe many CFs at the same time).

However, I have suspected for some time that our young generation size was a bit on the small side, and after increasing it and giving the heap a few more GB to work with, it seems like things are behaving quite a bit better.

I mentioned issues with this patch when testing for CASSANDRA-2521. That was a problem caused by me: I was playing around with git for the first time and managed to apply 2816 to a different branch than the one I used for testing... :(

My apologies. 

Initial testing with that corrected looks a lot better for my small-scale test case, but I noticed one case where I deleted an sstable and restarted: it did not get repaired (repair scanned but did nothing).

Not entirely sure what to make of that; I then tried deleting another sstable and repair started running.

I will test more over the next days. 



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067019#comment-13067019 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

(I'll go ahead and submit a version with that change.)


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055421#comment-13055421 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. We have also spotted very noticeable issues with full GCs when the merkle trees are passed around. Hopefully this could fix that too.

This does make sure that we don't run multiple validations at the same time and that we keep only a small number of merkle trees in memory at any given time. So I suppose this could help on the GC side, though I may be too optimistic about that, in part because I'm not sure what causes your issues. But it can't hurt on that front at least.

bq. I will see if I can get this patch tested somewhere if it is ready for that.

I believe it should be ready for that.

bq. would it be a potentially interesting idea to separate tombstones into different sstables.

The thing is that some tombstones may be irrelevant because some update supersedes them (this is especially true of row tombstones). Hence basing a repair on tombstones only may transfer irrelevant data. Depending on the use case, this will be more or less of a big deal. Also, this means that reads will be impacted in that we will often have to hit twice as many sstables. Given that it's not a crazy idea either to want to repair data regularly (if only for durability guarantees), I don't know if it is worth the trouble (we would have to separate tombstones from data at flush time, we would have to maintain the two separate sets of data/tombstone sstables, etc.).

bq. make compaction deterministic or synchronized by a master across nodes

Pretty sure we want to avoid going to a master architecture for anything if we can. Having a master means that failure handling is more difficult (think network partitions, for instance) and requires leader election and such, and the whole point of Cassandra's fully distributed design is to avoid those. Even without considering those, synchronizing compaction means synchronizing flush somehow, and you would need to be precise if you're going to use whole-sstable md5s, which will be hard and quite probably inefficient.


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059636#comment-13059636 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Regardless of documentation changes, however, I don't think it should be possible to actually trigger a scenario like this in the first place.

The system should protect the user from that.

I also noticed that in this case we have RF=3. The node which is going somewhat crazy is number "6"; however, during the repair it does log that it talks to, compares and streams data with nodes 4, 5, 7 and 8.

Seems like a couple of nodes too many?


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054841#comment-13054841 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

Sounds good to me.

This sounds very interesting.
We have also spotted very noticeable issues with full GCs when the merkle trees are passed around. Hopefully this could fix that too.

I will see if I can get this patch tested somewhere if it is ready for that.

On a side topic, given the importance of getting tombstones properly synchronized within GCGraceSeconds, would it be a potentially interesting idea to separate tombstones into different sstables, to reduce the need to scan the whole dataset very frequently in the first place?

Another thought may be to make compaction deterministic, or synchronized by a master across nodes, so that for older data all we would need is to compare pre-stored md5s of whole sstables?

That is, while keeping the masterless design for updates, we could consider a master-based design for how older data is organized by the compactor, so it would be much easier to verify that "old" data is the same without any large regular scans, and that data is really the same after big compactions etc.
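
As a rough sketch of that "pre-stored digest per sstable" idea (plain JDK classes only; this is not something Cassandra does today), comparing whole immutable files across replicas could look like:

import java.io.InputStream;
import java.nio.file.*;
import java.security.MessageDigest;
import java.util.*;

public class SSTableDigestSketch {
    // Digest a whole (immutable) sstable file; in the idea above this would be
    // computed once at write time and stored alongside the file.
    static String md5Of(Path sstable) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(sstable)) {
            byte[] buf = new byte[1 << 16];
            for (int n; (n = in.read(buf)) != -1; )
                md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest())
            hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Two replicas agree on their "old" data iff their per-file digest maps match.
    static boolean sameOldData(Map<String, String> local, Map<String, String> remote) {
        return local.equals(remote);
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> digests = new TreeMap<>();
        for (String path : args) // pass sstable file paths on the command line
            digests.put(path, md5Of(Paths.get(path)));
        System.out.println(digests);
    }
}

Of course, this only helps if compaction is deterministic enough that replicas hold byte-identical files, which is exactly the hard part being discussed.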



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064874#comment-13064874 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

bq. making it unlimited feels dangerous, because if you do so, it means that if the user starts a lot of repairs, all the validation compactions will start right away

But the easy solution is "don't do that."

By setting a finite number greater than one, you have to restart machines when you realize "oh, I want to have 3 simultaneous now."

I'd rather keep it simple: make it unbounded, no configuration settings.  If you ignore the instructions to only run one repair at once, then either you know what you're doing (maybe you have SSDs) or you will find out very quickly and never do it again. :)


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059459#comment-13059459 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. So... 12-node cluster; this is maybe ugly, I know, but I start repair on all of them.

Was it started on all of them? If so, this is "kind of" expected, in the sense that the patch assumes that each node does not take part in more than 2 repairs (for any column family) at the same time (this is configurable through the new concurrent_validators option, but it's probably better to stick to 2 and stagger the repairs). If you do more than that (that is, if you did repair on all nodes at the same time and RF > 2), then we're back to our old demons.

bq. I really have no idea if this is the case, but I have a hunch that this node has ended up streaming out some of the data it is getting in. Would this be possible?

Not really. That is, it could be that you create a merkle tree on some data and, once you start streaming, you're picking up data that was just streamed to you and wasn't there when computing the tree. This patch is supposed to fix this in part, but it can still happen if you do repairs in parallel on neighboring nodes. However, you shouldn't get into a situation where 2 nodes stream forever because they pick up what was just streamed to them, for instance, because what is streamed is determined at the very beginning of the streaming session.

So my first question would be: were all those repairs started in parallel? If yes, you shall not do this :). CASSANDRA-2606 and CASSANDRA-2610 are here to help make the repair of a full cluster much easier (and more efficient), but right now it's more about getting patches in one at a time.
If the repairs were started one at a time in a rolling fashion, then we do have an unknown problem somewhere.


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054297#comment-13054297 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

I'm not sure what you mean by "snapshotting immediately" or "polishing our snapshot support", but one approach that I think is equivalent (or maybe that is what you meant by 'snapshotting') would be to grab references to the sstables at the very beginning of each request and use those throughout the repair. This has a problem, however: it means we prevent those sstables from being deleted during the repair, including sstables that get compacted in the meantime. Because repair can take a while, this would be bad. It would also require changes to the wire protocol (we'd need a way to indicate during streaming which set of sstables to consider), and since we've more or less decided not to do that in minor releases (at least until we've discussed it), this couldn't be released quickly. Which is bad, because I'm pretty sure this is a good part of the reason why some people with big data sets have had huge pain with repair.
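
As a schematic of why that is costly (a hypothetical type, not Cassandra's SSTableReader API), holding a reference for the whole repair is exactly what delays deletion:

    import java.util.concurrent.atomic.AtomicInteger;

    // Schematic only: while a long-running repair holds a reference to an sstable,
    // the file cannot be deleted, even after compaction has replaced it.
    final class PinnedSSTable {
        private final AtomicInteger refs = new AtomicInteger(1); // 1 = the "live" reference
        private volatile boolean deletable = false;

        void acquireForRepair()   { refs.incrementAndGet(); }    // repair pins the file
        void markCompactedAway()  { release(); }                 // compaction drops the live ref
        void releaseAfterRepair() { release(); }                 // repair finally lets go

        private void release() {
            if (refs.decrementAndGet() == 0)
                deletable = true;                                // only now can the file be removed
        }

        boolean stillOnDisk() { return !deletable; }
    }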

Scheduling the validations one by one avoids those problems. In theory this means we do less work in parallel, but in practice I doubt this is a big deal, since the goal is probably to have repair put less load on the node rather than more. It also makes the whole thing easier to reason about.
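
A minimal sketch of that scheduling (hypothetical names, not the actual AntiEntropyService code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;

    // Ask every replica for the merkle tree of one (CF, range) unit, wait for all
    // trees to come back, and only then move on to the next unit, so all nodes
    // start each validation as close together in time as possible.
    final class OneByOneTreeRequests {
        interface TreeRequester {
            Future<byte[]> requestTree(String replica, String cfAndRange);
        }

        static void run(List<String> units, List<String> replicas, TreeRequester requester)
                throws InterruptedException, ExecutionException {
            for (String unit : units) {
                List<Future<byte[]>> pending = new ArrayList<>();
                for (String replica : replicas)
                    pending.add(requester.requestTree(replica, unit));
                for (Future<byte[]> tree : pending)
                    tree.get();  // block: the next unit is not requested until every tree is in
            }
        }
    }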


[jira] [Updated] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2816:
----------------------------------------

    Attachment: 0001-Schedule-merkle-tree-request-one-by-one.patch

Attaching a patch against 0.8. The patch implements the idea of scheduling the merkle tree requests one by one, to make sure the trees are started as close as possible to "the same time". It also puts validation compactions in their own executor (so they don't get queued up behind standard compactions). That executor is created with 2 core threads, to allow for Peter's use case of running multiple repairs at the same time. That is, by default, you can run 2 repairs involving the same node and be fine; with more, you may see poor precision in repair. The new concurrent_validators parameter is exposed in case someone wants more than 2. That said, regular compactions and validations are not separated for everything; in particular, throttling is shared.

As far as I can test, this successfully fixes the problems from CASSANDRA-2811 and CASSANDRA-2815. It also doesn't change anything on the wire protocol side, so I think we can target it for 0.8.2.
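
For illustration, the executor sizing amounts to something like the following (plain java.util.concurrent here; the actual patch uses Cassandra's own executor class, and the constructor argument stands in for the concurrent_validators setting):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // A validation-only pool: a small, configurable number of core threads
    // (2 by default in the description above) and an unbounded queue, so
    // validations never wait behind regular compactions but do queue among
    // themselves if too many repairs run at once.
    final class ValidationExecutorSketch extends ThreadPoolExecutor {
        ValidationExecutorSketch(int concurrentValidators) {
            super(concurrentValidators, concurrentValidators,
                  60, TimeUnit.SECONDS,
                  new LinkedBlockingQueue<Runnable>());
        }
    }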




[jira] [Issue Comment Edited] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059658#comment-13059658 ] 

Terje Marthinussen edited comment on CASSANDRA-2816 at 7/5/11 2:31 AM:
-----------------------------------------------------------------------

bq. I also noticed that in this case, we have RF=3. The node which is going somewhat crazy is number "6"; however, during the repair it does log that it talks to, compares with and streams data to nodes 4, 5, 7 and 8.

This may well be correct. Node 7 will replicate to nodes 6 and 8, so 6 and 8 would share data.

So, to be safe, even with this patch, only every 4th node can run repair at the same time if RF=3? And you would still need to run repair on each of those 4 nodes to make sure everything is repaired?
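
As a back-of-the-envelope check (a toy model assuming each range is stored on its primary owner and the next RF-1 nodes on the ring; the node numbers are purely illustrative):

    import java.util.TreeSet;

    // With RF=3, repairing node 6 involves every replica of every range node 6
    // stores, i.e. nodes 4, 5, 7 and 8 -- matching the log observation above.
    final class RepairNeighbors {
        static TreeSet<Integer> neighbors(int node, int rf, int ringSize) {
            TreeSet<Integer> result = new TreeSet<>();
            // node stores the ranges primarily owned by node, node-1, ..., node-(rf-1)
            for (int primary = node - rf + 1; primary <= node; primary++)
                // each such range is replicated on primary, primary+1, ..., primary+rf-1
                for (int replica = primary; replica < primary + rf; replica++)
                    result.add(Math.floorMod(replica - 1, ringSize) + 1); // 1-based ring
            result.remove(node);
            return result;
        }

        public static void main(String[] args) {
            System.out.println(neighbors(6, 3, 12));  // prints [4, 5, 7, 8]
            System.out.println(neighbors(10, 3, 12)); // prints [8, 9, 11, 12]
        }
    }

Under this toy model, nodes 6 and 10 (four apart) still share node 8, so the safe spacing may need to be wider than every 4th node; treat this as an illustration of the overlap, not an authoritative answer to the question above.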

As for the comment I made earlier.

To me, it looks like if the repair starts triggering transfers on a large scale, the files the node receives via streaming are not streamed back out, but they may get compacted before the repair finishes, and I suspect the compacted file then gets streamed out, so the repair just never finishes.

      was (Author: terjem):
    bq. I also noticed that in this case, we have RF=3. The node which is going somewhat crazy is number "6"; however, during the repair it does log that it talks to, compares with and streams data to nodes 4, 5, 7 and 8.

This may well be correct. Node 7 will replicate to nodes 6 and 8, so 6 and 8 would share data.

So, to be safe, even with this patch, only every 4th node can run repair at the same time if RF=3? And you would still need to run repair on each of those 4 nodes to make sure everything is repaired?
  

[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068907#comment-13068907 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

Alright, v5 looks good to me. Committed, thanks.


[jira] [Updated] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2816:
----------------------------------------

    Attachment: 2816_0.8_v3.patch

bq. rebased

You rebased it against trunk while the fix version is still 0.8.2. I agree that this feels like a bigger change than what we would normally want for a minor release, but repair is really a major pain point for users. And to tell the truth, it's worse in 0.8 than it is in 0.7: even though the problem this patch solves exists in 0.7 as well, the splitting of repair into ranges done for 0.8 exacerbates it. Moreover, this patch is fairly well delimited in what it changes, and it doesn't touch the wire protocol or anything else that would make upgrades a problem. So I'm actually in favor of taking the small risk of pushing it into 0.8.2 (and being very vigilant about testing repair extensively before the release). For now, I'm attaching a rebase against 0.8 with the tests fixed.

bq. and switched to unbounded executor for validations.

Ok, I realize I'm not sure I understand what you mean by an unbounded executor. In your rebased version, the ValidationExecutor apparently uses the first constructor of DebuggableThreadPoolExecutor, which constructs a single-threaded executor, and that is not what we want. Sure, the queue of the executor will be unbounded, but if that was the concern there was a misunderstanding, because the queue has always been unbounded, even in my initial patch. What concurrent_validators controls is the number of core threads, and we need multiple core threads if we want multiple concurrent repairs to work correctly.

Now, the idea behind a default of 2 core threads was that I can see a reason to want 2 concurrent repairs, but I don't see a very good reason to want more (and it's configurable if someone really needs more). I'll admit it's not a marvelous default, and the patch I've just attached uses the same default as concurrent_compactors, which is maybe less "random". We could have an executor with an unbounded number of threads that spawns a new thread whenever needed, making sure we never queue a validation compaction, but that seems a little dangerous to me. It seems more reasonable to have a (configurable) sensible number of threads and let validations queue up if the user starts more than that many concurrent repairs: that will impact the precision of those repairs, but it would be the user's fault, and it's a better way to handle such a fault than starting an unreasonable number of validation compactions that will starve memory (likely on more than one node, btw). But if you still think it's better to have an unbounded number of threads, I won't fight over this.
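
For readers following along, the two policies being weighed look roughly like this with stock java.util.concurrent factories (an illustration of the trade-off only, not the code in either version of the patch):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // A fixed pool queues excess validations (bounded threads and memory use, but
    // queued trees lose timing precision), while a cached pool starts a thread per
    // validation (nothing queues, but nothing caps how many validations run at once).
    final class ValidationPoolPolicies {
        static ExecutorService fixedValidators(int concurrentValidators) {
            return Executors.newFixedThreadPool(concurrentValidators);
        }

        static ExecutorService unboundedValidators() {
            return Executors.newCachedThreadPool();
        }
    }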

bq. tests do not compile but I believe that was already the case w/ v1

Yes, I completely forgot to update the unit tests, sorry. Attached patch fixes those.



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054237#comment-13054237 ] 

Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------

Supporting actual live-reading of snapshotted sstables is a little more than "polishing."


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068953#comment-13068953 ] 

Hudson commented on CASSANDRA-2816:
-----------------------------------

Integrated in Cassandra-0.8 #231 (See [https://builds.apache.org/job/Cassandra-0.8/231/])
    Properly synchronize merkle tree computation
patch by slebresne; reviewed by jbellis for CASSANDRA-2816

slebresne : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1149121
Files : 
* /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/service/AntiEntropyServiceTestAbstract.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/AntiEntropyService.java
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* /cassandra/branches/cassandra-0.8/conf/cassandra.yaml
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/StorageService.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/concurrent/DebuggableThreadPoolExecutor.java
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/db/compaction/CompactionManager.java



[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054233#comment-13054233 ] 

Stu Hood commented on CASSANDRA-2816:
-------------------------------------

I'm a fan of the approach of snapshotting immediately after receiving the request. In general, polishing our snapshot support to allow for this kind of use case is likely to open up other interesting possibilities.


[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055455#comment-13055455 ] 

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

I don't know what causes GC pressure when doing repairs either, but fire off repair on a few nodes with 100 million docs per node and there is a reasonable chance that a node here and there will log messages about reducing cache sizes due to memory pressure (I am not really sure that is a good idea at all; reducing caches under stress rarely improves anything) or go into full GC.

The idea of a master-controlled compaction would not really affect network splits etc.

Reconciliation after a network split is just as complex with or without a master: either way we need to get back to a state where all the nodes have the same data, which is a complex task.

This is more an observation that we do not necessarily need to live in a quorum-based world during compaction, and we are free to use alternative approaches there without changing the read/write path or affecting availability. Master selection is not really a problem here: start compaction, talk to the other nodes with the same token ranges, select a leader.

It does not even have to be the same master every time, and we could consider making compaction part of a background read repair to reduce the number of times we need to read/write data.

For instance, if we can verify that the oldest/biggest sstable is 100% in sync with the data on other replicas when it is compacted (why not do it during compaction, when we go through the data anyway, rather than later?), can we use that information to optimize the scans done during repair by starting the consistency check only from sstables containing data received after some checkpoint in time?
