Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2011/06/22 17:08:47 UTC

[jira] [Created] (CASSANDRA-2811) Repair doesn't stagger flushes

Repair doesn't stagger flushes
------------------------------

                 Key: CASSANDRA-2811
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2811
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.0
            Reporter: Sylvain Lebresne
            Assignee: Sylvain Lebresne
             Fix For: 0.8.2


When you do a nodetool repair (with no options), the following things occur:
* For each keyspace, a call to SS.forceTableRepair is issued
* In each of those calls, for each token range the node is responsible for, a repair session is created and started
* Each of these sessions will request one merkle tree per column family (from each node for which it makes sense, which includes the node the repair is started on)

All those merkle tree requests are issued basically at the same time. And now that compaction is multi-threaded, this means that usually more than one validation compaction will be started at the same time. The problem is that a validation compaction starts with a flush. Given that by default flush_queue_size is 4 and the number of compaction threads is the number of processors, and given that on any recent machine the number of cores will be >= 4, this will easily end up blocking writes for some period of time.
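
To make the fan-out concrete, here is a minimal sketch of what happens (the names below are illustrative, not the actual StorageService/CompactionManager code): every keyspace, every local range and every column family yields one validation task, and they all land on the multi-threaded compaction executor at essentially the same time.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RepairFanOutSketch
{
    public static void main(String[] args)
    {
        // The compaction executor is sized to the number of processors,
        // so several validations can start (and flush) concurrently.
        ExecutorService compactionExecutor =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        List<String> keyspaces = Arrays.asList("ks1", "ks2");
        List<String> localRanges = Arrays.asList("(0,100]", "(100,200]", "(200,300]");
        List<String> columnFamilies = Arrays.asList("cf1", "cf2");

        for (final String ks : keyspaces)
            for (final String range : localRanges)      // one repair session per range
                for (final String cf : columnFamilies)  // one merkle tree request per CF
                    compactionExecutor.submit(new Runnable()
                    {
                        public void run()
                        {
                            // the real task flushes the memtable, then builds the merkle tree
                            System.out.println("validating " + ks + "." + cf + " for " + range);
                        }
                    });

        compactionExecutor.shutdown();
    }
}
{code}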

It turns out there is also a more subtle problem for repair itself. If two validation compactions for the same column family (but different ranges) are started within a very short time interval, the first validation will block on the flush, but the second one may not block at all if the memtable is clean when it requests its own flush. In that case, the second validation will be executed on data older than it should be.

I think the simplest fix is to make sure we only ever do one validation compaction at a time. It's probably a better use of resources anyway.
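
As a rough sketch of that fix, assuming a hypothetical dedicated executor (the real change would live in CompactionManager; the names below are made up): routing every validation through a single-threaded executor serializes them without touching regular compactions.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SerializedValidationSketch
{
    // one thread => at most one validation compaction at a time
    private static final ExecutorService validationExecutor =
        Executors.newSingleThreadExecutor();

    // hypothetical entry point; the real code would flush and then build
    // the merkle tree for the requested range of this column family
    public static Future<?> submitValidation(final String keyspace, final String cf)
    {
        return validationExecutor.submit(new Runnable()
        {
            public void run()
            {
                System.out.println("validating " + keyspace + "." + cf);
            }
        });
    }
}
{code}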


[jira] [Commented] (CASSANDRA-2811) Repair doesn't stagger flushes

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053302#comment-13053302 ] 

Sylvain Lebresne commented on CASSANDRA-2811:
---------------------------------------------

The question that remains is whether we prefer adding a dedicated single-threaded executor for validation compaction (which could make sense) or simply introducing a validationCompactionLock.
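
For the second option, a rough sketch of what a validationCompactionLock could look like (an illustration only, not the actual CompactionManager code): validations still run on the regular compaction threads, but the lock makes them mutually exclusive.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

public class ValidationLockSketch
{
    // held for the duration of a validation compaction, so at most one runs at a time
    private static final ReentrantLock validationCompactionLock = new ReentrantLock();

    public static void doValidationCompaction(String keyspace, String cf)
    {
        validationCompactionLock.lock();
        try
        {
            // flush and build the merkle tree for (keyspace, cf) here
            System.out.println("validating " + keyspace + "." + cf);
        }
        finally
        {
            validationCompactionLock.unlock();
        }
    }
}
{code}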


[jira] [Resolved] (CASSANDRA-2811) Repair doesn't stagger flushes

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne resolved CASSANDRA-2811.
-----------------------------------------

    Resolution: Duplicate

Marking this as a duplicate of CASSANDRA-2816, since the patch on the latter includes this fix.


[jira] [Commented] (CASSANDRA-2811) Repair doesn't stagger flushes

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054283#comment-13054283 ] 

Peter Schuller commented on CASSANDRA-2811:
-------------------------------------------

One of the huge benefits of concurrent compaction is that it significantly helps with a mix of small and large column families. If compaction is forced to be serial, we're back to the situation where a 'nodetool repair' of a small 1 GB CF can block for 3 days waiting on the repair of a huge 800 GB CF.

It would be nice if that could be avoided, or at least be tweakable. Can we, for example, just make repair (without a CF specified) do one repair at a time? I.e., fully repair a single CF, then move on to the next, and so on.

That should provide sensible out-of-the-box behavior, while still retaining concurrency for cases where specific CFs are repaired at different intervals.

If the concurrency comes not just from multiple CFs but also from multiple ranges, then it would be nice if all the ranges for a given CF could be treated as "one" compaction, I think.
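
As a rough sketch of that behaviour (the method below is made up, not the actual repair API): when no CF is specified, fully repair one CF before starting the next, and hand all of a CF's ranges over together so they count as a single validation.

{code:java}
import java.util.Arrays;
import java.util.List;

public class SequentialRepairSketch
{
    // hypothetical stand-in for "repair all ranges of one CF as a single unit"
    static void repairColumnFamily(String keyspace, String cf, List<String> ranges)
    {
        System.out.println("repairing " + keyspace + "." + cf + " over ranges " + ranges);
    }

    public static void main(String[] args)
    {
        List<String> localRanges = Arrays.asList("(0,100]", "(100,200]", "(200,300]");
        List<String> columnFamilies = Arrays.asList("small_cf", "huge_cf");

        // a repair with no CF specified: finish one CF entirely before
        // starting the next, instead of validating everything at once
        for (String cf : columnFamilies)
            repairColumnFamily("ks1", cf, localRanges);
    }
}
{code}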

