Posted to commits@cassandra.apache.org by "Peter Schuller (JIRA)" <ji...@apache.org> on 2011/01/08 19:33:45 UTC

[jira] Created: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity
-----------------------------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-1955
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1955
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Peter Schuller


It seems that someone ran into this (see cassandra-user thread "Question re: the use of multiple ColumnFamilies").

If my interpretation is correct, in ColumnFamilyStore the queue size is set to the concurrency (the number of CPU cores) for the flushSorter, and to memtable_flush_writers for the flushWriter.
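
To illustrate the shape I mean, here is a minimal, self-contained sketch - not the actual ColumnFamilyStore code, and the class and method names are invented - of an executor whose queue is sized to its concurrency, with a caller-blocks rejection policy standing in for the blocking executor behavior. Once all threads are busy and the queue is full, the submitting thread (ultimately the write path) stalls:

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Minimal sketch, NOT the actual ColumnFamilyStore code: an executor whose
// queue is sized to its concurrency, with a caller-blocks rejection policy
// standing in for the blocking executor behavior described above.
public class BoundedFlushExecutor
{
    public static ThreadPoolExecutor create(int concurrency)
    {
        return new ThreadPoolExecutor(
            concurrency, concurrency, 60, TimeUnit.SECONDS,
            // queue size == concurrency, mirroring the sizing described above
            new ArrayBlockingQueue<>(concurrency),
            // all threads busy + queue full => the submitter (indirectly the
            // write path) blocks here until a slot frees up
            (task, executor) -> {
                try
                {
                    executor.getQueue().put(task);
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                }
            });
    }
}
{code}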

While the choice of concurrency for the two executors makes perfect sense, the queue sizing does not. As a user, I would expect, and did expect, that for a given CF whose memtable is independently tuned (w.r.t. flushing thresholds etc.), writes to that CF would not block until there is at least one other memtable *for that CF* waiting to be flushed.

With the current behavior, if I am not misinterpreting, whether or not writes will inappropriately block is very much dependent on not just the overall write throughput, but also the incidental timing of memtable flushes across multiple column families.

The simplest way to mitigate (but not fix) this is probably to set the queue size to be equal to the number of column families if that is higher than the number of CPU cores. But that is only a mitigation because nothing prevents e.g. a large number of memtable flushes for a small column family under temporary write load, can still block a large (possibly more important) memtable flush for another CF. Such a shared-but-larger queue would also not prevent heap usage spikes resulting from some a single cf with very large memtable thresholds being rapidly written to, with a queue sized for lots of cf:s that are in practice not used. In other words, this mitigation technique would effectively negate the backpressure mechanism in some cases and likely lead to more people having OOM issues when saturating a CF with writes.

A more involved change is to make each CF have it's own queue through which flushes go prior to being submitted to flushSorter, which would guarantee that at least one memtable can always be in pending flush state for a given CF. The global queue could effectively have size 1 hard-coded since the queue is no longer really used as if it were a queue.

The flushWriter is unaffected since it is a separate concern that is supposed to be I/O bound. The current behavior would not be perfect if there is a huge discrepancy between memtable flush thresholds of different memtables, but it does not seem high priority to make a change here in practice.

So, I propose either:

(a) changing the flushSorter queue size to be max(num cores, num cfs)
(b) creating a per-cf queue

I'll volunteer to work on it as a nice bite-sized change, assuming there is agreement on what needs to be done. Given the concerns with (a), I think (b) is the right solution unless it turns out to cause major complexity. Worth noting is that these paths are not performance sensitive given the low frequency of memtable flushes, so an extra queueing step should not be an issue.




[jira] Assigned: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Schuller reassigned CASSANDRA-1955:
-----------------------------------------

    Assignee: Peter Schuller




[jira] Commented: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979233#action_12979233 ] 

Peter Schuller commented on CASSANDRA-1955:
-------------------------------------------

Oh. Sorry, I totally missed that the flushSorter was never used for live traffic.

Unless I am missing something further, though, I believe the problem does stand, except that the point of blocking would be the flushWriter instead of the flushSorter - and (a) is probably even less desirable given that the concurrency limitation in the flushWriter is intended for I/O purposes.

Hmmm. Given that the flushWriter is the remaining blocker, an even simpler solution may be to just wait for CASSANDRA-1882 (I will resume it soon), at which point greater concurrency for the flushWriter should be quite acceptable, and the determining factor could by default be the CF count rather than the number of sstable devices. It would still not make CFs independent of each other, though.





[jira] Commented: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979199#action_12979199 ] 

Jonathan Ellis commented on CASSANDRA-1955:
-------------------------------------------

flushsorter is not even used except for binarymemtable.





[jira] Updated: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1955:
--------------------------------------

      Component/s: Core
    Fix Version/s: 0.7.1




[jira] Commented: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979249#action_12979249 ] 

Stu Hood commented on CASSANDRA-1955:
-------------------------------------

> flushwriter blocking is a feature, otherwise you OOM.
That is assuming that the memtables are large: the user had a large number of CFs, and so each memtable would be relatively small.




[jira] Updated: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-1955:
----------------------------------------

    Fix Version/s:     (was: 0.7.2)
                   0.7.3



[jira] [Resolved] (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-1955.
---------------------------------------

       Resolution: Not A Problem
    Fix Version/s:     (was: 1.0)
         Assignee:     (was: Peter Schuller)

CASSANDRA-2006 and CASSANDRA-2427 should address this adequately. (The former by making memtable flushes arbitrarily fine-grained as you increase queue size to the desired length, and the latter by getting rid of the flush storm that often happens when you hit the default flush period.)



[jira] Commented: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979241#action_12979241 ] 

Peter Schuller commented on CASSANDRA-1955:
-------------------------------------------

Heh, I'm aware of that ;) Let me try to make myself clearer. Maybe I'm missing something which means this is not a problem, but to establish that, I first have to make sure I'm making myself understood.

You definitely want to limit the amount of heap space used for memtables, no problem there. The problem, if I have understood things correctly, is that the number of memtables pending flush (i.e., switched away from plus actively being written) is limited by the flush writer concurrency plus the queue length, with the flush writer concurrency tuned with respect to I/O concerns.

So, given a keyspace with many column families, the problem is one of timing. Suppose you have 10 column families that are all written at a reasonable pace, but they all end up triggering a memtable switch at the same time (this is how I interpreted the OP's situation on the mailing list): you get a sudden spike of memtable flushes that is independent of the actual write throughput. If this peak in the number of pending memtables is higher than queue length + concurrency, writes suddenly block, even though you were never close to saturating the write capacity of the node/cluster.
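
To make the arithmetic concrete, here is a toy demonstration (assumed numbers, not measurements from a real node): with 2 flush writers and a queue of 2, 10 simultaneous memtable switches leave roughly 6 submitters blocked even though the machine is otherwise idle.

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Toy demonstration (assumed numbers): 2 flush writers, queue of 2,
// 10 CFs switching memtables at the same instant.
public class FlushSpikeDemo
{
    public static void main(String[] args) throws Exception
    {
        ThreadPoolExecutor flushWriter = new ThreadPoolExecutor(
            2, 2, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2),
            (task, exec) -> {                           // caller blocks when full
                try { exec.getQueue().put(task); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

        for (int cf = 0; cf < 10; cf++)
        {
            final int id = cf;
            new Thread(() -> {
                long start = System.nanoTime();
                flushWriter.execute(() -> sleep(1000)); // pretend a flush takes 1s
                long waitedMs = (System.nanoTime() - start) / 1_000_000;
                if (waitedMs > 50)
                    System.out.printf("cf%d: write path blocked for %d ms%n", id, waitedMs);
            }).start();
        }
        // expected: 2 run + 2 queue immediately; ~6 submitters report blocking
        Thread.sleep(6000);
        flushWriter.shutdown();
    }

    private static void sleep(long ms)
    {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}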

Does this make sense or have I misunderstood how the flush writer executor interacts with the memtable flushing process?






[jira] [Commented] (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208004#comment-13208004 ] 

Peter Schuller commented on CASSANDRA-1955:
-------------------------------------------

Except (I just looked at {{CommitLogAllocator.flushOldestTables()}}) that if you have a significant number of memtables that are only ever flushed due to {{commitlog_total_space_in_mb}}, they seem to be flushed in a single storm, and should thus be able to trigger this (not tested).


[jira] Commented: (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979237#action_12979237 ] 

Jonathan Ellis commented on CASSANDRA-1955:
-------------------------------------------

flushwriter blocking is a *feature*, otherwise you OOM. :)




[jira] [Commented] (CASSANDRA-1955) memtable flushing can block writes due to queue size limitations even though overall write throughput is below capacity

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208007#comment-13208007 ] 

Peter Schuller commented on CASSANDRA-1955:
-------------------------------------------

Actually scratch that. Looks like the write path no longer has any potential for synchronous flushing (if I'm not mistaken), so we should no longer block writes even if we *do* storm.
