Posted to commits@cassandra.apache.org by "Ryan King (JIRA)" <ji...@apache.org> on 2010/02/19 04:20:27 UTC

[jira] Created: (CASSANDRA-809) Full disk can result in being marked down

Full disk can result in being marked down
-----------------------------------------

                 Key: CASSANDRA-809
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-809
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.5, 0.6, 0.7
            Reporter: Ryan King


We had a node fill up the disk under one of its two data directories. As a result, the node stopped making progress. The problem appears to be this (I'll update with more details as we find them):

When new tasks are put onto most queues in Cassandra and the pool cannot accept the task (every thread is busy and the bounded queue is full), the task is run in the caller's thread (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets the caller-runs policy). The queue in question here is the one that manages flushes, which is enqueued to from various places in our code (and therefore likely from multiple threads). Assuming that the full disk meant that no flushing thread could make progress (it appears that way), eventually any thread that calls the flush code would become stalled.
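The stall described above can be reproduced in miniature with plain `java.util.concurrent`. This is a hypothetical sketch, not Cassandra's `DebuggableThreadPoolExecutor`: the class name `CallerRunsDemo`, the single worker, and the one-slot queue are illustrative assumptions. The only worker is parked on a latch (standing in for a flush blocked on a full disk), the bounded queue is filled, and the next submission is rejected, so `CallerRunsPolicy` runs it on the submitting thread. Had that task also blocked on the disk, the caller would have stalled with it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class CallerRunsDemo {
    /** Returns true if the task that overflowed the pool ran on the submitting thread. */
    public static boolean overflowRunsInCaller() {
        // One worker and a one-slot queue: a deliberately tiny, hypothetical stand-in for a flush pool.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch diskFull = new CountDownLatch(1);
        try {
            // Task 1 occupies the only worker, like a flush that cannot make progress.
            pool.execute(() -> {
                try {
                    diskFull.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Task 2 fills the queue.
            pool.execute(() -> { });
            // Task 3 is rejected, so CallerRunsPolicy runs it on this (the caller's) thread.
            AtomicReference<Thread> ranOn = new AtomicReference<>();
            pool.execute(() -> ranOn.set(Thread.currentThread()));
            return ranOn.get() == Thread.currentThread();
        } finally {
            diskFull.countDown(); // release the "stuck" worker so the pool can shut down
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("overflow ran in caller: " + overflowRunsInCaller());
    }
}
```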

Assuming our analysis is right (and we're still looking into it) we need to make a change. Here's a proposal so far:

SHORT TERM:
* change the ThreadPoolExecutor policy so that it is no longer caller-runs. This will let other threads make progress in the event that one pool is stalled.

LONG TERM:
* It appears that there are n threads for the n data directories that we flush to, but they're not dedicated to a data directory. We should have one thread per data directory, dedicated to that directory.
* Perhaps we could use the failure detector on disks?
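A rough sketch of the thread-per-data-directory idea from the first LONG TERM bullet, assuming one single-thread executor per directory. `DirectoryFlushers`, `submit`, and the `/data1` and `/data2` paths are hypothetical names for illustration, not Cassandra code. A flush stuck on one directory only holds up that directory's queue, so flushes to the other directory keep completing.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class DirectoryFlushers {
    // One dedicated single-thread executor per data directory (hypothetical layout).
    private final Map<String, ExecutorService> byDirectory = new LinkedHashMap<>();

    public DirectoryFlushers(List<String> directories) {
        for (String dir : directories) {
            byDirectory.put(dir, Executors.newSingleThreadExecutor());
        }
    }

    /** Queues a flush on the thread dedicated to the given directory. */
    public Future<?> submit(String directory, Runnable flush) {
        ExecutorService executor = byDirectory.get(directory);
        if (executor == null) {
            throw new IllegalArgumentException("unknown data directory: " + directory);
        }
        return executor.submit(flush);
    }

    public void shutdown() {
        byDirectory.values().forEach(ExecutorService::shutdown);
    }

    /** Returns true if a flush to /data2 completes while /data1's thread is stuck. */
    public static boolean otherDirectoryStillProgresses() {
        DirectoryFlushers flushers = new DirectoryFlushers(List.of("/data1", "/data2"));
        CountDownLatch diskFull = new CountDownLatch(1);
        try {
            // Simulate a flush that hangs because /data1 is full.
            flushers.submit("/data1", () -> {
                try {
                    diskFull.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // A flush to the healthy directory should still finish promptly.
            flushers.submit("/data2", () -> { }).get(5, TimeUnit.SECONDS);
            return true;
        } catch (Exception e) {
            return false; // timed out or failed: the healthy directory was blocked
        } finally {
            diskFull.countDown();
            flushers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy directory progressed: " + otherDirectoryStillProgresses());
    }
}
```

Note that `Executors.newSingleThreadExecutor()` uses an unbounded queue, so a real version would still need some cap on pending flushes per directory.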


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-809) Full disk can result in being marked down

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-809:
-------------------------------------

    Affects Version/s:     (was: 0.7)
                           (was: 0.6)
                           (was: 0.5)
        Fix Version/s: 0.7



[jira] Commented: (CASSANDRA-809) Full disk can result in being marked down

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836008#action_12836008 ] 

Ryan King commented on CASSANDRA-809:
-------------------------------------

> > change the ThreadPoolExecutor policy to not be caller runs. This will let other threads make progress in the event that one pool is stalled
> 
> disagree. you can only do this by uncapping the collection, which is a cure worse than the disease. (you go back to being able to make a node GC storm to death really really easily when you give it more data than it can flush) 

I'll have to think more about this. I agree that we need to not let the queues grow in an unbounded way, but our current setup has problems too: basically all threads can be consumed by one queue, and some of them will wait indefinitely for conditions.

We need to decide which kind of failure we want here. Our node that hit this condition is essentially dead (it's not gossiping or accepting any writes or reads, but the process is still alive).
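One possible middle ground between the two failure modes discussed here (callers stalled indefinitely under caller-runs, or an unbounded queue) is to let a submitter wait for queue space only for a bounded time and then fail loudly. The `RejectedExecutionHandler` below is a hypothetical sketch; `TimedBlockingPolicy` and the 100 ms timeout are assumptions for illustration, not something proposed in this thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TimedBlockingPolicy implements RejectedExecutionHandler {
    private final long timeoutMillis;

    public TimedBlockingPolicy(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor pool) {
        if (pool.isShutdown()) {
            throw new RejectedExecutionException("pool is shut down");
        }
        try {
            // Wait a bounded time for queue space instead of running the task on the caller.
            if (!pool.getQueue().offer(task, timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new RejectedExecutionException("queue still full after " + timeoutMillis + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException("interrupted while waiting for queue space", e);
        }
    }

    /** Returns true if a submission to a stalled, full pool is rejected instead of hanging. */
    public static boolean stalledPoolRejectsAfterTimeout() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),
                new TimedBlockingPolicy(100));
        CountDownLatch diskFull = new CountDownLatch(1);
        try {
            // The only worker blocks, like a flush stuck on a full disk.
            pool.execute(() -> {
                try {
                    diskFull.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool.execute(() -> { }); // fills the one-slot queue
            try {
                pool.execute(() -> { }); // no room: waits about 100 ms, then is rejected
                return false;
            } catch (RejectedExecutionException expected) {
                return true;
            }
        } finally {
            diskFull.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("stalled pool rejected after timeout: " + stalledPoolRejectsAfterTimeout());
    }
}
```

Because the handler puts the task straight onto the queue, it assumes the pool's threads are already running, and a task queued in a race with shutdown could be stranded; a production version would need to handle both cases.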

> > It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory 
> 
> yes, this would be my preferred design. should be straightforward code to write, just hasn't been done yet. 

I agree. We'll take a look at this early next week.

> > Perhaps we could use the failure detector on disks? 
> 
> Not sure what this looks like but I agree our story here needs a lot of improvement. 

I'm not entirely sure what this looks like either, but here are the properties I'd like Cassandra to have:

* if a disk fills up, we stop trying to write to it
* if we're about to write more data to a disk than it has space available for, we don't try to write to that disk
* we balance data relatively evenly between disks
* if a disk is misbehaving for a period of time, we stop using it and assume that data is lost (potentially notify an operator as well)
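The four properties above can be sketched as a small placement policy. Everything here is hypothetical: `DataDirectoryChooser`, its fields, and the error threshold are illustrative names, not Cassandra code. Free space is passed in directly so the example runs without real disks (a real node might read it from `java.io.File.getUsableSpace()`). Picking the healthy directory with the most usable space is one simple way to balance, and a consecutive-error count stands in for "misbehaving for a period of time".

```java
import java.util.ArrayList;
import java.util.List;

public class DataDirectoryChooser {
    /** Hypothetical per-directory state; usableBytes would come from the filesystem. */
    public static final class DataDirectory {
        public final String path;
        public long usableBytes;
        public int consecutiveErrors;

        public DataDirectory(String path, long usableBytes) {
            this.path = path;
            this.usableBytes = usableBytes;
        }
    }

    private final List<DataDirectory> directories;
    private final int maxConsecutiveErrors;

    public DataDirectoryChooser(List<DataDirectory> directories, int maxConsecutiveErrors) {
        this.directories = new ArrayList<>(directories);
        this.maxConsecutiveErrors = maxConsecutiveErrors;
    }

    /** Records a failed write; after too many in a row the directory is treated as dead. */
    public void recordError(DataDirectory dir) {
        dir.consecutiveErrors++;
    }

    /** Records a successful write, resetting the error streak. */
    public void recordSuccess(DataDirectory dir) {
        dir.consecutiveErrors = 0;
    }

    private boolean isHealthy(DataDirectory dir) {
        return dir.consecutiveErrors < maxConsecutiveErrors;
    }

    /**
     * Picks the healthy directory with the most usable space that can hold
     * estimatedBytes, or returns null if none can (the caller should refuse the write).
     */
    public DataDirectory choose(long estimatedBytes) {
        DataDirectory best = null;
        for (DataDirectory dir : directories) {
            if (!isHealthy(dir) || dir.usableBytes < estimatedBytes) {
                continue; // misbehaving, full, or too small for this write
            }
            if (best == null || dir.usableBytes > best.usableBytes) {
                best = dir;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DataDirectory a = new DataDirectory("/data1", 500);
        DataDirectory b = new DataDirectory("/data2", 2_000);
        DataDirectoryChooser chooser = new DataDirectoryChooser(List.of(a, b), 3);

        System.out.println(chooser.choose(1_000).path); // /data2: /data1 lacks room
        for (int i = 0; i < 3; i++) {
            chooser.recordError(b);                      // /data2 keeps failing
        }
        System.out.println(chooser.choose(100).path);   // /data1: /data2 retired
        System.out.println(chooser.choose(1_000));      // null: nowhere safe to write
    }
}
```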

> Short term, my recommendation is to run w/ data files on a single RAID0 unless you're sure you'll never get near the filling up point.

This is probably the best advice for new clusters. Unfortunately we can't easily implement this right now.



[jira] Commented: (CASSANDRA-809) Full disk can result in being marked down

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835572#action_12835572 ] 

Jonathan Ellis commented on CASSANDRA-809:
------------------------------------------

> change the ThreadPoolExecutor policy to not be caller runs. This will let other threads make progress in the event that one pool is stalled 

disagree.  you can only do this by uncapping the collection, which is a cure worse than the disease.  (you go back to being able to make a node GC storm to death really really easily when you give it more data than it can flush)
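The objection can be illustrated with plain `java.util.concurrent` in a hypothetical sketch (`QueueTradeoff` and the sizes are illustrative, not Cassandra's configuration). With the worker stuck and caller-runs removed, a bounded queue has to reject the overflow (here via `AbortPolicy`), while an unbounded `LinkedBlockingQueue` accepts everything and simply grows. In a real node each queued flush would presumably keep memtable data on the heap, which is where the GC pressure comes from.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueTradeoff {
    /**
     * Submits `tasks` jobs to a single-worker pool whose worker is stuck and
     * returns {tasks left waiting in the queue, tasks rejected}.
     */
    public static int[] saturate(BlockingQueue<Runnable> queue, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, queue,
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch diskFull = new CountDownLatch(1);
        int rejected = 0;
        try {
            // The first task takes the only worker and blocks, like a stalled flush.
            pool.execute(() -> {
                try {
                    diskFull.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            for (int i = 1; i < tasks; i++) {
                try {
                    pool.execute(() -> { });
                } catch (RejectedExecutionException e) {
                    rejected++; // bounded queue is full: the caller must handle the failure
                }
            }
            return new int[] {queue.size(), rejected};
        } finally {
            diskFull.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] unbounded = saturate(new LinkedBlockingQueue<Runnable>(), 1000);
        int[] bounded = saturate(new ArrayBlockingQueue<Runnable>(10), 1000);
        System.out.println("unbounded: queued=" + unbounded[0] + " rejected=" + unbounded[1]);
        System.out.println("bounded:   queued=" + bounded[0] + " rejected=" + bounded[1]);
    }
}
```

Either way the caller learns something is wrong: with the bounded queue it gets an exception immediately, while with the unbounded queue the backlog (999 tasks here) keeps growing for as long as the disk stays stuck.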

> It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory 

yes, this would be my preferred design.  should be straightforward code to write, just hasn't been done yet.

> Perhaps we could use the failure detector on disks? 

Not sure what this looks like but I agree our story here needs a lot of improvement.

Short term, my recommendation is to run w/ data files on a single RAID0 unless you're sure you'll never get near the filling up point.
