Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2010/06/19 01:41:24 UTC

[jira] Created: (HBASE-2752) Don't retry forever when waiting on too many store files

Don't retry forever when waiting on too many store files
--------------------------------------------------------

                 Key: HBASE-2752
                 URL: https://issues.apache.org/jira/browse/HBASE-2752
             Project: HBase
          Issue Type: Improvement
            Reporter: Jean-Daniel Cryans
            Assignee: stack
            Priority: Critical
             Fix For: 0.20.5, 0.21.0


HBASE-2087 introduced a way to not block all flushes when one region has too many store files. Unfortunately, that undid the behavior where, if we had waited longer than 90 secs, we would still flush the region... which means that when a region blocks inserts because its memstore is too big, it's actually holding off writes for a very long time, occupying handlers, etc.

We need to add more smarts in MemStoreFlusher so that we detect when a region was held up for too long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880376#action_12880376 ] 

stack commented on HBASE-2752:
------------------------------

So, after chatting w/ J-D (and Dave), the minimally invasive fix would keep the hbase-2087 fix that narrowed the flush block to a single memstore rather than blocking all flushes, but restore the 0.20.3 behavior where, if a compaction hasn't happened w/i 90 seconds, we go ahead w/ the flush anyway.
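
To make the proposed rule concrete, here is a minimal sketch of the decision only. It is not the patch: the class name FlushWaitPolicy, the method shouldFlushNow and its parameters are all hypothetical, and the 90-second value is just the figure quoted in this thread.

```java
/** Hypothetical illustration of the "wait, but not forever" rule; not HBase code. */
class FlushWaitPolicy {
    // Assumed value: the 90 seconds mentioned in this thread.
    static final long BLOCKING_WAIT_MS = 90000L;

    /**
     * Returns true when the flush should go ahead now.
     *
     * @param tooManyStoreFiles whether the region is over its store file limit
     * @param waitedMs          how long this flush request has been waiting
     * @param blockingWaitMs    the longest we are willing to wait on a compaction
     */
    static boolean shouldFlushNow(boolean tooManyStoreFiles, long waitedMs, long blockingWaitMs) {
        if (!tooManyStoreFiles) {
            return true;                    // nothing to wait for
        }
        return waitedMs >= blockingWaitMs;  // waited long enough, flush anyway
    }
}
```

Only the region with too many store files is held back here; other regions never enter the waiting branch, which is the part of HBASE-2087 the fix keeps.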

[jira] Resolved: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2752.
--------------------------

    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed to branch and trunk.  Resolving.  Thanks for the review, J-D.

[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880422#action_12880422 ] 

Dave Latham commented on HBASE-2752:
------------------------------------

Thanks for the quick work.  It's really appreciated.  I'll try to get this patch tested on a cluster.

Minor nits:
* The log on "Cache flush failed" should use toStringBinary for the region name.
* blockingWaitTime / 100 seems somewhat arbitrary for check interval, but probably fine for now.
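
For a sense of scale on the second nit, a tiny sketch of the arithmetic, assuming blockingWaitTime is the 90-second wait discussed in this thread. The class and method names are hypothetical; only the "/ 100" comes from the comment above.

```java
/** Hypothetical helper showing the check interval arithmetic; not HBase code. */
class CheckInterval {
    /** Delay between re-checks of a waiting region, as blockingWaitTime / 100. */
    static long checkIntervalMs(long blockingWaitTimeMs) {
        return blockingWaitTimeMs / 100;
    }

    public static void main(String[] args) {
        // Assumed input: a 90-second blocking wait.
        System.out.println(checkIntervalMs(90000L) + " ms between checks");
    }
}
```

Under that assumption a waiting region is looked at again roughly every 900 ms, so up to about 100 times before the wait runs out.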


[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880390#action_12880390 ] 

Jean-Daniel Cryans commented on HBASE-2752:
-------------------------------------------

I like it. Some comments:

 - requeueCount in FQE could be a boolean, that's how it's used.
 - isMaximumWait isn't documented

With that fixed and some cluster load testing, I'm +1 for commit.

[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880426#action_12880426 ] 

stack commented on HBASE-2752:
------------------------------

Thanks j-d for review.  I added in your first suggestion.  For the second, I kept count.  I think it'll be of use when we have a jsp page that dumps current state of the flush queue.

I've been running it up on cluster.  I see some of these during a big upload:

{code}
2010-06-18 18:02:17,864 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Waited 90495ms on a compaction to clean up 'too many store files'; waited long enough... proceeding with flush
{code}

...so it looks like we got the 0.20.3 behavior back, where we'll go ahead and flush regardless once we've waited N ms (I left the interval at the 0.20.3 90 seconds, which seems a bit long, but...).

I'm going to commit and roll an RC

[jira] Updated: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2752:
-------------------------

    Attachment: 2752.txt

Notes on the patch for 0.20.5RC4:

Adds a delayqueue to hold regions to flush.

The delay queue takes a data structure that holds the Region and the time of construction, so we can tell how long we've been hanging out in the queue.

The flushRegion method was refactored to remove crud.  A lot of the crud was old comments talking of 'compactions running inline with flush', behaviors long since left behind.  flushRegion was split into two methods: one that first checks whether we should delay, and another that holds everything else.  The former is called whenever we flush normally.  The latter is called directly when an emergency flush is required.

Removed checkStoreFileCount.  It was only being used in the emergency flush case, but its use there was incorrect (if it's an emergency flush, we don't want to wait if there are too many store files).

Please review.
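
For readers without the attachment handy, a rough sketch of what such a delay queue entry could look like, built on java.util.concurrent.DelayQueue. This is not the code in 2752.txt. The names requeueCount and isMaximumWait appear in the review comments on this issue, but the class name FlushEntry, the signatures, and the use of a String in place of the region are assumptions made for illustration.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical queue entry: a region name plus how long it has been waiting. */
class FlushEntry implements Delayed {
    final String regionName;      // stand-in for the real region object
    final long createTimeMs;      // when the flush was first requested
    private long whenToExpireMs;  // when the queue will hand this entry out
    private int requeueCount = 0;

    FlushEntry(String regionName, long nowMs) {
        this.regionName = regionName;
        this.createTimeMs = nowMs;
        this.whenToExpireMs = nowMs;  // available immediately on first enqueue
    }

    /** True once the entry has waited at least maximumWaitMs since construction. */
    boolean isMaximumWait(long maximumWaitMs, long nowMs) {
        return nowMs - createTimeMs >= maximumWaitMs;
    }

    /** Delay the entry by delayMs from nowMs. Call only after removing it from the queue. */
    FlushEntry requeue(long delayMs, long nowMs) {
        this.whenToExpireMs = nowMs + delayMs;
        this.requeueCount++;
        return this;
    }

    int getRequeueCount() {
        return requeueCount;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(whenToExpireMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

class FlushQueueSketch {
    public static void main(String[] args) {
        DelayQueue<FlushEntry> queue = new DelayQueue<FlushEntry>();
        queue.add(new FlushEntry("region-a", System.currentTimeMillis()));

        FlushEntry entry = queue.poll();  // ready right away on first enqueue
        // Pretend the region still has too many store files: look again in 900 ms.
        queue.add(entry.requeue(900L, System.currentTimeMillis()));

        System.out.println("requeues=" + entry.getRequeueCount()
            + " maxWaitReached=" + entry.isMaximumWait(90000L, System.currentTimeMillis()));
    }
}
```

Keeping the construction time separate from the expiry time is what lets a requeued entry still report its total wait, which is the "held up for too long" check the issue asks for.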

[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880427#action_12880427 ] 

stack commented on HBASE-2752:
------------------------------

Applied to branch and trunk (Let's talk, jgray).

[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880369#action_12880369 ] 

Jean-Daniel Cryans commented on HBASE-2752:
-------------------------------------------

This is something that Dave Latham was hitting when trying out 0.20.5 RC3. One of his region servers was compacting non-stop and a region could be blocked for minutes.
