Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2008/06/11 00:19:45 UTC

[jira] Created: (HBASE-678) hbase needs a 'safe-mode'

hbase needs a 'safe-mode'
-------------------------

                 Key: HBASE-678
                 URL: https://issues.apache.org/jira/browse/HBASE-678
             Project: Hadoop HBase
          Issue Type: Improvement
            Reporter: stack
            Priority: Minor


Internally we have a cluster of thousands of regions.  We just did an hbase restart w/ the master on a new node.  It just so happened that one of the regionservers was running extra slow (it was loaded down by other processes).  That meant its portion of the assignments took a long time to come up...  While these regions were stuck in deploy mode, the cluster was not usable.

We need a sort of 'safe-mode' in hbase where clients fail if they try to attach to a cluster not yet fully up.  UI should show when all assignments have been successfully made so admin can at least see when they have a problematic regionserver in their midst.
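
A minimal sketch of the behaviour requested above, assuming the master tracks which assigned regions have reported as open. The names `Master`, `report_open`, `pending_by_server` and `ClusterNotReadyError` are hypothetical and are not part of any HBase API; this only illustrates the idea of failing clients early and exposing the slow regionserver to the admin.

```python
class ClusterNotReadyError(Exception):
    """Raised when a client attaches while the cluster is still in safe mode."""


class Master:
    """Toy master that stays in safe mode until every region is deployed."""

    def __init__(self, assignments):
        # assignments: region name -> regionserver expected to open it
        self.assignments = dict(assignments)
        self.deployed = set()

    def report_open(self, region):
        # Called when a regionserver reports a region as successfully opened.
        if region not in self.assignments:
            raise KeyError(region)
        self.deployed.add(region)

    def in_safe_mode(self):
        # Safe mode lasts until all assignments have completed.
        return len(self.deployed) < len(self.assignments)

    def pending_by_server(self):
        # What a status UI could show: undeployed regions per regionserver.
        pending = {}
        for region, server in self.assignments.items():
            if region not in self.deployed:
                pending.setdefault(server, []).append(region)
        return pending

    def connect(self):
        # Clients fail fast instead of hanging on regions stuck in deploy.
        if self.in_safe_mode():
            raise ClusterNotReadyError("cluster still in safe mode")
        return "connected"
```

With something like this, an admin looking at `pending_by_server()` would see at a glance which regionserver is holding up startup.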

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629240#action_12629240 ] 

Billy Pearson commented on HBASE-678:
-------------------------------------

Yes, that's what I was thinking: we could still queue up the compaction requests on the region server but only start the compactions once the master leaves safe mode.
Either the master sends the message to the region servers, or each region server queries the master for safe mode status.

If we do not want to stop compaction while in safe mode then we need to decline to close a region for redeployment while there is a compaction happening.
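
The queue-then-run idea above could be sketched roughly as follows. This is an illustration under assumptions, not HBase code: `RegionServer`, `request_compaction` and `leave_safe_mode` are hypothetical names, and a compaction is modelled as nothing more than appending the region name to a list.

```python
from collections import deque


class RegionServer:
    """Toy regionserver that defers compactions while in safe mode."""

    def __init__(self):
        self.safe_mode = True
        self.pending = deque()   # compaction requests queued during safe mode
        self.compacted = []      # regions compacted so far, in order

    def request_compaction(self, region):
        if self.safe_mode:
            # Queue the request, skipping duplicates for the same region.
            if region not in self.pending:
                self.pending.append(region)
        else:
            self._compact(region)

    def leave_safe_mode(self):
        # Would be triggered by a message from the master, or by polling it.
        self.safe_mode = False
        while self.pending:
            self._compact(self.pending.popleft())

    def _compact(self, region):
        # Stand-in for the real compaction work.
        self.compacted.append(region)
```

Because nothing compacts during safe mode in this sketch, a region close for redeployment never has to wait on a running compaction.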



> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.19.0


[jira] Commented: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Daniel Leffel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627779#action_12627779 ] 

Daniel Leffel commented on HBASE-678:
-------------------------------------

One extra thing to tack onto this would be to block region balancing while in Safe Mode. Currently, I have 600 regions or so, and as HBase is starting up there is a lot of churn from regions closing and opening. It would be great if no balancing happened while in safe mode and then, upon exiting safe mode, regions got balanced.

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.19.0


[jira] Commented: (HBASE-678) hbase needs a 'safe-mode'

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629223#action_12629223 ] 

stack commented on HBASE-678:
-----------------------------

Yeah, Billy, as you suggest elsewhere, compaction should not be inline with close.

Above, you are suggesting that while in 'safe mode' there are no compactions or splits, so balancing happens promptly?

On exit of 'safe mode', the compactions and splits could begin?

I suppose the master can send a message to all regionservers when it wants all to leave 'safe mode'.

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.19.0


[jira] Assigned: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman reassigned HBASE-678:
-----------------------------------

    Assignee: Jim Kellerman

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.3.0


[jira] Updated: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HBASE-678:
--------------------------------

    Fix Version/s:     (was: 0.18.0)
                   0.19.0

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.19.0


[jira] Commented: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Billy Pearson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629069#action_12629069 ] 

Billy Pearson commented on HBASE-678:
-------------------------------------

I think we should have a multi-stage safe mode on the master, not just for the clients but also to handle crash recovery of regions and region balancing while we are in safe mode.
We have an issue for helping the balancing out in HBASE-862, but I think it will still be helpful to include all startup balancing while in safe mode.

Assuming we do not run the needed compaction/split checks on loading regions while in safe mode, and just queue them up:
Stage 1: Deploy all regions.
Stage 2: Do any crash recovery needed and do a flush to get that to disk (remove recovery logs on successful flush).
Stage 3: Do any balancing of the regions before exiting safe mode, if needed.

Stage 3 is there so we do not have any compactions or splits running on the regions and we can move them around as we need to in order to balance the region count out.
If there are no compactions running, closes happen immediately.

I have seen some rebalancing happen on startup, and the region servers go crazy trying to balance, as Daniel commented above.
In my cluster this mostly comes from closing regions having to wait for running compactions, which creates a lag in the balancing counts.
When the compactions finish and the regions get closed and redeployed, the counts are all out of balance again, and the same thing happens over and over until almost all the compactions are done
and the regions can close and redeploy without the compaction lag.
Once we have done the above, all will be ready for the clients to connect to the cluster without having to worry about churn in balancing or crash-recovering regions.

Daniel: If we block region balancing while in Safe Mode, your clients can connect when we come out of safe mode, but then balancing will kick in and you will see the same churn as we have now.
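
The staged startup proposed in this comment could be sketched as below. It is only an illustration under assumptions: the `Stage` enum, `balance` and `run_startup` are hypothetical names, regions are reduced to per-server counts, and the balancing rule (move one region at a time from the most loaded to the least loaded server until the counts differ by at most one) is a simple stand-in, not the balancer HBase uses.

```python
from enum import Enum


class Stage(Enum):
    DEPLOY = 1    # Stage 1: deploy all regions
    RECOVER = 2   # Stage 2: crash recovery plus flush
    BALANCE = 3   # Stage 3: balance region counts
    OPEN = 4      # safe mode exited, clients may connect


def balance(counts):
    """Even out region counts; returns (new_counts, number_of_moves)."""
    counts = dict(counts)
    moves = 0
    if not counts:
        return counts, moves
    while True:
        most = max(counts, key=counts.get)
        least = min(counts, key=counts.get)
        if counts[most] - counts[least] <= 1:
            return counts, moves
        # With no compactions running, a close/reopen is treated as immediate.
        counts[most] -= 1
        counts[least] += 1
        moves += 1


def run_startup(counts):
    """Walk the stages in order; clients are only admitted at Stage.OPEN."""
    visited = []
    for stage in (Stage.DEPLOY, Stage.RECOVER, Stage.BALANCE):
        visited.append(stage)
        if stage is Stage.BALANCE:
            counts, _ = balance(counts)
    visited.append(Stage.OPEN)
    return counts, visited
```

Since balancing finishes before `Stage.OPEN` in this sketch, clients would not see region churn after they connect, which is the point made to Daniel above.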

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.19.0


[jira] Resolved: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman resolved HBASE-678.
---------------------------------

    Resolution: Fixed

All tasks completed. Resolving issue.

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.19.0


[jira] Updated: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HBASE-678:
--------------------------------

    Priority: Blocker  (was: Critical)

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.19.0


[jira] Updated: (HBASE-678) hbase needs a 'safe-mode'

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HBASE-678:
--------------------------------

    Fix Version/s: 0.3.0
         Priority: Critical  (was: Minor)

Marking as critical for 0.3.0 because several other issues will depend on this.

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.3.0