Posted to dev@hbase.apache.org by "elsif (JIRA)" <ji...@apache.org> on 2009/11/09 20:17:32 UTC

[jira] Created: (HBASE-1964) Add internal status monitoring to RegionServer

Add internal status monitoring to RegionServer
----------------------------------------------

                 Key: HBASE-1964
                 URL: https://issues.apache.org/jira/browse/HBASE-1964
             Project: Hadoop HBase
          Issue Type: Improvement
          Components: client
    Affects Versions: 0.20.1
            Reporter: elsif 


When a hadoop/hbase cluster is under heavy load it will inevitably reach a tipping point where data is lost or corrupted.  A
graceful method is needed to put the cluster into safe mode until more resources can be added or the load on the cluster has been
reduced.  

St.Ack has suggested the following short-term task: "Meantime, it should be possible to have a cron run a script that checks
cluster resources from time-to-time -- e.g. how full hdfs is, how much each regionserver is carrying -- and when it determines the needle is in the red,
flip the cluster to be read-only."
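
The check St.Ack describes could be sketched roughly as follows. This is a minimal illustration, not anything from HBase itself: the threshold values, the names Thresholds, reasons_in_the_red and check_once, and the set_read_only callback are all hypothetical. How the two inputs are gathered and how the cluster is actually flipped to read-only are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Thresholds:
    # Hypothetical limits; tune them for the cluster being watched.
    max_hdfs_used_fraction: float = 0.90
    max_regions_per_server: int = 200


def reasons_in_the_red(hdfs_used_fraction: float,
                       regions_per_server: Dict[str, int],
                       limits: Thresholds) -> List[str]:
    """Return the reasons the cluster looks overloaded (empty if healthy)."""
    reasons = []
    if hdfs_used_fraction >= limits.max_hdfs_used_fraction:
        reasons.append(f"hdfs {hdfs_used_fraction:.0%} full")
    # Sort by server name so the report is stable from run to run.
    for server, regions in sorted(regions_per_server.items()):
        if regions > limits.max_regions_per_server:
            reasons.append(f"{server} carries {regions} regions")
    return reasons


def check_once(hdfs_used_fraction: float,
               regions_per_server: Dict[str, int],
               limits: Thresholds,
               set_read_only: Callable[[List[str]], None]) -> List[str]:
    """One cron-style pass: invoke set_read_only only when a limit is exceeded."""
    reasons = reasons_in_the_red(hdfs_used_fraction, regions_per_server, limits)
    if reasons:
        # set_read_only is a caller-supplied hook (an assumption of this sketch).
        set_read_only(reasons)
    return reasons
```

A cron entry would run a small wrapper that collects the HDFS usage and per-regionserver region counts, then calls check_once with a hook that performs the site-specific read-only switch.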

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-1964) Enter temporary "safe mode" to ride over transient FS layer problems

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1964:
----------------------------------

    Affects Version/s:     (was: 0.20.1)
        Fix Version/s: 0.21.0
             Assignee: Andrew Purtell
              Summary: Enter temporary "safe mode" to ride over transient FS layer problems  (was: Add internal status monitoring to RegionServer)

Refocus this issue as 'Enter temporary "safe mode" to ride over transient FS layer problems', as part of ride over restart.


[jira] Updated: (HBASE-1964) Enter temporary "safe mode" to ride over transient FS layer problems

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1964:
----------------------------------

    Issue Type: Sub-task  (was: Improvement)
        Parent: HBASE-2183


[jira] Commented: (HBASE-1964) Add internal status monitoring to RegionServer

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775432#action_12775432 ] 

Andrew Purtell commented on HBASE-1964:
---------------------------------------

bq. When a hadoop/hbase cluster is under heavy load it will inevitably reach a tipping point where data is lost or corrupted

We take exception to this statement. One can corrupt an Oracle database by overcommitting RAM such that the kernel panics in get_free_page (on Linux). 

bq. A graceful method is needed to put the cluster into safe mode until more resources can be added or the load on the cluster has been reduced. 

There is no substitute for competent monitoring and administration of production systems, especially ones which try to support terascale or petascale storage and computation over tens or hundreds of servers. That said, HBase certainly has opportunities to sense overloading and take self-preserving actions where it currently does not.
