Posted to dev@hbase.apache.org by "Ferdy (JIRA)" <ji...@apache.org> on 2010/01/12 16:33:54 UTC

[jira] Created: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Simple check on the master overview page if the number of currently running regionservers is unchanged.
-------------------------------------------------------------------------------------------------------

Key: HBASE-2117
URL: https://issues.apache.org/jira/browse/HBASE-2117
Project: Hadoop HBase
Issue Type: New Feature
Components: master, regionserver
Affects Versions: 0.20.2
Reporter: Ferdy

Occasionally, some of our regionservers just stop working. The regionserver logs show some sort of termination, and the affected regionserver is simply removed from the master page. Besides the actual problem of the termination, what I was missing was some sort of warning (from either running client code or the master page) that some regionservers are having trouble.

It seems like the Master is OK with the fact that a regionserver suddenly decides to stop. The result is that clients depending on the data in HBase will be presented with an incomplete data set, at least until the regions of the failed server have been reassigned. To have this monitored, I created a patch that exposes an extra piece of information on the master page. An 'OK:' is shown if the current number of regionservers is unchanged since the processes started. An 'ERROR:' is shown whenever the current number differs. The master page reads the 'regionservers' file once and remembers the number of slaves so that it can be used in the check. (So later changes to this file are not picked up.)
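
A minimal sketch of the check described above, for illustration only. It is not the code from the attached patch: the class name, the method names, the default path 'conf/regionservers' and the handling of blank and '#' lines are all assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper; not the class from the attached patch. */
final class RegionServerCountCheck {

  /** Counts host entries, assuming one hostname per line; blank and '#' lines are skipped. */
  static int countConfigured(List<String> lines) {
    int count = 0;
    for (String line : lines) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
        count++;
      }
    }
    return count;
  }

  /** Builds the status line that the master page would show. */
  static String status(int configured, int live) {
    if (configured == live) {
      return "OK: " + live + " regionservers running, as configured.";
    }
    return "ERROR: " + live + " regionservers running, but " + configured + " configured.";
  }

  public static void main(String[] args) throws IOException {
    // Assumed location of the file; pass the real path as the first argument.
    String path = args.length > 0 ? args[0] : "conf/regionservers";
    int configured = countConfigured(Files.readAllLines(Paths.get(path)));
    // The live count would come from the master; here it is the second argument.
    int live = args.length > 1 ? Integer.parseInt(args[1]) : configured;
    System.out.println(status(configured, live));
  }
}
```

The counting and the message are kept in separate methods so that the live count can come from wherever the master keeps it.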

Perhaps this is not the right way of doing things. Please let me know if there are any existing solutions for these issues.

I will attach a patch right away.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799289#action_12799289 ] 

stack commented on HBASE-2117:
------------------------------

Please add the Apache license to the head of your new class. Also add a class comment saying what your new class does. getDiagnostics is an overly generic name for something that compares the content of the regionservers file to the count of running nodes. Otherwise the patch looks good.

[jira] Updated: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Ferdy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated HBASE-2117:
-------------------------

    Status: Patch Available  (was: Open)

[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801940#action_12801940 ] 

Lars George commented on HBASE-2117:
------------------------------------

Quick update: I am currently working on JMX graphing and Nagios support for HBase, and I saw that Hadoop already has this in place. The NameNode exposes

*numLiveDataNodes=INTEGER
*numDeadDataNodes=INTEGER

which are two JMX operations querying the numbers. In Nagios this would then be checked against the known total number of nodes. I am opening an issue to add this to the HBase Master too. 
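
As a rough sketch of what two such JMX operations could look like on the HBase side, the example below registers a standard MBean and reads one count back through the platform MBean server. Everything in it is hypothetical: the interface, the operation names numLiveRegionServers and numDeadRegionServers, and the object name 'hbase.sketch:type=ClusterCounts' are made up for illustration and are not an existing HBase API.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class MasterJmxSketch {

  /** Hypothetical management interface; the names only mirror the NameNode ones. */
  public interface ClusterCountsMBean {
    int numLiveRegionServers();
    int numDeadRegionServers();
  }

  /** Trivial implementation holding fixed counts; a real master would compute them. */
  public static final class ClusterCounts implements ClusterCountsMBean {
    private final int live;
    private final int dead;

    public ClusterCounts(int live, int dead) {
      this.live = live;
      this.dead = dead;
    }

    @Override
    public int numLiveRegionServers() {
      return live;
    }

    @Override
    public int numDeadRegionServers() {
      return dead;
    }
  }

  /** Registers the bean, invokes one no-argument operation by name, then unregisters. */
  static int query(String operation, int live, int dead) {
    try {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      // Made-up object name for this sketch.
      ObjectName name = new ObjectName("hbase.sketch:type=ClusterCounts");
      server.registerMBean(new ClusterCounts(live, dead), name);
      try {
        return (Integer) server.invoke(name, operation, new Object[0], new String[0]);
      } finally {
        server.unregisterMBean(name);
      }
    } catch (Exception e) {
      throw new IllegalStateException("JMX query failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("live=" + query("numLiveRegionServers", 4, 1));
    System.out.println("dead=" + query("numDeadRegionServers", 4, 1));
  }
}
```

Because the method names do not start with get or is, JMX treats them as operations rather than attributes, which matches how the two counts are described above.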

This is not to discourage you or slight the approach here, just saying that it has its own merits and should be available as well.

[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799286#action_12799286 ] 

stack commented on HBASE-2117:
------------------------------

Thanks for the patch.

While it's true that the regionserver file will usually list all members of a cluster, on occasion it may lag what is actually up and running: e.g. up on EC2, I believe the regionserver file is not in alignment with the servers that make up the cluster. So, rather than flagging the mismatch between actual cluster members and the content of the regionserver file as an ERROR, I'd suggest couching it as a gentle hint that the two are not in alignment. Admins should be able to easily ignore this message when they have intentionally misaligned the two. On the other hand, a gentle prompt could be a handy reminder that a newly added regionserver needs to be added to the regionserver file if the admin wants it started as part of the next general cluster restart.

[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Ferdy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802255#action_12802255 ] 

Ferdy commented on HBASE-2117:
------------------------------

I can't tell for sure whether all non-active regionservers are tracked in the 'deadServers' field.

There's always the possibility that a regionserver shuts down in a 'proper' way, or at least in such a way that the Master will not put it in its deadServers set. Also, please note the example of starting a single regionserver by hand. A check against the configuration allows for a reminder to add this new server to your configuration (as mentioned above by stack).
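
To illustrate the concern, here is a small sketch of a tracker in which a cleanly stopped server ends up in neither the live set nor the dead set. This is a hypothetical model, not the actual ServerManager logic: the class, its methods and the assumption that only an expired server is recorded as dead are all made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of live/dead bookkeeping; not HBase's ServerManager. */
final class ServerTrackerSketch {
  private final Set<String> liveServers = new HashSet<String>();
  private final Set<String> deadServers = new HashSet<String>();

  /** A server reporting in becomes live and is no longer considered dead. */
  synchronized void serverStarted(String name) {
    deadServers.remove(name);
    liveServers.add(name);
  }

  /** Assumption: only an expired lease moves a live server to the dead set. */
  synchronized void serverExpired(String name) {
    if (liveServers.remove(name)) {
      deadServers.add(name);
    }
  }

  /** Assumption: a clean shutdown just removes the server, leaving no trace. */
  synchronized void serverStoppedCleanly(String name) {
    liveServers.remove(name);
  }

  synchronized int numLive() {
    return liveServers.size();
  }

  synchronized int numDead() {
    return deadServers.size();
  }

  public static void main(String[] args) {
    ServerTrackerSketch tracker = new ServerTrackerSketch();
    tracker.serverStarted("rs1");
    tracker.serverStarted("rs2");
    tracker.serverStarted("rs3");
    tracker.serverExpired("rs1");        // counted as dead
    tracker.serverStoppedCleanly("rs2"); // counted nowhere
    // prints "live=1 dead=1" although three servers were started
    System.out.println("live=" + tracker.numLive() + " dead=" + tracker.numDead());
  }
}
```

Under these assumptions, live plus dead no longer adds up to the number of servers that were started, which is why a comparison against the configured list can still catch something the dead set misses.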

[jira] Updated: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Ferdy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated HBASE-2117:
-------------------------

    Status: Patch Available  (was: Open)

Set status to 'Patch available'.

[jira] Updated: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Ferdy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated HBASE-2117:
-------------------------

    Status: Open  (was: Patch Available)

[jira] Updated: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Ferdy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated HBASE-2117:
-------------------------

    Attachment: HBASE-2117.patch

[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800155#action_12800155 ] 

Lars George commented on HBASE-2117:
------------------------------------

My 2c is to use Nagios et al. Add the number of regionservers (max/current) to the HMaster metrics and use a check to verify that they are the same. If not, raise an alarm with the typical escalation. I assume that method could also be adopted by the Hadoop team for datanodes and jobtrackers.
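
A sketch of what such a check could look like as a Nagios-style plugin, using the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The class name and the command-line arguments are assumptions; how the two numbers are fetched from the master metrics is deliberately left out.

```java
/** Hypothetical Nagios-style check; the two counts are passed in as arguments. */
final class RegionServerNagiosCheck {
  static final int OK = 0;
  static final int CRITICAL = 2;
  static final int UNKNOWN = 3;

  /** Maps the max/current comparison to a plugin exit code. */
  static int evaluate(int max, int current) {
    if (max < 0 || current < 0) {
      return UNKNOWN; // negative counts mean the metrics could not be read properly
    }
    return max == current ? OK : CRITICAL;
  }

  /** One-line message, as Nagios plugins conventionally print to stdout. */
  static String message(int code, int max, int current) {
    String label = code == OK ? "OK" : code == CRITICAL ? "CRITICAL" : "UNKNOWN";
    return "REGIONSERVERS " + label + " - " + current + " of " + max + " running";
  }

  public static void main(String[] args) {
    int code = UNKNOWN;
    int max = -1;
    int current = -1;
    if (args.length == 2) {
      try {
        max = Integer.parseInt(args[0]);
        current = Integer.parseInt(args[1]);
        code = evaluate(max, current);
      } catch (NumberFormatException e) {
        code = UNKNOWN; // unparsable input is reported as UNKNOWN, not as a failure
      }
    }
    System.out.println(message(code, max, current));
    System.exit(code);
  }
}
```

Any mismatch is treated as CRITICAL here to keep the sketch simple; a real setup might prefer WARNING when more servers are running than expected.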

[jira] Updated: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Ferdy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated HBASE-2117:
-------------------------

    Attachment: HBASE-2117-v2.patch

Thank you for the suggestions.

The patch now includes the ASF license and javadoc. The name of the method is now more specific and the diagnostic message is more subtle. Finally, the number of configured regionservers is reloaded every time the page is requested, which shouldn't be too expensive.

[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802177#action_12802177 ] 

Lars George commented on HBASE-2117:
------------------------------------

The ServerManager has this in a similar fashion

{code}
  /*
   * Set of known dead servers.  On znode expiration, servers are added here.
   * This is needed in case of a network partitioning where the server's lease
   * expires, but the server is still running. After the network is healed,
   * and its server logs are recovered, it will be told to call server startup
   * because by then, its regions have probably been reassigned.
   */
  private final Set<String> deadServers =
    Collections.synchronizedSet(new HashSet<String>());
{code}

Of course this is not the same as comparing the list of configured servers from the config with what is live. But it might be another way to go about it.

Does that make sense?

[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Posted by "Ferdy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802142#action_12802142 ] 

Ferdy commented on HBASE-2117:
------------------------------

Indeed, exposing those two metrics for HBase (live and dead regionservers) is another approach.

I was wondering, though: what method would you use to check the number of dead regionservers?
