Posted to issues@hbase.apache.org by "nkeywal (JIRA)" <ji...@apache.org> on 2012/06/28 18:39:44 UTC

[jira] [Created] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

nkeywal created HBASE-6290:
------------------------------

             Summary: Add a function to mark a server as dead and start the recovery process
                 Key: HBASE-6290
                 URL: https://issues.apache.org/jira/browse/HBASE-6290
             Project: HBase
          Issue Type: Improvement
          Components: monitoring
    Affects Versions: 0.96.0
            Reporter: nkeywal
            Assignee: nkeywal
            Priority: Minor


ZooKeeper is used as a monitoring tool: we use znodes, and we start the recovery process when a znode is deleted by ZK because its session timed out. This timeout defaults to 90 seconds and is often set to 30s.

However, some HW issues could be detected by specialized HW monitoring tools before the ZK timeout. For this reason, it makes sense to offer a very simple function to mark a RS as dead. This should not take in


It could be an hbase shell function such as:
considerAsDead ipAddress|serverName

This would delete all the znodes of the server running on this box, starting the recovery process.


Such a function would be easily callable (at the caller's risk) by any fault detection tool. However, we could have trouble identifying the right master and region servers because of IPv4 vs. IPv6 addresses and multi-networked boxes.
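To illustrate the identification concern, here is a minimal sketch, not a proposed implementation, of comparing two textual addresses with the standard java.net.InetAddress class. The class name AddressMatch and its method are hypothetical. It only shows that one box can be referred to by several strings, and that an IPv4 and an IPv6 address of the same box do not compare equal.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of HBase.
final class AddressMatch {
    private AddressMatch() {}

    /**
     * Returns true if both strings denote the same IP address.
     * IP literals are parsed locally; a hostname argument triggers a name lookup.
     */
    static boolean sameAddress(String a, String b) {
        try {
            return InetAddress.getByName(a).equals(InetAddress.getByName(b));
        } catch (UnknownHostException e) {
            // An argument that cannot be parsed or resolved matches nothing.
            return false;
        }
    }
}
```

Two spellings of the same IPv6 address match, but the IPv4 and IPv6 loopback addresses of one box do not, so a tool passing an IP would have to know which address the server registered with.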


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409675#comment-13409675 ] 

nkeywal commented on HBASE-6290:
--------------------------------

@stack We should not try to connect to the RS: someone we trust told us it was dead. Or, if we do try, it should be with a minimal timeout (otherwise our socket timeout will be longer than the ZooKeeper timeout). So the shell command should just clean up the znode associated with an IP.

It could also be in ZK, or very strongly linked to ZK if we can: if the API allows it, get the sessions associated with this IP and expire them. We know it's easy to expire a session :-).
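If a connection attempt is made at all, the point about the timeout can be sketched as follows. This is an illustration only: the class name QuickProbe and the choice of a plain TCP connect are assumptions, not how HBase talks to a region server. It relies on the standard Socket.connect(SocketAddress, int) call, which bounds the wait explicitly.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of HBase.
final class QuickProbe {
    private QuickProbe() {}

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean answers(String host, int port, int timeoutMillis) {
        if (timeoutMillis <= 0) {
            // For Socket.connect, a timeout of 0 means "wait forever", which defeats the purpose.
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        try (Socket socket = new Socket()) {
            // A refused, unreachable, or timed-out attempt throws an IOException.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

The timeout passed in would have to be well below the ZooKeeper session timeout for the probe to be of any use.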
                

        

[jira] [Commented] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403207#comment-13403207 ] 

stack commented on HBASE-6290:
------------------------------

What would the shell invocation do?  Connect to a RS and call its shutdown, or shutdown + kill znode?  What are you thinking would use this new facility?  (It sounds like a good thing to have.  It would be good to list possible users.)
                

        

[jira] [Updated] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6290:
-------------------------

      Tags: noob
    Labels: noob  (was: )
    

        

[jira] [Commented] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411545#comment-13411545 ] 

stack commented on HBASE-6290:
------------------------------

Sounds good.
                

        

[jira] [Commented] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411560#comment-13411560 ] 

nkeywal commented on HBASE-6290:
--------------------------------

For ZK, there is nothing in the 3.4 API to get or expire sessions. But iterating over the znodes and deleting them is OK.
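A minimal sketch of the iterate-and-delete approach, written against a small hypothetical ZnodeStore interface so that it runs without a cluster. With the real ZooKeeper client the two operations would correspond to ZooKeeper.getChildren(path, false) and ZooKeeper.delete(path, -1), where version -1 matches any version. The interface, the class names, and the assumption that each child znode is named "host,port,startcode" are for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over the two ZooKeeper calls this sketch needs.
interface ZnodeStore {
    List<String> getChildren(String parentPath) throws Exception;
    void delete(String path) throws Exception;
}

// Hypothetical helper, not part of HBase.
final class ZnodeCleaner {
    private ZnodeCleaner() {}

    /**
     * Deletes every child of parentPath whose name starts with "host,".
     * Returns the full paths that were deleted.
     */
    static List<String> deleteForHost(ZnodeStore store, String parentPath, String host)
            throws Exception {
        List<String> deleted = new ArrayList<>();
        String prefix = host + ",";
        // Copy the child list so that deletions cannot disturb the iteration.
        for (String child : new ArrayList<>(store.getChildren(parentPath))) {
            if (child.startsWith(prefix)) {
                String path = parentPath + "/" + child;
                store.delete(path);
                deleted.add(path);
            }
        }
        return deleted;
    }
}
```

Matching on the host part removes the znodes of every server on that box, whatever its port, which is the behaviour described in the issue. A real implementation would also have to tolerate znodes that disappear between the listing and the delete.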
                