Posted to issues@hbase.apache.org by "Jonathan Gray (JIRA)" <ji...@apache.org> on 2010/11/29 17:21:14 UTC

[jira] Commented: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

    [ https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964799#action_12964799 ] 

Jonathan Gray commented on HBASE-3280:
--------------------------------------

Somewhat related to this: on a cluster here, the HRS got stuck in this loop, trying to reconnect to the master and ignoring the YouAreDeadExceptions.  Once the master finished shutdown handling, it removed this server from the dead server list.  The RS then successfully heartbeated in to the master, and the master treated it as a legitimate RS (even though it had just finished processing that server's shutdown).

Is there a reason we should ever clear entries out of the dead server list?  If an RS is in a network partition, it may not check back with the master for a long time, so shouldn't we always remember the dead serverNames (which include start codes)?
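
To make the question concrete, here is a minimal sketch of the two policies. It is not the actual master code: the class DeadServerSketch, its methods, and the host, port, and start code values are all hypothetical, and the only thing taken from the discussion above is that a serverName includes a start code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not HBase's real dead-server bookkeeping.
final class DeadServerSketch {
    // Assumed serverName layout: host,port,startcode.
    static String serverName(String host, int port, long startCode) {
        return host + "," + port + "," + startCode;
    }

    private final Set<String> deadServers = new HashSet<>();
    private final boolean clearAfterShutdownHandling;

    DeadServerSketch(boolean clearAfterShutdownHandling) {
        this.clearAfterShutdownHandling = clearAfterShutdownHandling;
    }

    // Master has decided this server instance is dead.
    void markDead(String name) {
        deadServers.add(name);
    }

    // Called once shutdown handling for the server has completed.
    void finishShutdownHandling(String name) {
        if (clearAfterShutdownHandling) {
            deadServers.remove(name); // the behavior questioned above
        }
    }

    // A report is accepted only if the exact serverName is not known dead.
    boolean acceptReport(String name) {
        return !deadServers.contains(name);
    }

    public static void main(String[] args) {
        String rs = serverName("rs1.example.org", 60020, 1291047674000L);
        for (boolean clear : new boolean[] {true, false}) {
            DeadServerSketch master = new DeadServerSketch(clear);
            master.markDead(rs);
            master.finishShutdownHandling(rs);
            System.out.println("clear=" + clear
                + " stale report accepted=" + master.acceptReport(rs));
        }
    }
}
```

With clearing enabled, the late report from the old instance is accepted; with entries retained, it is rejected, while a restarted RS with a new start code would have a different serverName and would still be accepted.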

> YouAreDeadException being swallowed in HRS getMaster()
> ------------------------------------------------------
>
>                 Key: HBASE-3280
>                 URL: https://issues.apache.org/jira/browse/HBASE-3280
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.0
>            Reporter: Jonathan Gray
>            Assignee: Jonathan Gray
>             Fix For: 0.90.0, 0.92.0
>
>
> In the HRS, when we lose our connection to the master, we enter a loop where we keep trying to get the new master location from ZK and attempt to send our heartbeat.  Within tryRegionServerReport() we could get a YouAreDeadException, but we don't let it propagate.  This leads to the RS continuously heartbeating in to the master although the master keeps telling it to kill itself.
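
The failure mode described in the issue can be sketched roughly as follows. This is an illustration under stated assumptions, not the HRS source: the Master interface and both report methods are hypothetical, YouAreDeadException is a local stand-in defined only to keep the example self-contained, and the maxAttempts bound exists only so the sketch terminates.

```java
import java.io.IOException;

// Hypothetical sketch of the retry loop, not the real HRegionServer code.
final class HeartbeatLoopSketch {
    // Local stand-in for the exception named in the issue.
    static class YouAreDeadException extends IOException {
        YouAreDeadException(String message) {
            super(message);
        }
    }

    // Hypothetical master-facing call; not the real RPC interface.
    interface Master {
        void regionServerReport(String serverName) throws IOException;
    }

    // Problematic shape: every IOException, including YouAreDeadException,
    // is treated as a transient failure and retried. Returns attempts made.
    static int reportSwallowing(Master master, String serverName, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                master.regionServerReport(serverName);
                return attempts;
            } catch (IOException e) {
                // Swallowed: the "you are dead" signal is lost here.
            }
        }
        return attempts;
    }

    // One possible fix: rethrow YouAreDeadException so the caller can abort,
    // and keep retrying only on other I/O failures.
    static int reportPropagating(Master master, String serverName, int maxAttempts)
            throws YouAreDeadException {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                master.regionServerReport(serverName);
                return attempts;
            } catch (YouAreDeadException e) {
                throw e; // fatal: the master has declared this server dead
            } catch (IOException e) {
                // Transient failure (for example, master failover): retry.
            }
        }
        return attempts;
    }
}
```

Catching the more specific exception before the general IOException is what keeps ordinary connection errors retryable while still letting the fatal signal reach the caller.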
