Posted to issues@hbase.apache.org by "Jonathan Gray (JIRA)" <ji...@apache.org> on 2010/11/29 16:55:11 UTC

[jira] Created: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

YouAreDeadException being swallowed in HRS getMaster()
------------------------------------------------------

                 Key: HBASE-3280
                 URL: https://issues.apache.org/jira/browse/HBASE-3280
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.90.0
            Reporter: Jonathan Gray
            Assignee: Jonathan Gray
             Fix For: 0.90.0, 0.92.0


In the HRS, when we lose our connection to the master, we enter a loop where we keep looking up the new master location in ZK and trying to send our heartbeat. Within tryRegionServerReport() we can get a YouAreDeadException, but we never let it propagate. This leads to the RS continuously heartbeating into the master even though the master keeps telling it to kill itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964799#action_12964799 ] 

Jonathan Gray commented on HBASE-3280:
--------------------------------------

Somewhat related to this, here is what happened on one of our clusters. The HRS got stuck in this loop, trying to reconnect to the master and ignoring the YouAreDeadExceptions. Once the master finished shutdown handling for that server, it removed the server from the dead server list. The RS then successfully heartbeated into the master, and the master treated it as a legitimate RS even though it had just finished shutting it down.

Is there a reason we should ever clear entries out of the dead server list? If this RS is in a network partition, it may not check back with the master for a long time, so shouldn't we always remember the dead serverNames (which include start codes)?
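The following is a minimal sketch of the "never forget" idea. It assumes a hypothetical master-side DeadServerRegistry class and server names in the form host,port,startcode. Because the start code is part of the key, a region server that genuinely restarts is still accepted. A stale process that kept heartbeating is rejected, even after shutdown handling completes. This is illustrative only and is not the HBase master's actual dead server list.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch: a master-side registry that never forgets a dead
 * server. Keys are "host,port,startcode", so a restarted region server
 * (new start code) is accepted while a zombie from before the restart
 * is rejected.
 */
class DeadServerRegistry {
  private final Set<String> deadServers =
      Collections.synchronizedSet(new HashSet<>());

  /** Builds a server name in the assumed host,port,startcode form. */
  static String serverName(String host, int port, long startCode) {
    return host + "," + port + "," + startCode;
  }

  /** Hypothetically called when the master starts shutdown handling. */
  void markDead(String serverName) {
    deadServers.add(serverName);
  }

  /** Entries are never removed, even after shutdown handling finishes. */
  boolean isDead(String serverName) {
    return deadServers.contains(serverName);
  }

  /** Returns true if a heartbeat from this server should be accepted. */
  boolean acceptReport(String serverName) {
    return !isDead(serverName);
  }

  public static void main(String[] args) {
    DeadServerRegistry registry = new DeadServerRegistry();
    String stale = serverName("rs1.example.com", 60020, 1000L);
    String restarted = serverName("rs1.example.com", 60020, 2000L);
    registry.markDead(stale);
    System.out.println(registry.acceptReport(stale));     // false
    System.out.println(registry.acceptReport(restarted)); // true
  }
}
```

The set only grows, which is the point of the proposal. A long-lived master would eventually need some bound on its size.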


[jira] Updated: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3280:
---------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed to trunk and branch.  Will be doing cluster testing with this patch soon.

Thanks for the review, stack. Opened HBASE-3282 for the dead server list issue.


[jira] Commented: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964860#action_12964860 ] 

stack commented on HBASE-3280:
------------------------------

On remembering dead servers (with start code) forever: yeah, that sounds about right. Could keep the servers in a soft-reference Map. Want to make a new issue?


[jira] Commented: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964858#action_12964858 ] 

stack commented on HBASE-3280:
------------------------------

+1


[jira] Updated: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3280:
-------------------------

    Status: Patch Available  (was: Open)


[jira] Updated: (HBASE-3280) YouAreDeadException being swallowed in HRS getMaster()

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3280:
---------------------------------

    Attachment: HBASE-3280-v1.patch

Properly catch and rethrow a YouAreDeadException received while heartbeating to the master.
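The following is a minimal sketch of the shape of this fix, not the committed patch. The names HeartbeatLoop and Master, and the tryRegionServerReport signature used here, are hypothetical. YouAreDeadException is a local stand-in for the HBase exception of the same name and is assumed here to extend IOException. Ordinary IOExceptions are treated as transient and retried against a freshly located master. YouAreDeadException is rethrown immediately so the caller can abort.

```java
import java.io.IOException;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: retry the heartbeat on ordinary connection errors,
 * but let YouAreDeadException escape instead of swallowing it.
 */
class HeartbeatLoop {
  /** Local stand-in for HBase's YouAreDeadException (an IOException). */
  static class YouAreDeadException extends IOException {
    YouAreDeadException(String msg) { super(msg); }
  }

  /** Hypothetical view of the master as seen by the region server. */
  interface Master {
    void regionServerReport(String serverName) throws IOException;
  }

  /**
   * Tries to report up to maxAttempts times. Returns true on success and
   * false if every attempt failed with an ordinary IOException. A
   * YouAreDeadException is rethrown at once and never retried.
   */
  static boolean tryRegionServerReport(Supplier<Master> locateMaster,
      String serverName, int maxAttempts) throws YouAreDeadException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Look up the (possibly new) master before each attempt.
        Master master = locateMaster.get();
        master.regionServerReport(serverName);
        return true;
      } catch (YouAreDeadException e) {
        // The master says we are dead: propagate instead of retrying.
        throw e;
      } catch (IOException e) {
        // Transient failure (master moved, connection refused): retry.
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // A master that has already processed this server's shutdown.
    Master master = name -> {
      throw new YouAreDeadException("server " + name + " is dead");
    };
    try {
      tryRegionServerReport(() -> master, "rs1,60020,1000", 10);
    } catch (YouAreDeadException e) {
      // A real region server would abort here.
      System.out.println("Aborting: " + e.getMessage());
    }
  }
}
```

Java requires the YouAreDeadException clause to come before the IOException clause. What matters for this bug is that the clause rethrows rather than falling through to another retry.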
