Posted to issues@hbase.apache.org by "Justin (JIRA)" <ji...@apache.org> on 2011/01/13 20:49:45 UTC

[jira] Created: (HBASE-3442) Master failing when node disconnects or dies

Master failing when node disconnects or dies
--------------------------------------------

Key: HBASE-3442
URL: https://issues.apache.org/jira/browse/HBASE-3442
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 0.90.0
Environment: CentOS 5, Hbase .90 RC3, Amazon EC2
Reporter: Justin
Priority: Minor

We've got our servers running on Amazon EC2, and nodes go through some shutdown scripts if/when we want to take them out of the mix. We ended up shutting down one of the nodes, in this case Node98, which caused the immediate crash of the master server. Upon restarting, the master would attempt to contact the missing node and then stop its startup process. I believe the node removed itself from the DNS server first, then ran a stop on the datanode and regionserver. The missing node was also removed from any slave/regionserver list on the master server. I finally put a bogus entry in the /etc/hosts file for the missing node, pointing it back to 127.0.0.1, and the master server finally marked it as a dead node, ignored it, and finished the startup process.

I'm going to try to replicate it and save some more logs. The following log is the only thing I saved from the first occurrence; it shows the master failing to start up while checking for the missing node: http://pastebin.com/ZyQMQm91

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3442) Master failing when node disconnects or dies

Posted by "Justin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984796#action_12984796 ] 

Justin commented on HBASE-3442:
-------------------------------

Went through the same process to replicate the issue, this time with HadoopNode99, which was also hosting the -META- table. I'm not 100% sure whether that was the case last time as well, but I'm going to go out on a limb and assume it was; I'll go through the process again on a server not hosting root or meta later today. I manually removed the node from the DNS server and made sure /etc/hosts was clear of any references to it. That alone didn't cause any issues, presumably because references to its IP are cached in -root-. Then I gave EC2 the go-ahead to kill the server. The server still went through its normal shutdown process of stopping the regionserver and datanode, then vanished into the nothingness of the cloud. I then received the following angry logs and web UI deaths, and the master kept passing out the hadoopnode99 host to all client servers:

http://pastebin.com/TjJcwqMq

Again, to revive the master, I put a faulty entry in /etc/hosts and restarted the master.
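
Since the /etc/hosts workaround suggests the master trips over a hostname it can no longer resolve, a quick pre-flight check before restarting the master may help confirm that. The following is only a minimal sketch under that assumption; the pastebin logs are not reproduced here, so the exact failure is not confirmed. The function name `unresolvable_hosts` and the host list are hypothetical, and this is not HBase code.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that the given resolver cannot map to an address.

    `resolve` defaults to the system resolver (DNS plus /etc/hosts); it is a
    parameter only so the check can be exercised without touching the network.
    """
    missing = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical host list: substitute the names the master still references.
    print(unresolvable_hosts(["hadoopnode99", "localhost"]))
```

Any name this reports is a candidate for the temporary /etc/hosts entry described above.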

