Posted to common-issues@hadoop.apache.org by "Chris Nauroth (JIRA)" <ji...@apache.org> on 2016/06/01 00:10:13 UTC

[jira] [Commented] (HADOOP-13219) NameNode Rpc Reader Thread crash, and cluster hang.

    [ https://issues.apache.org/jira/browse/HADOOP-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308929#comment-15308929 ] 

Chris Nauroth commented on HADOOP-13219:
----------------------------------------

Do you happen to know what kind of exception it was that caused the threads to crash?

Catching {{Throwable}} can be problematic.  Let's assume it was an {{OutOfMemoryError}}.  If there was a failure to allocate memory, and we catch the error and proceed, how can we know what state the process is in?  What if we made partial updates to in-memory state?  Since {{OutOfMemoryError}} can be thrown by nearly anything, we effectively have no idea what state we're in at that point.  For the NameNode, the inode tree might be left in an inconsistent state, with changes that are not reflected in the persistent store (the fsimage or edit log transactions).
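
To make the partial-update concern concrete, here is a minimal sketch.  It is a toy model, not NameNode code: the class {{PartialUpdateDemo}}, its fields, and its methods are all hypothetical, and the {{OutOfMemoryError}} is thrown explicitly to stand in for a real allocation failure landing between two steps of one logical update.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; none of these names come from Hadoop.
final class PartialUpdateDemo {
    // Two in-memory structures that are supposed to stay in agreement.
    static final Map<String, Long> sizeByPath = new HashMap<>();
    static final List<String> journal = new ArrayList<>();

    // Applies one logical update in two steps. failBetweenSteps simulates
    // an allocation failure that lands after step 1 but before step 2.
    static void addFile(String path, long size, boolean failBetweenSteps) {
        sizeByPath.put(path, size);                   // step 1: in-memory state
        if (failBetweenSteps) {
            throw new OutOfMemoryError("simulated");  // stand-in for a real OOM
        }
        journal.add("ADD " + path);                   // step 2: record to persist
    }

    // Mimics a worker loop body that catches Throwable and carries on.
    static void handle(String path, long size, boolean failBetweenSteps) {
        try {
            addFile(path, size, failBetweenSteps);
        } catch (Throwable t) {
            // Swallowed: the thread survives, but step 2 never ran.
        }
    }

    public static void main(String[] args) {
        handle("/a", 1L, false);
        handle("/b", 2L, true);
        // prints "in memory: 2, journaled: 1"
        System.out.println("in memory: " + sizeByPath.size()
            + ", journaled: " + journal.size());
    }
}
```

The thread keeps running after the second call, but the two structures now disagree and nothing downstream knows about it.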

There is already a catch of {{OutOfMemoryError}} at another layer in the RPC server.  It's a bit of code I disagree with.  Some of us choose to run the NameNode JVM with {{-XX:OnOutOfMemoryError}} set to a command to self-terminate.  That's a choice that favors correctness over robustness.
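
As a rough in-process analogue of that fail-fast choice, a thread can be given an uncaught-exception handler that treats anything escaping the loop as fatal instead of letting the thread die silently.  The sketch below is only an illustration under assumptions: {{FailFastReader}} and {{start}} are hypothetical names, not existing Hadoop code, and the fatal action is passed in as a callback so the example can run without killing the JVM.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch; class and method names are illustrative only.
final class FailFastReader {
    // Runs body on a new thread. Whatever escapes body uncaught (for example
    // an Error) is handed to onFatal instead of being silently lost.
    static Thread start(Runnable body, Consumer<Throwable> onFatal) {
        Thread t = new Thread(body, "reader-sketch");
        t.setUncaughtExceptionHandler((thread, err) -> onFatal.accept(err));
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        // The body throws a simulated OutOfMemoryError; the callback records it.
        Thread t = start(() -> { throw new OutOfMemoryError("simulated"); }, seen::set);
        t.join();
        System.out.println("fatal handler saw: " + seen.get());
    }
}
```

In a real deployment the callback would more plausibly log the error and then call {{Runtime.getRuntime().halt(1)}}, so that the process stops rather than continuing in an unknown state.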

> NameNode Rpc Reader Thread crash, and cluster hang.
> ---------------------------------------------------
>
>                 Key: HADOOP-13219
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13219
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: rpc-server
>    Affects Versions: 2.5.0, 2.6.0, 2.8.0, 2.7.2, 2.6.2, 2.6.4
>            Reporter: ChenFolin
>              Labels: patch
>         Attachments: HADOOP-13219-3.patch, HDFS-10472-2.patch, HDFS-10472.patch
>
>
> My cluster hung yesterday.
> Because the RPC server Reader threads crashed, all RPC requests timed out, including DataNode heartbeats.
> As we can see, the method doRunLoop catches only InterruptedException and IOException:
> while (running) {
>           SelectionKey key = null;
>           try {
>             // consume as many connections as currently queued to avoid
>             // unbridled acceptance of connections that starves the select
>             int size = pendingConnections.size();
>             for (int i=size; i>0; i--) {
>               Connection conn = pendingConnections.take();
>               conn.channel.register(readSelector, SelectionKey.OP_READ, conn);
>             }
>             readSelector.select();
>             Iterator<SelectionKey> iter = readSelector.selectedKeys().iterator();
>             while (iter.hasNext()) {
>               key = iter.next();
>               iter.remove();
>               if (key.isValid()) {
>                 if (key.isReadable()) {
>                   doRead(key);
>                 }
>               }
>               key = null;
>             }
>           } catch (InterruptedException e) {
>             if (running) {                      // unexpected -- log it
>               LOG.info(Thread.currentThread().getName() + " unexpectedly interrupted", e);
>             }
>           } catch (IOException ex) {
>             LOG.error("Error in Reader", ex);
>           } 
>         }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
