Posted to common-dev@hadoop.apache.org by "Zhe Zhang (JIRA)" <ji...@apache.org> on 2016/09/26 16:32:21 UTC

[jira] [Created] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive

Zhe Zhang created HADOOP-13657:
----------------------------------

             Summary: IPC Reader thread could silently die and leave NameNode unresponsive
                 Key: HADOOP-13657
                 URL: https://issues.apache.org/jira/browse/HADOOP-13657
             Project: Hadoop Common
          Issue Type: Bug
          Components: ipc
            Reporter: Zhe Zhang
            Priority: Critical


For each listening port, IPC {{Server#Listener#Reader}} is a single thread in charge of moving {{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}.

We experienced an incident in which the {{Reader}} thread of an HDFS NameNode died from a runtime exception. With nothing left to drain it, the {{pendingConnections}} queue filled up and the NameNode port became inaccessible.

In our particular case, what killed the {{Reader}} was an NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883. In general, though, any other runtime exception could cause the same problem.
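
To illustrate the failure mode, here is a minimal sketch, not the actual Hadoop code. {{DeadReaderSketch}} and its method are hypothetical names, strings stand in for {{Connection}} objects, and the NPE is simulated. It assumes a bounded queue of capacity 100 with a single consumer thread that dies on its first item:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for the Listener/Reader pair; not the Hadoop classes.
final class DeadReaderSketch {
    static final int CAPACITY = 100; // same capacity as pendingConnections

    /** Returns how many of `attempts` connections are accepted once the reader is dead. */
    static int acceptedAfterReaderDeath(int attempts) {
        BlockingQueue<String> pending = new ArrayBlockingQueue<>(CAPACITY);

        // The single reader takes one item, then dies from an unchecked exception.
        Thread reader = new Thread(() -> {
            try {
                pending.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            throw new NullPointerException("simulated NPE");
        }, "reader-sketch");
        // No-op handler so the thread dies silently in this sketch.
        reader.setUncaughtExceptionHandler((t, e) -> { });
        reader.start();

        try {
            pending.put("conn-0"); // triggers the simulated failure
            reader.join();         // the reader is now gone for good
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Nothing drains the queue any more: it accepts CAPACITY items, then rejects.
        int accepted = 0;
        for (int i = 1; i <= attempts; i++) {
            if (pending.offer("conn-" + i)) {
                accepted++;
            }
        }
        return accepted;
    }
}
```

Calling {{DeadReaderSketch.acceptedAfterReaderDeath(150)}} returns 100: the queue absorbs connections up to its capacity and every later one is turned away, while nothing in the process signals that the consumer is gone.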

We should add logic to either make the {{Reader}} more robust against runtime exceptions, or at least treat such an exception as FATAL so that the NameNode can fail over to the standby and admins are alerted to the real issue.
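
Both options could look roughly like the sketch below. This is an illustration under assumptions rather than a proposed patch: {{RobustReaderSketch}}, {{startReader}}, {{demo}} and the {{onFatal}} callback are hypothetical names, and strings again stand in for {{Connection}} objects. The loop survives a {{RuntimeException}} thrown while handling one item, and anything that still escapes the loop (an {{Error}}, for example) is routed to {{onFatal}}, which in a real server might terminate the process so that failover can happen.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from the Hadoop code base.
final class RobustReaderSketch {

    /** Starts a reader that keeps draining `pending` even if `handler` throws. */
    static Thread startReader(BlockingQueue<String> pending,
                              Consumer<String> handler,
                              AtomicInteger failures,
                              Runnable onFatal) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    handler.accept(pending.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // normal shutdown path
                } catch (RuntimeException e) {
                    // Option 1: count (and, in real code, log) the failure, keep going.
                    failures.incrementAndGet();
                }
            }
        }, "robust-reader-sketch");
        t.setDaemon(true);
        // Option 2: whatever still kills the thread is treated as fatal.
        t.setUncaughtExceptionHandler((thread, err) -> onFatal.run());
        t.start();
        return t;
    }

    /** Returns {handled, failures, readerStillAlive ? 1 : 0} for one bad and two good items. */
    static int[] demo() {
        BlockingQueue<String> pending = new ArrayBlockingQueue<>(100);
        AtomicInteger handled = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(2);

        Thread reader = startReader(pending, conn -> {
            if (conn.equals("bad")) {
                throw new NullPointerException("simulated NPE");
            }
            handled.incrementAndGet();
            done.countDown();
        }, failures, () -> { });

        pending.add("bad");
        pending.add("ok-1");
        pending.add("ok-2");
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        boolean alive = reader.isAlive();
        reader.interrupt(); // stop the sketch reader
        return new int[] { handled.get(), failures.get(), alive ? 1 : 0 };
    }
}
```

In {{demo()}}, the first item throws, the failure is counted, and the two items queued behind it are still handled by the same thread. Whether swallowing the exception or failing fast is the better default depends on whether the server can be trusted to be in a sane state after the exception.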



