Posted to yarn-dev@hadoop.apache.org by "Junping Du (JIRA)" <ji...@apache.org> on 2016/06/08 16:43:20 UTC
[jira] [Created] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Junping Du created YARN-5214:
--------------------------------
Summary: Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Key: YARN-5214
URL: https://issues.apache.org/jira/browse/YARN-5214
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
In one cluster, we noticed that the NM's heartbeats to the RM suddenly stopped, and after a while the RM marked the node LOST. From the log, the NM daemon was still running, but a jstack shows the NM's NodeStatusUpdater thread is blocked:
1. The Node Status Updater thread is blocked, waiting on lock 0x000000008065eae8:
{noformat}
"Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
- waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
at java.lang.Thread.run(Thread.java:745)
{noformat}
2. The actual holder of this lock is the DiskHealthMonitor-Timer thread:
{noformat}
"DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
java.lang.Thread.State: RUNNABLE
at java.io.UnixFileSystem.createDirectory(Native Method)
at java.io.File.mkdir(File.java:1316)
at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
- locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
{noformat}
This disk operation can take much longer than expected, especially under high I/O load, so we should use finer-grained locking for the related operations here.
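To make the contention concrete, here is a minimal sketch of the pattern behind the two stack traces above (a simplified, hypothetical class, not the actual Hadoop source): both methods synchronize on the same DirectoryCollection monitor, so slow disk I/O in checkDirs holds the lock that getFailedDirs, and thus the heartbeat path, needs.
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

// Simplified sketch of the current locking pattern (hypothetical class
// name; not the actual source): both methods synchronize on the same
// monitor, so slow disk I/O in checkDirs() blocks getFailedDirs().
class DirectoryCollectionSketch {
  private final List<String> localDirs = new ArrayList<>();
  private final List<String> errorDirs = new ArrayList<>();

  // The heartbeat path (NodeStatusUpdater) calls this frequently ...
  synchronized List<String> getFailedDirs() {
    return Collections.unmodifiableList(new ArrayList<>(errorDirs));
  }

  // ... while DiskHealthMonitor-Timer can sit here for a long time,
  // because the mkdir-based checks run inside the synchronized block.
  synchronized void checkDirs() throws DiskErrorException {
    for (String dir : localDirs) {
      DiskChecker.checkDir(new File(dir)); // slow disk I/O under the lock
    }
  }
}
{code}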
The same issue was raised and fixed on the HDFS side in HDFS-7489, and we should probably apply a similar fix here.
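As an illustration of the direction only (a sketch assuming a ReentrantReadWriteLock, not a proposed patch): snapshot the dir list under a short lock, run the slow disk checks with no lock held, then take a brief write lock just to publish the results, so getFailedDirs never waits behind disk I/O.
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

// Sketch of a finer-grained scheme (hypothetical, not the committed
// patch): disk checks run outside any lock; the lock only guards the
// brief snapshot and publish steps.
class DirectoryCollectionSketch {
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();
  private List<String> localDirs = new ArrayList<>();
  private List<String> errorDirs = new ArrayList<>();

  List<String> getFailedDirs() {
    rwLock.readLock().lock();
    try {
      // Readers only wait for the brief publish step, never for disk I/O.
      return Collections.unmodifiableList(new ArrayList<>(errorDirs));
    } finally {
      rwLock.readLock().unlock();
    }
  }

  void checkDirs() {
    List<String> snapshot;
    rwLock.readLock().lock();
    try {
      snapshot = new ArrayList<>(localDirs); // cheap copy under the lock
    } finally {
      rwLock.readLock().unlock();
    }

    List<String> failed = new ArrayList<>();
    for (String dir : snapshot) {            // slow disk I/O, no lock held
      try {
        DiskChecker.checkDir(new File(dir));
      } catch (DiskErrorException e) {
        failed.add(dir);
      }
    }

    rwLock.writeLock().lock();               // brief write lock to publish
    try {
      errorDirs = failed;
    } finally {
      rwLock.writeLock().unlock();
    }
  }
}
{code}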