Posted to hdfs-user@hadoop.apache.org by Anfernee Xu <an...@gmail.com> on 2014/04/08 02:11:04 UTC

Container is hung and never gets killed

Hi,

I'm running a Hadoop cluster on the 2.2.0 release and I've hit an issue
where one container is never killed or cleaned up. We set
mapreduce.task.timeout to 4 hours, and due to a bug in a 3rd-party library
the task gets stuck and cannot make any progress. I was hoping the
NodeManager would kill the container and release whatever resources it is
holding, but the container is still there and the job still exists. Below
are the log snippets:

BTW, all jobs are uberized.
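
For reference, this is roughly how the job is configured; the driver class
and values below are an illustrative sketch (the property names are the
standard MR2 keys), not our actual code:
===============================

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 4-hour task timeout in milliseconds (14400 secs, matching the
    // "Timed out after 14400 secs" message in the AM log below).
    conf.setLong("mapreduce.task.timeout", 4L * 60 * 60 * 1000);

    // Run the job uberized, i.e. all tasks execute inside the AM container.
    conf.setBoolean("mapreduce.job.ubertask.enable", true);

    Job job = Job.getInstance(conf, "timeout-sketch");
    // ... mapper/reducer classes and input/output paths omitted ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}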

Task log:
===============================

2014-04-07 11:43:20,407 INFO [Main Thread]
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for
application appattempt_1396648506811_47889_000001
2014-04-07 11:43:20,514 WARN [Main Thread]
org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final
parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2014-04-07 11:43:20,514 WARN [Main Thread]
org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final
parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2014-04-07 11:43:20,647 INFO [Main Thread]
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2014-04-07 11:43:20,647 INFO [Main Thread]
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN,
Service: , Ident:
(org.apache.hadoop.yarn.security.AMRMTokenIdentifier@11f5ecc8)
2014-04-07 11:43:20,695 INFO [Main Thread]
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: The specific max attempts:
2 for application: 47889. Attempt num: 1 is last retry: false
2014-04-07 11:43:20,700 INFO [Main Thread]
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred
newApiCommitter.

.....


2014-04-07 15:43:53,448 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1396648506811_47889_m_000000_0:
AttemptID:attempt_1396648506811_47889_m_000000_0 Timed out after 14400 secs
2014-04-07 15:43:53,450 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1396648506811_47889_m_000000_0 TaskAttempt Transitioned from
RUNNING to FAIL_CONTAINER_CLEANUP

....
2014-04-07 23:14:23,340 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:26,340 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:29,340 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:32,340 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:35,340 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:38,341 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:41,341 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0
2014-04-07 23:14:44,341 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1396648506811_47889_m_000000_0

So you can see the task timed out and the attempt transitioned from RUNNING
to FAIL_CONTAINER_CLEANUP, yet "Ping from
attempt_1396648506811_47889_m_000000_0" keeps showing up for several more
hours. It looks like the MRAppMaster still thinks the task is alive rather
than timed out?
Below is NM log:
=============================

2014-04-07 11:43:18,883 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1396648506811_47889_01_000001 transitioned from
LOCALIZING to LOCALIZED
2014-04-07 11:43:18,894 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1396648506811_47889_01_000001 transitioned from
LOCALIZED to RUNNING
2014-04-07 11:43:18,898 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
launchContainer: [nice, -n, 0, bash,
/scratch/tie/software/hadoop2/hadoop-ic/temp/nm-local-dir/usercache/tie/appcache/application_1396648506811_47889/container_1396648506811_47889_01_000001/default_container_executor.sh]
2014-04-07 11:43:19,134 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 11:43:20,136 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 11:43:20,178 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Starting resource-monitoring for container_1396648506811_47889_01_000001
2014-04-07 11:43:20,188 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 4795 for container-id
container_1396648506811_47889_01_000001: 52.1 MB of 2 GB physical memory
used; 3.3 GB of 4.2 GB virtual memory used

..............................
#timeout happens
2014-04-07 15:43:52,925 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:53,926 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:54,405 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 4795 for container-id
container_1396648506811_47889_01_000001: 153.7 MB of 2 GB physical memory
used; 3.3 GB of 4.2 GB virtual memory used
2014-04-07 15:43:54,927 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:55,928 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:56,929 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:57,415 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 4795 for container-id
container_1396648506811_47889_01_000001: 153.7 MB of 2 GB physical memory
used; 3.3 GB of 4.2 GB virtual memory used
2014-04-07 15:43:57,930 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:58,931 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:43:59,932 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-04-07 15:44:00,426 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 4795 for container-id
container_1396648506811_47889_01_000001: 153.7 MB of 2 GB physical memory
used; 3.3 GB of 4.2 GB virtual memory used
2014-04-07 15:44:00,933 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 47889 cluster_timestamp: 1396648506811 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000


I do not see any message about an attempt to kill the container. Is this a
known issue? How can we work around it?
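
(The only stopgap I can think of is killing the stuck application by hand,
e.g. "yarn application -kill application_1396648506811_47889" from the
command line, but that obviously defeats the purpose of
mapreduce.task.timeout.)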

Thanks for your help

-- 
--Anfernee