Posted to yarn-dev@hadoop.apache.org by Noah Watkins <no...@inktank.com> on 2014/10/22 20:21:39 UTC

Finding the source of an unclean process shutdown

I am running Hadoop with a non-HDFS file system backend (Ceph), and
have noticed that some processes exit or are killed before the file
system client is properly shut down (i.e. before FileSystem.close()
completes). Clean shutdowns matter to us right now because they
release resources that, when not cleaned up, lead to fs timeouts that
slow every other client down. We've adjusted the YARN timeout that
controls the delay before SIGKILL is sent to containers, which
resolves the problem for containers running map tasks, but there is
one instance of an unclean shutdown that I'm having trouble tracking
down.
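
For reference, I believe the setting we bumped is the NodeManager's
sleep-delay-before-sigkill property; something along these lines in
yarn-site.xml (the value here is just an illustration, not a
recommendation):

  <property>
    <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
    <value>30000</value>
  </property>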

Based on the file system trace of this unknown process, it appears to
be the AppMaster or some other manager process. In particular, it
examines all of the files related to the job (e.g. the teragen output
for each map task, such as
/in-dir/_temporary/1/task_1413987694759_0002_m_000018/part-m-00018),
and the very last set of operations removes many configuration files,
jar files, and directories, and finally the job staging directory
itself (i.e.
/tmp/hadoop-yarn/staging/hadoop/.staging/job_1413987694759_0002).
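
In case it makes the trace easier to interpret, here is a minimal
Java sketch of the teardown sequence I think I'm seeing. This is not
the actual Hadoop code, just my reading of the trace: a recursive
delete of the staging directory, followed by a FileSystem close that
never gets to run if the process exits or is killed right after the
delete.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingCleanupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // With the default fs.automatic.close=true, FileSystem registers a
        // JVM shutdown hook that calls closeAll(); a SIGKILL (or an abrupt
        // exit) skips it, which looks like the unclean shutdown here.
        FileSystem fs = FileSystem.get(conf);

        // The last operation visible in the trace: recursive removal of
        // the job's staging directory.
        Path staging = new Path(
            "/tmp/hadoop-yarn/staging/hadoop/.staging/job_1413987694759_0002");
        fs.delete(staging, true);

        // Closing the client is what releases the Ceph resources; if the
        // process is killed right after the delete above, this never runs.
        fs.close();
    }
}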

So the first question is: based on this behavior, what process is
this? (The full trace is here: http://pastebin.com/SVCfRfA4)

After that final job directory is removed, the fs trace is truncated,
suggesting the process immediately exited or was killed.

So the second question is: given which process this is (e.g. the app
master), what might be causing the unclean shutdown, and is there a
way to control it?

Thanks,
Noah