Posted to mapreduce-dev@hadoop.apache.org by Noah Watkins <no...@inktank.com> on 2014/10/22 22:04:03 UTC

Finding source of unclean app master shutdown

I apologize for duplicating this from yarn-dev. I realized later that
it probably is more related to MR.

I am running MR with a non-HDFS file system backend (Ceph), and have
noticed that some processes exit or are killed before the file system
client is properly shut down (i.e. before FileSystem::close
completes). We need clean shutdowns because closing releases resources
that, when left behind, lead to fs timeouts that slow every other
client down. We've adjusted the yarn timeout controlling the delay
before SIGKILL is sent to containers, which resolves the problem for
containers running map tasks, but there is one instance of an unclean
shutdown that I'm having trouble tracking down.
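
For concreteness, the knob we raised is (if I have the name right) the
NodeManager's delay between sending SIGTERM and SIGKILL to a container,
yarn.nodemanager.sleep-delay-before-sigkill.ms, e.g. in yarn-site.xml:

  <property>
    <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
    <!-- default is 250 (ms); give the Ceph client time to close cleanly -->
    <value>30000</value>
  </property>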

Based on the file system trace of this unknown process, it appears to
be the AppMaster or some other manager process. In particular it stats
all of the files related to the job, then at the end removes many of
the configuration files and the COMMIT_SUCCESS file, and finally
removes the job staging directory, which seems to match the behavior
of the AppMaster.
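
For anyone trying to match the trace against the code, my (possibly
wrong) reading is that the AM's teardown of the staging directory
boils down to roughly the following. This is an illustrative sketch,
not the actual MRAppMaster code:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.MRJobConfig;

  public class StagingCleanupSketch {
    // Sketch: delete the per-job staging dir at job completion. The final
    // recursive delete is what I think shows up at the end of the fs trace.
    static void cleanupStagingDir(Configuration conf) throws Exception {
      // MRJobConfig.MAPREDUCE_JOB_DIR should name the per-job staging dir
      String stagingDir = conf.get(MRJobConfig.MAPREDUCE_JOB_DIR);
      if (stagingDir == null) {
        return;
      }
      Path jobTempDir = new Path(stagingDir);
      FileSystem fs = jobTempDir.getFileSystem(conf);
      // recursive delete of the whole staging dir
      fs.delete(jobTempDir, true);
    }
  }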

So the first question is: am I actually seeing the behavior of the
AppMaster? (The full trace is here: http://pastebin.com/SVCfRfA4.)

After that final removal of the job staging directory the fs trace
ends abruptly, suggesting the process exited immediately or was killed.

So the second question is: if this is the AppMaster, what might be
causing the unclean fs shutdown, and is there a way to control it?

I noticed that MRAppMaster::main contains
`conf.setBoolean("fs.automatic.close", false);`, which disables the
automatic close-on-exit shutdown hook, but I cannot find anywhere that
close is then called explicitly on the file systems.
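
One workaround I'm considering is registering a shutdown hook of my
own that closes the cached file systems before the JVM exits. A
minimal sketch (assuming FileSystem.closeAll() is safe to call from a
hook, and that the process lives long enough to run its hooks at all):

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.util.ShutdownHookManager;

  public class FsCloseHook {
    // Priority relative to other Hadoop shutdown hooks; 10 is arbitrary here.
    private static final int PRIORITY = 10;

    public static void register() {
      ShutdownHookManager.get().addShutdownHook(new Runnable() {
        @Override
        public void run() {
          try {
            // Close every cached FileSystem (including the Ceph client) so it
            // releases its resources instead of relying on fs.automatic.close.
            FileSystem.closeAll();
          } catch (IOException e) {
            // Too late in shutdown to do anything useful; just report it.
            e.printStackTrace();
          }
        }
      }, PRIORITY);
    }
  }

But that only helps if the process actually gets a chance to run its
shutdown hooks, which brings me back to what is killing it in the
first place.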


Thanks,
Noah