Posted to user@hadoop.apache.org by Chris Curtin <cu...@gmail.com> on 2012/08/21 15:09:01 UTC

Cleaning up job and task tracker files on disk

Hi,

We recently moved to a MapR cluster, but we have seen this issue on our
'stock' Apache Hadoop cluster, then on Cloudera, and now on MapR, so I think
it is a core Hadoop issue.

We've implemented a cron job that uses find to remove files left behind on
the nodes, in particular under <mapred.local.dir> on each of the task
nodes.

Looking through the leftover files, the cluster does not appear to be
cleaning up after itself very well. We looked through the configuration
files and searched online, but we can't find the settings that control
cleanup of these directories. <mapred.local.dir>/taskTracker/hadoop/jobcache
had over 100,000 directories in it the last time we ran out of space.

Should the cluster be cleaning up these directories? If so, what parameters
control how long the files stick around? If not, does anyone have a 'best
practices' script or rules for what to trim? We're running some basic cron
jobs that remove files older than 2 days and empty directories older than
6 days, but operations (rightly) wants to know why the cluster isn't taking
care of this.
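For reference, the two cron rules above can be sketched in Python as follows. This is a minimal sketch, assuming both thresholds are in days and that "older than" means the last modification time; the function name clean_tree and its dry_run flag are hypothetical and not part of Hadoop or MapR. It knows nothing about which jobs are still running, so it only mirrors the age-based rules.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def clean_tree(root, file_max_age_days=2, dir_max_age_days=6,
               now=None, dry_run=True):
    """Remove stale files and empty directories below root (never root itself).

    Returns (files, dirs): the paths removed, or that would be removed when
    dry_run is True. In dry-run mode only directories that are already empty
    are reported, because their stale files have not actually been deleted.
    """
    now = time.time() if now is None else now
    file_cutoff = now - file_max_age_days * SECONDS_PER_DAY
    dir_cutoff = now - dir_max_age_days * SECONDS_PER_DAY
    files, dirs = [], []
    # Walk bottom-up so child directories are handled before their parents.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime < file_cutoff:
                files.append(path)
                if not dry_run:
                    os.remove(path)
        if dirpath == root:
            continue
        # Deleting entries updates a directory's mtime, so a directory that
        # was just emptied is only removed on a later run, once it has been
        # untouched for dir_max_age_days.
        if not os.listdir(dirpath) and os.lstat(dirpath).st_mtime < dir_cutoff:
            dirs.append(dirpath)
            if not dry_run:
                os.rmdir(dirpath)
    return files, dirs
```

Calling clean_tree on <mapred.local.dir>/taskTracker/hadoop/jobcache with the default dry_run=True only reports candidates; passing dry_run=False actually deletes them. The mtime behaviour noted in the comment matches what a find-based cron with -mtime and -empty would see.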

We found a number of places where the cluster creates files. Some are
managed correctly; most are not. Here is what we've found:

<mapred.local.dir>/jobTracker - running jobs; the files are removed (moved?)
when the job completes.

<mapred.local.dir>/taskTracker/hadoop/jobcache - current tasks, but old
files for tasks that completed long ago are still here: more than 100,000
directories, even though we set mapred.jobtracker.retiredjobs.cache.size
to 1000.

<mapred.local.dir>/toBeDeleted/<date>/hadoop/jobcache - old jobs, moved
here when the node was restarted? When are they deleted?

<mapred.local.dir>/ttprivate/tasktracker/hadoop/jobcache - currently
running jobs? These appear to be cleaned up properly.

HADOOP_LOG_DIR - the job_*.xml files belong to currently running jobs
(they are orphaned if the cluster crashes).

HADOOP_LOG_DIR/history - what are these? They look like job files. When
are they removed?

HADOOP_LOG_DIR/userlogs - one directory per job, with a child directory
per attempt. Never cleaned up? There were more than 9000 job directories
less than 24 hours after we last purged it.
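To see how quickly each of these locations grows between purges, a small inventory along the following lines could be run on each node. This is a minimal Python sketch, assuming the relative paths listed above (including "hadoop" as the user name in the jobcache paths); the LOCATIONS table and the count_subdirs and inventory functions are hypothetical helpers, not Hadoop or MapR tools. It only counts the immediate subdirectories in each location.

```python
import os

# Hypothetical table of the locations listed above, relative to either
# <mapred.local.dir> ("local") or HADOOP_LOG_DIR ("log"). The "hadoop"
# path component is assumed to be the user that runs the jobs.
LOCATIONS = [
    ("local", "jobTracker"),
    ("local", os.path.join("taskTracker", "hadoop", "jobcache")),
    ("local", "toBeDeleted"),
    ("local", os.path.join("ttprivate", "tasktracker", "hadoop", "jobcache")),
    ("log", "history"),
    ("log", "userlogs"),
]


def count_subdirs(path):
    """Count immediate subdirectories of path; 0 if path does not exist."""
    if not os.path.isdir(path):
        return 0
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def inventory(local_dir, log_dir):
    """Map each full location path to its number of immediate subdirectories."""
    bases = {"local": local_dir, "log": log_dir}
    result = {}
    for kind, rel in LOCATIONS:
        path = os.path.join(bases[kind], rel)
        result[path] = count_subdirs(path)
    return result
```

Calling inventory with the node's <mapred.local.dir> and HADOOP_LOG_DIR (for example /nfs/mapr/hadoop/logs) from cron and logging the result would show which of these directories keeps growing, such as the jobcache and userlogs counts mentioned above.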

mapred-site.xml settings:

<name>mapred.jobtracker.retiredjobs.cache.size</name> <value>1000</value>

Relevant log file settings:

hadoop-env.sh:export HADOOP_LOG_DIR="/nfs/mapr/hadoop/logs"
taskcontroller.cfg:hadoop.log.dir=/nfs/mapr/hadoop/logs

Thanks,

Chris