Posted to common-user@hadoop.apache.org by Kevin Tse <ke...@gmail.com> on 2010/06/08 12:51:20 UTC

Questions about mapred.local.dir

Hi, everyone.
I am running a Hadoop-0.19.2 cluster of 4 Linux boxes. Unfortunately, for the
moment we don't have much available disk space on these 4 nodes: 1.5 TB in
total, spread over 8 disks, 2 per node, and the available space on these
disks is not equal.

I have the configuration below.
I tried to run a job that would write an estimated 1.2 TB of intermediate
data. I thought that with this configuration no disk's free space would drop
below 2 GB, but that was not true: two of the disks dropped to 0 free space
before the job finished, so I had to kill the job and free the space for
other applications running on those machines.

We are going to buy 3*1TB disks, which may solve the problem, but I still
want to know how to set the following 3 properties properly.

<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
<property>
  <name>mapred.local.dir.minspacestart</name>
  <value>4096000000</value>
</property>
<property>
  <name>mapred.local.dir.minspacekill</name>
  <value>2048000000</value>
</property>
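
My current understanding (which I have not verified against the 0.19
TaskTracker source, so please correct me if I'm wrong) is that both minspace
values are in bytes, i.e. roughly 4.1 GB and 2.0 GB above, and that the
free-space check considers the configured mapred.local.dir directories
together rather than enforcing the threshold on each disk individually, which
might explain why two of the disks still filled up. If that's right, a more
conservative setting would look like the sketch below; the 8 GiB / 4 GiB
figures are only placeholders, not recommendations:

<!-- Sketch only, under the assumption above: values in bytes (8 GiB / 4 GiB) -->
<property>
  <name>mapred.local.dir.minspacestart</name>
  <value>8589934592</value>
</property>
<property>
  <name>mapred.local.dir.minspacekill</name>
  <value>4294967296</value>
</property>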

And there's another problem: while my MR job is being executed on the Hadoop
cluster, each tasktracker logs many INFO messages like this:

2010-06-07 16:51:20,721 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_201006071648_0002/attempt_201006071648_0002_r_000001_0/output/file.out
in any of the configured local directories

I don't know whether this is harmless, but it seems to be, because my MR job
completed successfully.

And another question: is it possible to make the reduces start running before
all the maps complete?
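
For context on that last question: I came across a property that seems to
control when reducers are scheduled relative to map completion, but I haven't
confirmed that it exists in 0.19.2 or that the name and value below are
correct, so please treat this purely as a sketch (0.50 would mean reducers
start once 50% of the maps are done):

<!-- Assumption: property name and semantics unverified for 0.19.2 -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.50</value>
</property>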

Thank you in advance.
Kevin Tse