Posted to common-user@hadoop.apache.org by Sudhir Vallamkondu <su...@gmail.com> on 2010/08/30 19:30:13 UTC

Multiple dirs in mapred.local.dir property

We are testing map-reduce performance on our cluster by specifying one
dir vs. multiple dirs in the property “mapred.local.dir”. The documentation
for this property says “The local directory where MapReduce stores
intermediate data files. May be a comma-separated list of directories
on different devices in order to spread disk i/o”. So I was expecting
a performance boost when specifying two local dirs instead of one for
“mapred.local.dir”. We ran a sort test and saw the opposite.

- One dir config: “mapred.local.dir” is left at its default,
  ${hadoop.tmp.dir}/mapred/local, where “hadoop.tmp.dir” is set to
  "/var/lib/hadoop-0.20/cache/${user.name}"

- Two dir config: “mapred.local.dir” is set explicitly in mapred-site.xml
   <property>
       <name>mapred.local.dir</name>
       <value>/data1/hadoop/mapred/local,/data2/hadoop/mapred/local</value>
   </property>

/data1 and /data2 are separate drives on each node of the Hadoop cluster:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             330G  128G  186G  41% /
/dev/sda1             190M   18M  163M  10% /boot
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/sdb1             2.2T  442G  1.8T  20% /data1
/dev/sdc1             2.2T  484G  1.8T  22% /data2
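
As a sanity check that the configured local dirs really sit on different devices, something like the following Python sketch could be run on each node. The helper names are made up for illustration, and it only compares the device IDs reported by os.stat; it says nothing about shared controllers or RAID layout.

```python
import os

def split_local_dirs(value):
    """Split a comma-separated mapred.local.dir value into clean paths."""
    return [p.strip() for p in value.split(",") if p.strip()]

def devices_for(paths):
    """Map each existing path to the ID of the device that holds it."""
    return {p: os.stat(p).st_dev for p in paths if os.path.exists(p)}

def on_distinct_devices(paths):
    """True only if every path exists and no two paths share a device."""
    devs = devices_for(paths)
    return len(devs) == len(paths) and len(set(devs.values())) == len(paths)

if __name__ == "__main__":
    # Value copied from the mapred-site.xml snippet in this post.
    dirs = split_local_dirs("/data1/hadoop/mapred/local,/data2/hadoop/mapred/local")
    print(devices_for(dirs))
    print(on_distinct_devices(dirs))
```

A result of False means at least one dir is missing or two of them share a device, in which case listing both would not spread disk i/o.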

---------- Sort test with one vs. two dirs ----------

1. For 40 GB of data:

a. Results when one dir was specified in “mapred.local.dir”:
Time taken for random-data generation: 39 min 3 sec
Time taken for random-data sort: 1 hr 8 min 5 sec
Time taken for sorted-data validation: 3 min 22 sec

b. Results when two dirs were specified in “mapred.local.dir”:
Time taken for random-data generation: 36 min 51 sec
Time taken for random-data sort: 1 hr 38 min 33 sec
Time taken for sorted-data validation: 3 min 31 sec

2. For 100 GB of data:

a. Results when one dir was specified in “mapred.local.dir”:
Time taken for random-data generation: 1 hr 33 min 28 sec
Time taken for random-data sort: 3 hr 50 min 39 sec
Time taken for sorted-data validation: 7 min 33 sec

b. Results when two dirs were specified in “mapred.local.dir”:
Time taken for random-data generation: 1 hr 27 min 17 sec
Time taken for random-data sort: 6 hr 35 min 20 sec
Time taken for sorted-data validation: 8 min 52 sec
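
To put a number on the sort slowdown, the timings above can be converted to seconds and compared. This is just a small Python sketch; the function and variable names are my own, and the durations are copied from the results listed above.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an (hours, minutes, seconds) duration to total seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Sort durations as (hours, minutes, seconds): (one dir, two dirs).
SORT_TIMES = {
    "40 GB": ((1, 8, 5), (1, 38, 33)),
    "100 GB": ((3, 50, 39), (6, 35, 20)),
}

def slowdown(one_dir, two_dirs):
    """Ratio of the two-dir sort time to the one-dir sort time."""
    return to_seconds(*two_dirs) / to_seconds(*one_dir)

for size, (one_dir, two_dirs) in SORT_TIMES.items():
    print(f"{size}: two-dir sort took {slowdown(one_dir, two_dirs):.2f}x as long")
```

This prints a ratio of 1.45 for 40 GB and 1.71 for 100 GB.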

The random-data generation time improved slightly with two dirs;
however, the sort job (which is also I/O intensive) took about 45%
longer for 40 GB and about 71% longer for 100 GB. Any reason why this
is happening?