Posted to common-user@hadoop.apache.org by Sudhir Vallamkondu <su...@gmail.com> on 2010/08/30 19:30:13 UTC
Multiple dirs in mapred.local.dir property
We are testing our cluster's MapReduce performance with one dir
vs multiple dirs in the “mapred.local.dir” property. The documentation
for this property says “The local directory where MapReduce stores
intermediate data files. May be a comma-separated list of directories
on different devices in order to spread disk i/o”. So I was expecting
a performance boost from specifying two local dirs instead of one for
“mapred.local.dir”. We ran a sort test and saw the opposite.
- One dir config defaults to ${hadoop.tmp.dir}/mapred/local.
“hadoop.tmp.dir” is set to "/var/lib/hadoop-0.20/cache/${user.name}"
- Two dir config explicitly sets the “mapred.local.dir” in mapred-site.xml
<property>
  <name>mapred.local.dir</name>
  <value>/data1/hadoop/mapred/local,/data2/hadoop/mapred/local</value>
</property>
/data1 and /data2 are separate drives on each box in the Hadoop cluster.
$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda3   330G  128G  186G   41%   /
/dev/sda1   190M  18M   163M   10%   /boot
tmpfs       7.8G  0     7.8G   0%    /dev/shm
/dev/sdb1   2.2T  442G  1.8T   20%   /data1
/dev/sdc1   2.2T  484G  1.8T   22%   /data2
---------- Sort test with one vs two dirs ----------
1. For 40 GB of data:
   a. Results when one dir was specified in “mapred.local.dir”:
      Time taken for random-data generation : 39mins, 3sec
      Time taken for random-data Sort       : 1hrs, 8mins, 5sec
      Time taken for sorted-data Validation : 3mins, 22sec
   b. Results when two dirs were specified in “mapred.local.dir”:
      Time taken for random-data generation : 36mins, 51sec
      Time taken for random-data Sort       : 1hrs, 38mins, 33sec
      Time taken for sorted-data Validation : 3mins, 31sec
2. For 100 GB of data:
   a. Results when one dir was specified in “mapred.local.dir”:
      Time taken for random-data generation : 1hrs, 33mins, 28sec
      Time taken for random-data Sort       : 3hrs, 50mins, 39sec
      Time taken for sorted-data Validation : 7mins, 33sec
   b. Results when two dirs were specified in “mapred.local.dir”:
      Time taken for random-data generation : 1hrs, 27mins, 17sec
      Time taken for random-data Sort       : 6hrs, 35mins, 20sec
      Time taken for sorted-data Validation : 8mins, 52sec
Random data generation was slightly faster with two dirs (about 6% in
both tests). However, the sort job, which is also I/O intensive, took
much longer: about 45% longer for 40 GB and about 71% longer for
100 GB. Any reason why this is happening?
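
To put numbers on the comparison, here is a small sketch that converts the timings reported above to seconds and prints the relative change from the one-dir run to the two-dir run. The names TIMINGS, to_seconds and percent_change are made up for this illustration; only the timing values come from the test results.

```python
# Timings copied from the sort-test results above, as (hours, minutes, seconds).
# Each entry is (one-dir run, two-dir run).
TIMINGS = {
    "40 GB": {
        "generation": ((0, 39, 3), (0, 36, 51)),
        "sort": ((1, 8, 5), (1, 38, 33)),
        "validation": ((0, 3, 22), (0, 3, 31)),
    },
    "100 GB": {
        "generation": ((1, 33, 28), (1, 27, 17)),
        "sort": ((3, 50, 39), (6, 35, 20)),
        "validation": ((0, 7, 33), (0, 8, 52)),
    },
}


def to_seconds(hms):
    """Convert an (hours, minutes, seconds) tuple to total seconds."""
    hours, minutes, seconds = hms
    return hours * 3600 + minutes * 60 + seconds


def percent_change(one_dir, two_dirs):
    """Relative change in percent; positive means the two-dir run was slower."""
    base = to_seconds(one_dir)
    return (to_seconds(two_dirs) - base) / base * 100.0


for size, phases in TIMINGS.items():
    for phase, (one_dir, two_dirs) in phases.items():
        change = percent_change(one_dir, two_dirs)
        print(f"{size:>6}  {phase:<10}  {change:+6.1f}%")
```

A negative value means the two-dir configuration was faster for that phase, a positive value means it was slower.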