Posted to common-user@hadoop.apache.org by Bwolen Yang <wb...@gmail.com> on 2007/06/08 22:30:50 UTC

write and sort performance

Hi.

I ran some performance tests (randomwrite/sort) on a small Hadoop
cluster.  The numbers are unexpected.  Some numbers are so far off that
I suspect either I didn't tune all the right knobs or I had the wrong
expectations.  Hence I am seeking suggestions on how to tune the
cluster and/or explanations of how the underlying parts of Hadoop work.



Here is the setup:
--------------------------
cluster:
  hadoop 0.12.3
  jdk 1.6.0_01
  HDFS file replication factor: 3
  7 machines total
    1 machine for namenode
    1 machine for jobtracker
    5 other machines for slaves (datanodes / mapreduce)


machine spec:
  2GHz dual-core AMD Opteron
  8GB memory
  1 Gbit ethernet
  64-bit Linux 2.6.18-8.1.1.el5

measured disk/network performance:
  ~75MB/sec disk write (tested by "dd if=/dev/zero of=$f ..." to generate a 20GB file)
  ~76MB/sec disk read (tested by "cat $f > /dev/null" on the 20GB file generated above)
  ~30MB/sec copying a file from one machine to another (tested by scp on the 20GB file)

I hope the 20GB size is large enough to alleviate the effects of
in-memory file caching and/or network traffic fluctuations.


Single file Test
-----------------------
I started this test by reformatting the DFS, then tweaked the random
writer to run only one mapper so that it produces only 1 file.  It looks
like when writing out only one file, the machine that ran the map gets
twice as much data written to its disk as the other datanodes.  Any
idea why this is happening and how to get a more even distribution?

Just making sure: does 3x replication guarantee that copies of the
data will be kept on 3 different machines?  (Hopefully the DFS is not
assigning 2 copies of the data blocks to that mapper machine.)
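
One way I plan to check where the replicas actually landed is fsck.  A
rough sketch, assuming this release supports these fsck flags (the path
below is just a placeholder for the random-writer output file):

    # List every block of the output file and the datanodes holding each replica.
    # With 3x replication, each block line should show 3 distinct datanode addresses.
    bin/hadoop fsck /rand/part-00000 -files -blocks -locations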


Multiple files write
---------------------------
Given that in the single-file case the data is not evenly distributed, I
reformatted the DFS and reran the random writer for 5GB of output with 5
mappers (i.e., 1GB per mapper).  This took 294secs with no map/reduce
task failures.  Here are some questions on this test:

- This run used up 25GB of DFS.  I was expecting 3x replication to
mean only 15GB is used.  What is using the other 10GB?  The 25GB
usage is computed from the namenode's "capacity" and "used" percentage
output, and it is consistent with the "remaining" stat.  Just in case I
am reading this wrong, the "Live Datanodes" table's "blocks" column adds
up to 288 blocks.  Is that block size the same as the DFS block size
(which is 64MB in my case)?  If so, this means 18GB worth of blocks.
That is closer, but the numbers still don't add up (what happened to
the other 3GB of disk space, and why doesn't it match the
"Capacity"/"Remaining"/"Used" stats?).  I sketch the cross-check
commands I have in mind after this list.

- Including replication, 15GB gets written.   Given that we have 5
machines/disks writing in parallel, each machine is writing at about
10.4MB/sec, which is about 1/7th of raw disk throughput.   Is this
expected, or are there parameters that I can set to improve this?
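
For the disk-usage question above, these are roughly the commands I
would use to cross-check the numbers.  A rough sketch, assuming the
subcommand names below exist in this release and that /rand is where the
random-writer output went:

    # Logical (pre-replication) size of the random-writer output, in bytes.
    bin/hadoop dfs -du /rand

    # Per-datanode capacity / used / remaining, to compare against the web UI numbers.
    bin/hadoop dfsadmin -report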


Sorter
----------
I ran the Sort example provided with Hadoop on the 5x1GB files generated
by RandomWriter.  It ran with 80 map tasks and 6 reducers, took 1345sec
to complete, and there were no map/reduce task failures.

Looking more closely, a few questions:
- It ran 10 maps at a time.  Is there a way to run only 5 maps at a
time (so that hopefully the scheduler can schedule 1 map on each
machine, accessing only local data)?  A config sketch for what I have in
mind is included after this list.

- A few mappers processed 67MB instead of 64MB.  Why?  (I had thought
the DFS block size is the limit.)

- the fastest map task took 59sec, and the slowest took 139 sec.
If it were purely local disk access, reading and writing <70MB of data
should have taken only a few secs total.   Any idea why there is such
a large performance discrepancy and how to tune this?   Is there a way
to check what % of tasks are working from local disk vs from remote
datanodes?

- The reduce tasks all spent about 760sec in the shuffle phase.
Since we are sorting 5GB of data on 5 machines, I assume ~4GB gets
transferred over the network to the appropriate reducers (assuming
1/5th of the data stays on the local machine).  So each machine reads
about 820MB of data to send to the other slaves, and each machine also
receives about 820MB of data.  Assuming scp-like performance, one
machine distributing its data to the other 4 machines can be done within
~30sec (27sec plus some scp startup overhead).  Even if the machines did
their copying out sequentially, one at a time, the shuffle could be done
in ~150sec.  This 5x discrepancy is quite a bit larger than I expected.

- The reduce tasks' sort phase took 12sec to 43sec.  Since each reduce
task only has about 1GB of data, this is running at 24MB/sec to 85MB/sec.
24MB/sec seems reasonable given that even with an in-memory sort,
there is one local disk read and one 3x-replicated HDFS write.
However, 85MB/sec seems too fast (faster than the raw local disk speed).
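
On the question above about limiting the cluster to 5 concurrent maps:
my understanding is that the knob is the per-tasktracker task-slot limit
in hadoop-site.xml on each slave (restart the tasktrackers after
changing it).  A sketch, assuming the property is named
mapred.tasktracker.tasks.maximum in this release; if its default is 2,
that would also explain the 10 concurrent maps I saw on 5 slaves:

    <!-- hadoop-site.xml on each slave: at most 1 task per tasktracker, so a -->
    <!-- 5-slave cluster runs at most 5 maps (or reduces) at a time.         -->
    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>1</value>
    </property>

Note that this limit covers reduce tasks too, so it would also serialize
the reduce phase on each node.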

-------------------

Finally, I am new to Java.  Any suggestions on a good profiling tool to
use with Hadoop?  It would be nice if the profiler could help identify
CPU/network/disk bottlenecks.
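
The only thing I have found so far is the JDK's built-in hprof agent on
the task JVMs.  A rough sketch, assuming the child-task JVM options can
be set via a mapred.child.java.opts property (I am not sure that
property, or the -Xmx200m default, exists in this exact release):

    <!-- hadoop-site.xml: CPU-sample each task JVM and write a profile on exit. -->
    <!-- The heap size here is just an example.                                 -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx200m -agentlib:hprof=cpu=samples,interval=10,depth=8,file=/tmp/task.hprof.txt</value>
    </property>

hprof only covers CPU and heap, though, so disk/network stalls would
still need something like iostat on the side.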

thanks

bwolen

Re: write and sort performance

Posted by Bwolen Yang <wb...@gmail.com>.
> Please try Hadoop 0.13.0.

The multiple-random-writers case now completes in 278sec (down from
294sec).  This difference is not big.  In comparison, the sorter
improvement is more impressive.

For the sorter, the map phase now completes in a tighter range, 55sec to
79sec (instead of 59sec to 139sec).  This speedup (from better
scheduling?) reduces the overall running time significantly (890sec vs
1345sec).  The performance of the reduce phase is similar (for both the
shuffle and sort phases).

Let me summarize my remaining questions in the next email.  Looks like
my original email is way too long to get specific answers.

thanks

bwolen

Hadoop Reduce-dynamic sorting

Posted by yu-yang chen <yy...@doc.ic.ac.uk>.
I am wondering if there exists a sorting function in the reduce
phase which will sort all the keys in ascending order and output them
as a list?

My original input is points on a graph; the map function applies some
kind of transformation to the y values, and the key/value pairs are
represented as the x value / transformed y value.  I would like my
reduce function to return a sorted list containing all the transformed
points in ascending x order...

could someone please tell me if it is possible?
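
If the framework already sorts the keys before handing them to the
reduce phase, then I think an identity reducer with a single reduce task
would give me the list I want.  To make the question concrete, roughly
this (a sketch using the contrib streaming jar; the jar path, the
transform.sh script, and the option spellings are guesses on my part):

    # transform.sh is my hypothetical mapper: it reads "x y" points and emits
    # "x<TAB>f(y)" lines.  /bin/cat as the reducer just echoes what it receives,
    # so if the framework sorts by key, the single output file should come out
    # in ascending x order (keys compare as text in streaming, so numeric x
    # values would need zero-padding).
    bin/hadoop jar contrib/hadoop-streaming.jar \
        -input points -output sorted-points \
        -mapper transform.sh -reducer /bin/cat \
        -jobconf mapred.reduce.tasks=1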

yu-yang

Re: write and sort performance

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Bwolen,

First of all, Hadoop is not optimized for small clusters or small bursts 
of writes/reads.  There are some costs (like storing a copy locally and 
copying it locally) that don't have benefits for small clusters.

You could try using different disks (not just different partitions) for 
the Maps' tmp directory and for the Datanode.

To compare a single-node write with Hadoop, you should run 'bin/hadoop 
dfs -copyFromLocal - test' and pipe your dd command's output there.  Maybe 
you will see 25% of the 75MB/sec you saw with the native write; that is 
not unexpected.  Not sure if you want to know all the details of why it 
is so.  In your test you also have many other one-time costs, such as 
starting and stopping jobs.
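
Concretely, something like this (the dd flags and the target path are 
only an example):

    # Stream 20GB of zeros straight into HDFS and time it, to compare with the
    # ~75MB/sec you measured for a plain local dd.
    time dd if=/dev/zero bs=1M count=20480 | bin/hadoop dfs -copyFromLocal - test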

I don't mean to say Hadoop can't do better; its performance is steadily 
improving.  But your expectations for a toy application might be off.

If you want to figure out what the problem could be, you could start 
with the 'copyFromLocal' example above.  There you need to figure out what 
the Datanode process and the Hadoop shell are doing at various times 
(maybe with stack traces).
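
For the stack traces, something along these lines on one of the 
datanodes while the copy is running (jps and jstack ship with JDK 6; the 
grep pattern is just an example):

    # Find the DataNode JVM and dump its thread stacks a few times during the
    # transfer, to see where it is spending its time.
    DN_PID=$(jps | grep DataNode | awk '{print $1}')
    for i in 1 2 3; do jstack $DN_PID > /tmp/datanode.stack.$i; sleep 10; done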

Raghu.


Re: write and sort performance

Posted by Bwolen Yang <wb...@gmail.com>.
> Please try Hadoop 0.13.0.  I don't know whether it will address your
> concerns, but it should be faster and is much closer to what developers
> are currently working on.

OK.  It would also be good to see how the DFS upgrade goes between
versions.  (Looks like it got released today.  Cool.)


> For such a small cluster you'd probably be better running the jobtracker
> and namenode on the same node and gain another slave.

When the namenode and jobtracker were running on the same machine, I
noticed failures due to losing contact with the jobtracker.  This is why
I split them onto separate machines.

With regard to the performance details, they are really independent of
how many slaves I have.  The test is mainly trying to see how closely
Hadoop compares to a single node or to scp, and what tuning parameters
would make things run faster.

Any suggestions on java profiling tools?

bwolen

Re: write and sort performance

Posted by Doug Cutting <cu...@apache.org>.
Bwolen Yang wrote:
> Here is the setup:
> --------------------------
> cluster:
>  hadoop 0.12.3

Please try Hadoop 0.13.0.  I don't know whether it will address your 
concerns, but it should be faster and is much closer to what developers 
are currently working on.

>  jdk 1.6.0_01
>  HDFS file replication factor: 3
>  7 machines total
>    1 machine for namenode
>    1 machine for jobtracker
>    5 other machines for slaves (datanodes / mapreduce)

For such a small cluster you'd probably be better running the jobtracker 
and namenode on the same node and gain another slave.

Doug