Posted to common-user@hadoop.apache.org by Bwolen Yang <wb...@gmail.com> on 2007/06/09 03:56:45 UTC

performance questions

Here is a summary of my remaining questions from the [write and sort
performance] thread.

- Looks like for every 5GB of data I put into Hadoop DFS, it uses up ~18GB of
raw disk space (based on block counts exported from the namenode).
Accounting for 3x replication, I was expecting 15GB. What's causing
this 20% overhead?

- when a large amount of data is written to HDFS (for example via
copyFromLocal), is the file block replication pipelined?  Also, does
one 64MB block need to be fully replicated before the next 64MB copy
can start?

- is there a way to control how many mappers are actively running at a
time?  i.e., I would like to try matching the number of running
mappers to the number of slaves to see an individual mapper's performance
without interference.
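For example, I am guessing a per-node cap in hadoop-site.xml might do it,
though I am not sure of the exact property name in this release; treat this
as an assumption rather than a known-good setting:

    <property>
      <!-- max simultaneous map tasks per tasktracker; name may differ by version -->
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
    </property>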

- is there a way to force each mapper to process only 64MB of data?
Some were processing 67MB during a sort.

- what's the file access pattern for the mapper when the data is
local?  I sort of expect reading 1 local 64MB file and possibly
writing out R local files each with 64/R MB worth of data, where R is
the number of reducers.   Is this wrong?   I haven't seen a mapper
task that runs close to this fast.

(The shuffle question probably shares some answers with the mapper
question... so I'll omit it for now.)

- I did the copyFromLocal test (dd | bin/hadoop dfs -copyFromLocal)
suggested by Raghu.  Both 1GB tests show a throughput of 9.2MB/sec
(for a 2GB copy, it is around 8.3MB/sec).   This is consistent with the
earlier randomwriter result (10.4MB/sec).    So it is only around
11-14% of raw disk performance.
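Roughly, the test was of this form (the sizes and destination path here are
illustrative, and I'm assuming "-" makes copyFromLocal read from stdin --
adjust as needed):

    # pipe 1GB straight into DFS and time the whole thing
    time dd if=/dev/zero bs=1M count=1024 | bin/hadoop dfs -copyFromLocal - /bench/dd-test
    # MB/sec = 1024 / elapsed seconds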

bwolen

Re: performance questions

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
 >  - 1 replica / 1 slave case writes at 15MB/sec.  This seems to point
 > the performance problem to how datanode writes data (even to itself).

On Hadoop, most of the delay you are seeing for the 1 replica test with one
node is because of this: the client first writes 64MB to a local tmp file,
then it sends that 64MB file over the (local) network to the DataNode on the
same node before starting to write the next 64MB. Writing to the tmp file and
sending to the DataNode are *not* pipelined.

Disk b/w is not always equal to the raw serial read/write bandwidth you get
on a fresh partition with a large disk. (In fact, 75MBps sounds pretty high;
what kind of disk is it? Is it a RAID, or a 10K rpm disk?)

I would suggest a simple exercise: write a 20GB file with dd as you
initially did, which gave you 75MBps. Now read this file and write
another 20GB at the same time. Do you see 38MBps for each of the read and
the write? You mostly won't. Where did the missing bandwidth go? You could
repeat this on a partition that is 80% full. There are more factors that
affect disk performance than raw serial read/write b/w, the most
important of them being disk seeks.
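Something like this, for example (paths and sizes are only placeholders;
20GB is presumably bigger than RAM, so the read mostly hits the disk rather
than the page cache):

    # the serial write you already measured (~75MBps)
    dd if=/dev/zero of=/data/f1 bs=1M count=20480

    # now read f1 and write f2 concurrently; compare the MB/s each dd reports
    dd if=/data/f1 of=/dev/null bs=1M &
    dd if=/dev/zero of=/data/f2 bs=1M count=20480 &
    wait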

This is not Hadoop related, and Hadoop's inefficiencies are not necessarily
due to the same cause.

Also, the 30MBps at which you tested your network is most likely limited by
the ssh processing in scp rather than by the b/w of the network. How can you
confirm it?
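One way, for instance, is to push bytes over a plain TCP connection and see
what rate dd reports -- no encryption involved (this assumes netcat is
installed; the -l syntax varies between netcat variants):

    # on the receiving node
    nc -l -p 12345 > /dev/null

    # on the sending node (kill nc afterwards if it lingers)
    dd if=/dev/zero bs=1M count=1024 | nc <receiver-host> 12345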

Raghu.

Bwolen Yang wrote:
> Raghu,
> 
> The 1 replica and "du" suggestions are good.  thank you.
> 
> To further reduce the variables, I also tried 1 replica/1 slave case.
> (namenode and jobtracker are still on their own machines.)
> 
> - randomwriter:
>  - 1 replica / 1 slave case writes at 15MB/sec.  This seems to point
> the performance problem to how datanode writes data (even to itself).
> 
>  - 1 replica / 5 slave case's running time is 1/4th of 3 replica
> case.  Perfect scaling would have been 1/3rd.  So, there is a 33%
> additional performance overhead lost to replication (beyond writing 3x
> as much data).
> 
> 
>>  - Looks like for every 5GB of data I put into Hadoop DFS, it uses up ~18GB....
> 
> It turned out there are a few blocks that are only a few KB.  "du" is the
> right tool.  The actual raw disk overhead is only 1%.  Thanks.
> 
> 
>> You are assuming each block is 64M. There are some blocks for "CRC
>> files". Did you try to du the datanode's 'data directories'?
> 
> All blk_* files are 64MB or less.
> 
> However, some mappers still show it is accessing
>                part-0:1006632960+70663780
> where 70663780 is about 67MB.   Hmm... looks like it is only doing so
> at the last block.  I guess that's not too bad.
> 
> 
>> They are pipelined.
> 
> you're right :).   the slowness exists even in single slave / single
> replica case.
> 
> thanks
> 
> bwolen


Re: performance questions

Posted by Bwolen Yang <wb...@gmail.com>.
Raghu,

The 1 replica and "du" suggestions are good.  thank you.

To further reduce the variables, I also tried 1 replica/1 slave case.
(namenode and jobtracker are still on their own machines.)

- randomwriter:
  - 1 replica / 1 slave case writes at 15MB/sec.  This seems to point
the performance problem to how datanode writes data (even to itself).

  - 1 replica / 5 slave case's running time is 1/4th of 3 replica
case.  Perfect scaling would have been 1/3rd.  So, there is a 33%
additional performance overhead lost to replication (beyond writing 3x
as much data).
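(To spell out the arithmetic: if the 3 replica run takes time T, writing
1/3rd as many bytes should ideally take T/3, but the measured single-replica
time was T/4, and (T/3)/(T/4) = 4/3, i.e. roughly 33% extra.)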


>  - Looks like for every 5GB of data I put into Hadoop DFS, it uses up ~18GB....

It turned out there are a few blocks that are only a few KB.  "du" is the
right tool.  The actual raw disk overhead is only 1%.  Thanks.


> You are assuming each block is 64M. There are some blocks for "CRC
> files". Did you try to du the datanode's 'data directories'?

All blk_* files are 64MB or less.

However, some mappers still show it is accessing
                part-0:1006632960+70663780
where 70663780 is about 67MB.   Hmm... looks like it is only doing so
at the last block.  I guess that's not too bad.


> They are pipelined.

you're right :).   the slowness exists even in single slave / single
replica case.

thanks

bwolen

Re: performance questions

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Your interest is good. I think you should ask an even smaller number of
questions in one mail and try to do more experimentation.

Bwolen Yang wrote:
> Here is a summary of my remaining questions from the [write and sort
> performance] thread.
> 
> - Looks like for every 5GB of data I put into Hadoop DFS, it uses up ~18GB of
> raw disk space (based on block counts exported from the namenode).
> Accounting for 3x replication, I was expecting 15GB. What's causing
> this 20% overhead?

You are assuming each block is 64M. There are some blocks for "CRC
files". Did you try to du the datanode's 'data directories'?

> - when a large amount of data is written to HDFS (for example via
> copyFromLocal), is the file block replication pipelined?  Also, does
> one 64MB block need to be fully replicated before the next 64MB copy
> can start?

They are pipelined. Again, you can experiment by trying with a single
replica (set in the config) and see if it runs much faster. If it does not,
then they should indeed be pipelined.
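For that experiment, the relevant setting is dfs.replication in
hadoop-site.xml, e.g.:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>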

Raghu.