Posted to common-user@hadoop.apache.org by Rui Shi <sh...@yahoo.com> on 2007/12/05 01:12:31 UTC

Some performance observation

Hi,

I tried running some jobs in Hadoop. I have the following setup:
 
 - The input is about 500 gzipped files (~10MB each).
 - I have 8 machines in the cluster.
 - The job simply extracts a certain field from each line of the input and then aggregates.
 - The job takes about 40 minutes to finish (~1 min 40 sec per map task).
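
As a rough sanity check on these figures, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate. It assumes one map task per gzipped file (gzip input is not splittable, but the actual task count is not stated here) and that the map phase accounts for essentially all of the 40 minutes. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope estimate from the figures quoted in the setup above.
# Assumptions (not confirmed by the post): one map task per gzipped file,
# and the map phase dominates the total job time.

NUM_FILES = 500         # gzipped input files
FILE_MB = 10            # approximate compressed size of each file, in MB
NUM_MACHINES = 8        # machines in the cluster
JOB_SECONDS = 40 * 60   # total job time: about 40 minutes
TASK_SECONDS = 100      # about 1 min 40 sec per map task


def summarize(num_files, file_mb, machines, job_s, task_s):
    """Derive average concurrency and throughput from the raw figures."""
    total_task_s = num_files * task_s      # total map work, in task-seconds
    concurrent = total_task_s / job_s      # average map tasks running at once
    return {
        "total_task_seconds": total_task_s,
        "avg_concurrent_tasks": concurrent,
        "avg_tasks_per_machine": concurrent / machines,
        "per_task_mb_per_s": file_mb / task_s,
        "aggregate_mb_per_s": num_files * file_mb / job_s,
    }


stats = summarize(NUM_FILES, FILE_MB, NUM_MACHINES, JOB_SECONDS, TASK_SECONDS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions each map task reads only about 0.1 MB/s of compressed input, and the whole cluster about 2 MB/s, with roughly 2.6 map tasks running per machine on average. That suggests the time is going to per-task work rather than to raw I/O bandwidth.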

My question: a similar ad hoc query running over NFS takes about 5 minutes. Can anybody explain why the Hadoop job is so much slower, and what I should do to improve it?

Thanks,

Rui

----- Original Message ----
From: Doug Cutting <cu...@apache.org>
To: hadoop-user@lucene.apache.org
Sent: Tuesday, December 4, 2007 2:17:26 PM
Subject: Re: Question about reduce copy speed.


Jason Venner wrote:
> When my reduce is running, on the status page I see the following for
> the incomplete reduce's
> 
> reduce > copy (643 of 789 at 0.12 MB/s) >

Reducers cannot copy any faster than mappers can generate output.  When
all maps are complete, how long does it take before copying is complete?
If that delay is small, then copying is keeping up with map output.

> Is that the actual transfer rate between machines, or is that a 
> misleading number?

It's the rate that a given reduce task is able to get output.  If you're
running multiple reduce tasks per node, then that node's rate will be
higher.  As mentioned above, it's limited by the rate that maps generate
output.  And copying competes with map input for disk and network
bandwidth.

Doug





