Posted to common-user@hadoop.apache.org by Rui Shi <sh...@yahoo.com> on 2007/12/05 01:12:31 UTC
Some performance observation
Hi,
I tried running some jobs in Hadoop with the following setup:
- The input is about 500 gzipped files (~10 MB each).
- The cluster has 8 machines.
- The job simply extracts a certain field from each line of the input and then aggregates.
- The job takes about 40 minutes to finish (~1 min 40 sec per map task).
My question is this: a similar ad hoc query running over NFS takes about 5 minutes. Can anybody explain why Hadoop is so much slower, and what I should do to improve it?
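As a rough sanity check on these figures, the Python sketch below estimates how many map tasks were running at once and how fast each one consumed its input. It assumes one map task per gzipped file (gzip files are not splittable) and that the 40 minutes is dominated by the map phase. Both are assumptions, not measurements, and the variable names are purely illustrative.

```python
# Back-of-the-envelope estimate from the numbers in this post.
# Assumptions (not measured): one map task per gzipped file, and the
# 40-minute wall-clock time is dominated by the map phase.

num_files = 500            # ~500 gzipped input files
file_size_mb = 10          # ~10 MB each (compressed)
num_machines = 8
map_task_seconds = 100     # ~1 min 40 sec per map task
job_seconds = 40 * 60      # ~40 minutes for the whole job

# Total map work, and how many maps must overlap to fit in the job time.
total_task_seconds = num_files * map_task_seconds
avg_concurrent_maps = total_task_seconds / job_seconds
maps_per_machine = avg_concurrent_maps / num_machines

# Compressed-input throughput of a single map task and of the whole job.
per_task_mb_per_s = file_size_mb / map_task_seconds
job_mb_per_s = num_files * file_size_mb / job_seconds

print(f"total map task time: {total_task_seconds} s")
print(f"average concurrent maps: {avg_concurrent_maps:.1f}")
print(f"average maps per machine: {maps_per_machine:.1f}")
print(f"per-task input rate: {per_task_mb_per_s:.2f} MB/s (compressed)")
print(f"whole-job input rate: {job_mb_per_s:.2f} MB/s (compressed)")
```

Under these assumptions each map task reads its compressed input quite slowly, which suggests that the time spent per map task is worth looking at before the number of machines.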
Thanks,
Rui
----- Original Message ----
From: Doug Cutting <cu...@apache.org>
To: hadoop-user@lucene.apache.org
Sent: Tuesday, December 4, 2007 2:17:26 PM
Subject: Re: Question about reduce copy speed.
Jason Venner wrote:
> When my reduce is running, on the status page I see the following for
> the incomplete reduce's
>
> reduce > copy (643 of 789 at 0.12 MB/s) >
Reducers cannot copy any faster than mappers can generate output. When
all maps are complete, how long does it take before copying is complete?
If that delay is small, then copying is keeping up with map output.
> Is that the actual transfer rate between machines, or is that a
> misleading number?
It's the rate that a given reduce task is able to get output. If you're
running multiple reduce tasks per node, then that node's rate will be
higher. As mentioned above, it's limited by the rate that maps generate
output. And copying competes with map input for disk and network
bandwidth.
Doug
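To make the status line quoted above easier to read, here is a minimal Python sketch that pulls out the number of map outputs copied, the total, and the reported per-task rate. The helper `parse_copy_status` and its regular expression are hypothetical: they are written against the single example string in this message, not against any documented Hadoop status format.

```python
import re

# Status string as quoted in the message above.
status = "reduce > copy (643 of 789 at 0.12 MB/s) >"

# Hypothetical pattern: derived from the one example string above,
# not from a documented Hadoop status format.
STATUS_RE = re.compile(r"copy \((\d+) of (\d+) at ([\d.]+) MB/s\)")

def parse_copy_status(text):
    """Return (copied, total, rate_mb_per_s), or None if no match."""
    match = STATUS_RE.search(text)
    if match is None:
        return None
    copied, total, rate = match.groups()
    return int(copied), int(total), float(rate)

copied, total, rate = parse_copy_status(status)
remaining = total - copied
print(f"{copied}/{total} map outputs copied ({copied / total:.1%}), "
      f"{remaining} remaining, reported rate {rate} MB/s")
```

The reported rate is for one reduce task only, so it says little on its own about the network; the count of outputs still to be copied once all maps have finished is the more telling number.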