Posted to common-dev@hadoop.apache.org by Gopal Gandhi <go...@yahoo.com> on 2008/07/23 22:31:56 UTC
Fw: question on HDFS
Hi folks,
Does anybody have a comment on that? Why do we let the reducer fetch local data through HTTP rather than SSH?
----- Forwarded Message ----
From: Gopal Gandhi <go...@yahoo.com>
To: core-user@hadoop.apache.org
Cc: acm@yahoo-inc.com
Sent: Tuesday, July 22, 2008 6:30:49 PM
Subject: Re: question on HDFS
That's interesting. Why let the reducer fetch local data through HTTP rather than SSH?
----- Original Message ----
From: Arun C Murthy <ac...@yahoo-inc.com>
To: core-user@hadoop.apache.org
Sent: Tuesday, July 22, 2008 2:19:36 PM
Subject: Re: question on HDFS
Mori,
On Jul 22, 2008, at 12:22 PM, Mori Bellamy wrote:
> hey all,
> let us say that i have 3 boxes, A B and C. initially, map tasks are
> running on all 3. after most of the mapping is done, C is 32% done
> with reduce (so still copying stuff to its local disk) and A is
> stuck on a particularly long map-task (it got an ill-behaved record
> from the input splits). does A's intermediate map output data go
> directly to C's local disk, or is it still written to HDFS and
> therefore distributed amongst all the machines? also, will A's disk
> be a favored target for A's output bytes, or is the target volume
> independent of the corresponding mapper?
>
Intermediate outputs (i.e. map outputs) are written to the local disk
and not to HDFS. The reduce fetches the intermediate outputs via HTTP.
hth,
Arun
> Thanks! The answer to this question should clear a lot of things up
> for me.
Re: Fw: question on HDFS
Posted by Steve Loughran <st...@apache.org>.
Gopal Gandhi wrote:
> Hi folks,
>
> Does anybody have a comment on that? Why do we let the reducer fetch local data through HTTP rather than SSH?
>
Presumably because it's way more efficient and, being fairly stateless,
scales well. SSH has an expensive connection overhead as well as the
encrypt/decrypt costs.
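
For readers following the thread, here is a minimal sketch of the pattern discussed above: a file on one node's local disk is exposed over plain HTTP, and a consumer fetches it with a single independent GET. This is not Hadoop's actual shuffle code. The function names, the file name map_0.out, and the use of Python's built-in http.server are all assumptions made purely for illustration.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path


def serve_directory(directory):
    """Serve `directory` over HTTP on an ephemeral localhost port.

    Stands in (hypothetically) for the node holding map output on local disk.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_map_output(port, name):
    """Consumer side: one self-contained GET per file, no login or session."""
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/{name}") as resp:
        return resp.read()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name standing in for one map task's output.
        Path(tmp, "map_0.out").write_bytes(b"key1\t1\nkey2\t1\n")
        server = serve_directory(tmp)
        try:
            return fetch_map_output(server.server_address[1], "map_0.out")
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    print(demo())
```

Each request here carries everything the server needs, which is the "fairly stateless" property mentioned in the reply; there is no key exchange or per-connection encryption as there would be with SSH.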