Posted to common-user@hadoop.apache.org by Steve Sapovits <ss...@invitemedia.com> on 2008/02/26 02:09:20 UTC

map/reduce across a WAN

Can someone tell me if the following scenario is feasible?  My apologies if this
has already been covered ...

Suppose I have N data centers, each with a Hadoop cluster and each with its
own HDFS file system on the cluster.  So, for writing purposes, if I have 4
data centers, I have 4 HDFS file systems, each in its own data center.  When
writing, no cluster ever has to look outside of the data center it's in.

Now suppose that for full data correlation, I want to map/reduce across data
centers, looking at the N individual clusters as one big HDFS.  First, is this
possible?  That is, can I create write configurations that look at clusters
differently than read (map/reduce) configurations?  Not all map/reduce reads
would be across data centers -- only some.  Second, if this is possible, does
map/reduce push most of the heavy data movement off to each cluster to
minimize the traffic across the inter-data-center connections?  That is, would
I be okay with the slower connections between data centers if the map/reduce
jobs are aggregating data down to much smaller final sets?
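
For concreteness, here's roughly the kind of job setup I'm imagining, using
the plain mapred API -- the namenode hostnames, ports, and paths (nn-dc1,
nn-dc2, /logs/...) are made up, and I haven't actually tried any of this:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CrossDcJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CrossDcJob.class);
        conf.setJobName("cross-dc-correlation");

        // Fully-qualified input paths, so some of them can point at the
        // HDFS instances in the other data centers (hostnames made up).
        FileInputFormat.addInputPath(conf, new Path("hdfs://nn-dc1:9000/logs/2008-02-25"));
        FileInputFormat.addInputPath(conf, new Path("hdfs://nn-dc2:9000/logs/2008-02-25"));

        // Output stays on the cluster the job runs in.
        FileOutputFormat.setOutputPath(conf, new Path("hdfs://nn-dc1:9000/correlated/2008-02-25"));

        // Mapper/reducer setup omitted -- the point is just the path wiring.
        JobClient.runJob(conf);
      }
    }

The write side would presumably just keep fs.default.name pointed at the
local namenode, so the everyday jobs never leave the data center.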

Any pointers or experience feedback here appreciated.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: map/reduce across a WAN

Posted by Ted Dunning <td...@veoh.com>.
If moving data between the data centers is an issue, then you really shouldn't
just run map-reduce willy-nilly between the data centers.

Either reduce the data to summary form in one data center and move the
summary, or just move the data.

At least this way you control how many times the data moves and in which
direction.  Otherwise you don't, and the situation will be even worse.
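
To make the first option concrete, a sketch (hostnames and paths are
invented): run the aggregation job in the data center that has the raw
data, then copy just the small summary output across with the FileSystem
API:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MoveSummary {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The aggregation job already ran in dc1, so /summaries is small;
        // only this directory ever crosses the WAN.
        FileSystem dc1 = FileSystem.get(new URI("hdfs://nn-dc1:9000/"), conf);
        FileSystem dc2 = FileSystem.get(new URI("hdfs://nn-dc2:9000/"), conf);

        FileUtil.copy(dc1, new Path("/summaries/2008-02-25"),
                      dc2, new Path("/summaries/2008-02-25"),
                      false,   // don't delete the source
                      conf);
      }
    }

If what you are moving is larger, distcp does the same kind of copy in
parallel.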


On 2/25/08 5:50 PM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> Ted Dunning wrote:
> 
>> I think the best way to accomplish this sort of goal is to go ahead and run
>> independent clusters and somehow add the ability to propagate files between
>> clusters.  Then the cross-cluster map-reduce can run in the cluster that has
>> originals or replicas of all of the necessary files.
> 
> The problem is the amount of data.  We're using HDFS because the volume will
> be huge.  On the surface, replicating files across data centers would appear to
> take something that's sized to need map/reduce and force it down a serial sort
> of pipe, over a slow connection.  Then we'd have to fan it back out to HDFS at
> each replicated site in order to crunch the data.
> 
> That's why I was thinking that if map/reduce pushes most of the work off to
> the individual nodes where the data resides, map/reducing across the entire
> set of boxes in all data centers might make more sense -- that would seem to
> reduce the volume of data flowing between data centers.
> 
> Assume that the connections between data centers will be relatively slow, at
> least compared to local gigabit LAN speeds within a data center.


Re: map/reduce across a WAN

Posted by Steve Sapovits <ss...@invitemedia.com>.
Ted Dunning wrote:

> I think the best way to accomplish this sort of goal is to go ahead and run
> independent clusters and somehow add the ability to propagate files between
> clusters.  Then the cross-cluster map-reduce can run in the cluster that has
> originals or replicas of all of the necessary files.

The problem is the amount of data.  We're using HDFS because the volume will
be huge.  On the surface, replicating files across data centers would appear to
take something that's sized to need map/reduce and force it down a serial sort
of pipe, over a slow connection.   Then we'd have to fan it back out to HDFS at
each replicated site in order to crunch the data.

That's why I was thinking that if map/reduce pushes most of the work off to the
individual nodes where the data resides, map/reducing across the entire set of
boxes in all data centers might make more sense -- that would seem to reduce
the volume of data flowing between data centers.

Assume that the connections between data centers will be relatively slow, at
least compared to local gigabit LAN speeds within a data center.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: map/reduce across a WAN

Posted by Ted Dunning <td...@veoh.com>.
I think the best way to accomplish this sort of goal is to go ahead and run
independent clusters and somehow add the ability to propagate files between
clusters.  Then the cross-cluster map-reduce can run in the cluster that has
originals or replicas of all of the necessary files.

The question then is how you cause replicas to exist.  For many applications
a simple script would suffice.

This has many benefits over cross-cluster magic, notably that files would
only be copied once.
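
As a rough sketch of what such a script might look like (hostnames and paths
invented, untested), it could just walk a directory on the source cluster and
copy over whatever the destination doesn't have yet:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class PropagateFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem src = FileSystem.get(new URI("hdfs://nn-east:9000/"), conf);
        FileSystem dst = FileSystem.get(new URI("hdfs://nn-west:9000/"), conf);

        Path dir = new Path("/shared");
        for (FileStatus stat : src.listStatus(dir)) {
          Path target = new Path(dir, stat.getPath().getName());
          // Copy only what the destination doesn't already have,
          // so each file crosses the WAN at most once.
          if (!dst.exists(target)) {
            FileUtil.copy(src, stat.getPath(), dst, target, false, conf);
          }
        }
      }
    }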


On 2/25/08 5:09 PM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> 
> Can someone tell me if the following scenario is feasible?  My apologies if this
> has already been covered ...
> 
> Suppose I have N data centers, each with a Hadoop cluster and each with its
> own HDFS file system on the cluster.  So, for writing purposes, if I have 4
> data centers, I have 4 HDFS file systems, each in its own data center.  When
> writing, no cluster ever has to look outside of the data center it's in.
> 
> Now suppose that for full data correlation, I want to map/reduce across data
> centers, looking at the N individual clusters as one big HDFS.  First, is this
> possible?  That is, can I create write configurations that look at clusters
> differently than read (map/reduce) configurations?  Not all map/reduce reads
> would be across data centers -- only some.  Second, if this is possible, does
> map/reduce push most of the heavy data movement off to each cluster to
> minimize the traffic across the inter-data-center connections?  That is, would
> I be okay with the slower connections between data centers if the map/reduce
> jobs are aggregating data down to much smaller final sets?
> 
> Any pointers or experience feedback here appreciated.