Posted to common-user@hadoop.apache.org by Jason Smith <sm...@gmail.com> on 2010/11/03 20:18:18 UTC

Any projects to help with running MapReduce across physically distributed clusters?

I am looking into the problem of running jobs to generate statistics across
a large data set that would be split into different clusters
geographically.  Each cluster would have a unique piece of the overall data
set, as the network overhead to collocate the data would be too much. I
tried searching around for any tools that might help orchestrate something
like this, but did not find anything. Are there any tools I'm missing that I
should look into?

Thanks
Jason

Re: Any projects to help with running MapReduce across physically distributed clusters?

Posted by Chris K Wensel <ch...@wensel.net>.
You could easily write Cascading apps that could pull all the data into a single source and perform the processing.

You could also use it to launch jobs on different clusters from a single application (each Flow can be given its own properties, causing it to run its MapReduce jobs on an arbitrary cluster).

So you can effectively run the number crunching remotely on each independent cluster, then have the results pulled down to a single cluster and loaded into any backend systems. Cascading can coordinate the scheduling of the Flows across clusters (via the Cascade abstraction).
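
As a rough illustration of the "crunch remotely, then pull the results down" pattern, here is a minimal Python sketch. It does not use Cascading or Hadoop. It only shows the kind of small, mergeable summary each cluster could emit so that the raw data never has to cross the network. The class name PartialStats, the combine helper, and the cluster names and values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialStats:
    """Mergeable summary of the values held by one cluster (hypothetical)."""
    count: int
    total: float
    total_sq: float

    @staticmethod
    def from_values(values):
        # Runs locally on a cluster; only the three numbers leave the site.
        values = list(values)
        return PartialStats(len(values), sum(values), sum(v * v for v in values))

    def merge(self, other):
        # Merging is associative and commutative, so the order in which
        # cluster results arrive does not matter.
        return PartialStats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self):
        return self.total / self.count

    @property
    def variance(self):
        # Population variance computed from the sums: E[x^2] - (E[x])^2.
        return self.total_sq / self.count - self.mean ** 2


def combine(partials):
    """Fold per-cluster summaries into one overall summary."""
    result = PartialStats(0, 0.0, 0.0)
    for partial in partials:
        result = result.merge(partial)
    return result


# Made-up per-cluster data, standing in for the output of remote jobs.
us_east = PartialStats.from_values([1.0, 2.0, 3.0])
eu_west = PartialStats.from_values([4.0, 5.0])
overall = combine([us_east, eu_west])
```

This only works directly for statistics that decompose into sums (counts, means, variances, histograms). The sum-of-squares form of the variance can also lose precision on large or tightly clustered values, so a real job might prefer a more numerically stable merge.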

ckw

On Nov 3, 2010, at 12:18 PM, Jason Smith wrote:

> I am looking into the problem of running jobs to generate statistics across
> a large data set that would be split into different clusters
> geographically.  Each cluster would have a unique piece of the overall data
> set, as the network overhead to collocate the data would be too much. I
> tried searching around for any tools that might help orchestrate something
> like this, but did not find anything. Are there any tools I'm missing that I
> should look into?
> 
> Thanks
> Jason

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading