Posted to general@hadoop.apache.org by elton sky <el...@gmail.com> on 2011/04/30 09:18:34 UTC

questions about hadoop map reduce and compute intensive related applications

I got 2 questions:

1. I am wondering how Hadoop MR performs when it runs compute-intensive
applications, e.g. computing Pi with the Monte Carlo method. There's an
example in 0.21, QuasiMonteCarlo, but that example doesn't use random
numbers; it generates quasi-random input points upfront. If we used
distributed random number generation instead, I would guess Hadoop's
performance should be similar to that of a message-passing framework like
MPI. So my guess is that, with the proper approach, Hadoop can be
competitive with MPI for compute-intensive applications.
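
For illustration, here is a minimal sketch of what a mapper doing its own
random sampling could look like. The class name is made up, and it assumes
an input format that hands each map task a (task id, number of samples)
pair, roughly the way the bundled example feeds its maps:

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class RandomPiMapper
    extends Mapper<LongWritable, LongWritable, NullWritable, LongWritable> {

  @Override
  protected void map(LongWritable taskId, LongWritable numSamples,
                     Context ctx) throws IOException, InterruptedException {
    // Seed each task differently; a production version would want
    // better-separated random streams than consecutive seeds.
    Random rng = new Random(taskId.get());
    long inside = 0;
    for (long i = 0; i < numSamples.get(); i++) {
      double x = rng.nextDouble();
      double y = rng.nextDouble();
      if (x * x + y * y <= 1.0) {
        inside++;
      }
    }
    // A single reducer sums the counts; pi ~= 4 * inside / total samples.
    ctx.write(NullWritable.get(), new LongWritable(inside));
  }
}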

2. I am looking for applications that have large data sets and require
intensive computation, and that can be divided into a workflow mixing
map-reduce operations with message-passing-like operations. For example,
in step 1 I use Hadoop MR to process 10TB of data and generate a small
output, say 10GB. This 10GB can fit in memory and is better processed with
some form of interprocess communication, which will boost performance. So
in step 2 I would use MPI, etc.
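
A minimal sketch of that two-step pattern, under assumptions: Step1Mapper,
Step1Reducer and the MPI executable step2_solver are hypothetical
placeholders, and the MPI program is simply launched with mpirun once the
MapReduce step finishes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStepWorkflow {
  public static void main(String[] args) throws Exception {
    // Step 1: MapReduce pass that boils the large input down to a small
    // result (args[0] = input dir, args[1] = output dir).
    Configuration conf = new Configuration();
    Job step1 = new Job(conf, "step1-mapreduce");
    step1.setJarByClass(TwoStepWorkflow.class);
    step1.setMapperClass(Step1Mapper.class);     // hypothetical placeholder
    step1.setReducerClass(Step1Reducer.class);   // hypothetical placeholder
    step1.setOutputKeyClass(Text.class);
    step1.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(step1, new Path(args[0]));
    FileOutputFormat.setOutputPath(step1, new Path(args[1]));
    if (!step1.waitForCompletion(true)) {
      System.exit(1);
    }

    // Step 2: hand the (now small) output to an MPI program.
    Process mpi = new ProcessBuilder(
        "mpirun", "-np", "64", "./step2_solver", args[1]).start();
    System.exit(mpi.waitFor());
  }
}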

Is there any application with this property, perhaps in some scientific
research area? Or is plain map-reduce by itself good enough?

Regards,
Elton

Re: questions about hadoop map reduce and compute intensive related applications

Posted by elton sky <el...@gmail.com>.
thanks gmackey,

> There is a project out of Sandia National Lab that puts MR and MPI together
> in a library if you're interested -->
> http://www.sandia.gov/~sjplimp/mapreduce.html


That is an implementation of MR using MPI. I saw it as well but haven't
tried it out.

I am actually looking at integration at the programming model level. For
some applications, you could take advantage of both the high throughput of
MR and the message passing of MPI.

-Elton


On Tue, May 3, 2011 at 12:55 AM, <gm...@cs.ucf.edu> wrote:

>
>  Integrating MPI with map-reduce is currently difficult and/or very ugly,
>> however. Not impossible and there are hackish ways to do the job, but they
>> are hacks.
>>
>
> There is a project out of Sandia National Lab that puts MR and MPI
> together in a library if you're interested -->
> http://www.sandia.gov/~sjplimp/mapreduce.html
>
> The project isn't mature yet, and I haven't actually used it myself.
>
> --
> --
> Grant Mackey
> PhD student Computer Engineering
> University of Central Florida
>

Re: questions about hadoop map reduce and compute intensive related applications

Posted by gm...@cs.ucf.edu.
 
> Integrating MPI with map-reduce is currently difficult and/or very ugly, 
> however. Not impossible and there are hackish ways to do the job, but 
> they are hacks.

There is a project out of Sandia National Lab that puts MR and MPI
together in a library if you're interested --> 
http://www.sandia.gov/~sjplimp/mapreduce.html

The project isn't mature yet, and I haven't actually used it myself.

-- 
--
Grant Mackey
PhD student Computer Engineering
University of Central Florida

Re: questions about hadoop map reduce and compute intensive related applications

Posted by elton sky <el...@gmail.com>.
Ted,

MPI supports node-to-node communications in ways that map-reduce does not,
> however, which requires that you iterate map-reduce steps for many
> algorithms.   With Hadoop's current implementation, this is horrendously
> slow (minimum 20-30 seconds per iteration).
>
> Sometimes you can avoid this by clever tricks.  For instance, random
> projection can compute the key step in an SVD decomposition with one
> map-reduce while the comparable Lanczos algorithm requires more than one
> step per eigenvector (and we often want 100 of them!).
>
> Sometimes, however, there are no known algorithms that avoid the need for
> repeated communication.  For these problems, Hadoop as it stands may be a
> poor fit.  Help is on the way, however, with the MapReduce 2.0 work because
> that will allow much more flexible models of computation.


For applications requiring iterative computation, there is an extension of
Hadoop called HaLoop. HaLoop takes advantage of the invariant part of the
input: it caches it on the local disks of the reducers to avoid repeated
work on the same data across iterations. Another one, Twister, uses
long-running maps and reduces, and has each map handle the same part of the
invariant input in every iteration. Neither of them uses interprocess
communication, because the main performance benefit of both comes from
caching the invariant input.

Some machine learning algorithms require features that are much smaller than
> the original input.  This leads to exactly the pattern you describe.
>  Integrating MPI with map-reduce is currently difficult and/or very ugly,
> however.  Not impossible and there are hackish ways to do the job, but they
> are hacks.


As I am not familiar with applications in machine learning, can you give
specific examples I can look into? For opportunities to integrate message
passing, I'm looking for apps with big data and complex computation, where
the input data can be processed with map-reduce first and a message-passing
model is then better suited for the computation, or vice versa. It may have
multiple steps which build a workflow.

-Elton

Re: questions about hadoop map reduce and compute intensive related applications

Posted by Ted Dunning <td...@maprtech.com>.
On Sat, Apr 30, 2011 at 12:18 AM, elton sky <el...@gmail.com> wrote:

> I got 2 questions:
>
> 1. I am wondering how Hadoop MR performs when it runs compute-intensive
> applications, e.g. computing Pi with the Monte Carlo method. There's an
> example in 0.21, QuasiMonteCarlo, but that example doesn't use random
> numbers; it generates quasi-random input points upfront. If we used
> distributed random number generation instead, I would guess Hadoop's
> performance should be similar to that of a message-passing framework like
> MPI. So my guess is that, with the proper approach, Hadoop can be
> competitive with MPI for compute-intensive applications.
>

Not quite sure what algorithms you mean here, but for trivial parallelism,
map-reduce is a fine way to go.

MPI supports node-to-node communications in ways that map-reduce does not,
however, which requires that you iterate map-reduce steps for many
algorithms.   With Hadoop's current implementation, this is horrendously
slow (minimum 20-30 seconds per iteration).
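
To make the cost concrete, here is a compact sketch of that iterative
pattern (StepMapper and StepReducer are hypothetical placeholders): every
pass submits a brand-new job, so every pass pays the full job startup and
scheduling overhead before doing any useful work.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path state = new Path(args[0]);                 // initial state
    int iterations = Integer.parseInt(args[1]);

    for (int i = 0; i < iterations; i++) {
      Path next = new Path(args[0] + "-iter" + (i + 1));
      Job pass = new Job(conf, "iteration-" + i);
      pass.setJarByClass(IterativeDriver.class);
      pass.setMapperClass(StepMapper.class);        // hypothetical
      pass.setReducerClass(StepReducer.class);      // hypothetical
      pass.setOutputKeyClass(Text.class);
      pass.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(pass, state);
      FileOutputFormat.setOutputPath(pass, next);
      // Each waitForCompletion() launches a whole new job, which is where
      // the tens of seconds of per-iteration overhead come from.
      if (!pass.waitForCompletion(true)) {
        System.exit(1);
      }
      state = next;   // this pass's output feeds the next pass
    }
  }
}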

Sometimes you can avoid this by clever tricks.  For instance, random
projection can compute the key step in an SVD decomposition with one
map-reduce while the comparable Lanczos algorithm requires more than one
step per eigenvector (and we often want 100 of them!).
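
For concreteness, a rough sketch of the random-projection step (this follows
the standard randomized-SVD recipe; whether it is exactly the variant meant
here is an assumption): with a data matrix $A \in \mathbb{R}^{m \times n}$
stored row by row and a small shared random matrix
$\Omega \in \mathbb{R}^{n \times k}$, one map-reduce pass can form

    $Y = A\,\Omega \in \mathbb{R}^{m \times k}$

because each mapper only needs its own rows of $A$ plus the shared $\Omega$.
That projection is the data-heavy step; the remaining factorization steps
work on the much narrower k-column matrices and are comparatively cheap.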

Sometimes, however, there are no known algorithms that avoid the need for
repeated communication.  For these problems, Hadoop as it stands may be a
poor fit.  Help is on the way, however, with the MapReduce 2.0 work because
that will allow much more flexible models of computation.


> 2. I am looking for applications that have large data sets and require
> intensive computation, and that can be divided into a workflow mixing
> map-reduce operations with message-passing-like operations. For example,
> in step 1 I use Hadoop MR to process 10TB of data and generate a small
> output, say 10GB. This 10GB can fit in memory and is better processed with
> some form of interprocess communication, which will boost performance. So
> in step 2 I would use MPI, etc.
>

Some machine learning algorithms require features that are much smaller than
the original input.  This leads to exactly the pattern you describe.
 Integrating MPI with map-reduce is currently difficult and/or very ugly,
however.  Not impossible and there are hackish ways to do the job, but they
are hacks.
