Posted to common-user@hadoop.apache.org by Parker Jones <zo...@hotmail.com> on 2011/09/12 12:09:37 UTC

Is Hadoop the right platform for my HPC application?

Hello all,

I have Hadoop up and running and an embarrassingly parallel problem but can't figure out how to arrange the problem.  My apologies in advance if this is obvious and I'm not getting it.

My HPC application isn't a batch program, but runs in a continuous loop (like a server) *outside* of the Hadoop machines, and it should occasionally farm out a large computation to Hadoop and use the results.  However, all the examples I have come across interact with Hadoop via files and the command line.  (Perhaps I am looking at the wrong places?)

So,
* is Hadoop the right platform for this kind of problem? 
* is it possible to use Hadoop without going through the command line and writing all input data to files?  

If so, could someone point me to some examples and documentation.  I am coding in C/C++ in case that is relevant, but examples in any language should be helpful.

Thanks for any suggestions,
Parker


Re: Is Hadoop the right platform for my HPC application?

Posted by Shevek <sh...@karmasphere.com>.
On 12 September 2011 14:23, Alberto Andreotti <al...@gmail.com> wrote:

> Hi Parker,
>
> I'm also interested in exploring Hadoop's capabilities for HPC; I've been
> doing some experiments with heat transfer problems. Which workloads are you
> trying?
>

My limited understanding suggests you might also look at Pregel or Giraph
for heat transfer problems?

S.

Re: Is Hadoop the right platform for my HPC application?

Posted by Alberto Andreotti <al...@gmail.com>.
Hi Parker,

I'm also interested in exploring Hadoop's capabilities for HPC; I've been
doing some experiments with heat transfer problems. Which workloads are you
trying?

Alberto.

On 12 September 2011 07:09, Parker Jones <zo...@hotmail.com> wrote:

>
> Hello all,
>
> I have Hadoop up and running and an embarrassingly parallel problem but
> can't figure out how to arrange the problem.  My apologies in advance if
> this is obvious and I'm not getting it.
>
> My HPC application isn't a batch program, but runs in a continuous loop
> (like a server) *outside* of the Hadoop machines, and it should occasionally
> farm out a large computation to Hadoop and use the results.  However, all
> the examples I have come across interact with Hadoop via files and the
> command line.  (Perhaps I am looking at the wrong places?)
>
> So,
> * is Hadoop the right platform for this kind of problem?
> * is it possible to use Hadoop without going through the command line and
> writing all input data to files?
>
> If so, could someone point me to some examples and documentation.  I am
> coding in C/C++ in case that is relevant, but examples in any language
> should be helpful.
>
> Thanks for any suggestions,
> Parker
>
>




-- 
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreotti@gmail.com
Skype: andreottialberto

Re: Is Hadoop the right platform for my HPC application?

Posted by Mi...@emc.com.
As part of Hamster, I have a prototype of MPI on YARN running locally that
I plan to contribute as a patch sometime soon. I hope it gets in as part
of official Hadoop 0.23. The Open MPI part of it will be published as a
patch on GitHub (while I sort out the legal requirements) that will have
to be downloaded and applied to the Open MPI trunk.

- Milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)



On 9/14/11 3:22 AM, "Robert Evans" <ev...@yahoo-inc.com> wrote:

>Another option to think about is that there is a Hamster project (
>MAPREDUCE-2911 <https://issues.apache.org/jira/browse/MAPREDUCE-2911> )
>that will allow OpenMPI to run on a Hadoop Cluster.  It is still very
>preliminary and will probably not be ready until Hadoop 0.23 or 0.24.
>
>There are other processing methodologies being developed to run on top of
>YARN (which is the resource scheduler put in as part of Hadoop 0.23):
>http://wiki.apache.org/hadoop/PoweredByYarn
>
>So there are even more choices coming depending on your problem.
>
>--Bobby Evans
>
>On 9/13/11 12:54 PM, "Parker Jones" <zo...@hotmail.com> wrote:
>
>
>
>Thank you for the explanations, Bobby.  That helps significantly.
>
>I also read the article below which gave me a better understanding of the
>relative merits of MapReduce/Hadoop vs MPI.  Alberto, you might find it
>useful too.
>http://grids.ucs.indiana.edu/ptliupages/publications/CloudsandMR.pdf
>
>There is even a MapReduce API built on top of MPI developed at Sandia.
>
>So many options to choose from :-)
>
>Cheers,
>Parker
>
>> From: evans@yahoo-inc.com
>> To: common-user@hadoop.apache.org
>> Date: Mon, 12 Sep 2011 14:02:44 -0700
>> Subject: Re: Is Hadoop the right platform for my HPC application?
>>
>> Parker,
>>
>> The hadoop command itself is just a shell script that sets up your
>>classpath and some environment variables for a JVM.  Hadoop provides a
>>Java API that you should be able to use to write your application without
>>dealing with the command line.  That being said, there is no Map/Reduce
>>C/C++ API.  There is libhdfs.so that will allow you to read/write HDFS
>>files from a C/C++ program, but it actually launches a JVM behind the
>>scenes to handle the actual requests.
>>
>> As for a way to avoid writing your input data into files, the data has
>>to be distributed to the compute nodes somehow.  You could write a
>>custom input format that does not use any input files, and then have it
>>load the data a different way.  I believe that some people do this to
>>load data from MySQL or some other DB for processing.  Similarly you
>>could do something with the output format to put the data someplace else.
>>
>> It is hard to say if Hadoop is the right platform without more
>>information about what you are doing.  Hadoop has been used for lots of
>>embarrassingly parallel problems.  The processing is easy; the real
>>question is where your data is coming from and where the results are
>>going.  Map/Reduce is fast in part because it tries to reduce data
>>movement and move the computation to the data, not the other way round.
>>Without knowing the expected size of your data or the amount of
>>processing that it will do, it is hard to say.
>>
>> --Bobby Evans
>>
>> On 9/12/11 5:09 AM, "Parker Jones" <zo...@hotmail.com> wrote:
>>
>>
>>
>> Hello all,
>>
>> I have Hadoop up and running and an embarrassingly parallel problem but
>>can't figure out how to arrange the problem.  My apologies in advance if
>>this is obvious and I'm not getting it.
>>
>> My HPC application isn't a batch program, but runs in a continuous loop
>>(like a server) *outside* of the Hadoop machines, and it should
>>occasionally farm out a large computation to Hadoop and use the results.
>> However, all the examples I have come across interact with Hadoop via
>>files and the command line.  (Perhaps I am looking at the wrong places?)
>>
>> So,
>> * is Hadoop the right platform for this kind of problem?
>> * is it possible to use Hadoop without going through the command line
>>and writing all input data to files?
>>
>> If so, could someone point me to some examples and documentation.  I am
>>coding in C/C++ in case that is relevant, but examples in any language
>>should be helpful.
>>
>> Thanks for any suggestions,
>> Parker
>>
>>
>>
>
>


Re: Is Hadoop the right platform for my HPC application?

Posted by Robert Evans <ev...@yahoo-inc.com>.
Another option to think about is that there is a Hamster project ( MAPREDUCE-2911 <https://issues.apache.org/jira/browse/MAPREDUCE-2911> ) that will allow OpenMPI to run on a Hadoop Cluster.  It is still very preliminary and will probably not be ready until Hadoop 0.23 or 0.24.

There are other processing methodologies being developed to run on top of YARN (which is the resource scheduler put in as part of Hadoop 0.23): http://wiki.apache.org/hadoop/PoweredByYarn

So there are even more choices coming depending on your problem.

--Bobby Evans

On 9/13/11 12:54 PM, "Parker Jones" <zo...@hotmail.com> wrote:



Thank you for the explanations, Bobby.  That helps significantly.

I also read the article below which gave me a better understanding of the relative merits of MapReduce/Hadoop vs MPI.  Alberto, you might find it useful too.
http://grids.ucs.indiana.edu/ptliupages/publications/CloudsandMR.pdf

There is even a MapReduce API built on top of MPI developed at Sandia.

So many options to choose from :-)

Cheers,
Parker

> From: evans@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Mon, 12 Sep 2011 14:02:44 -0700
> Subject: Re: Is Hadoop the right platform for my HPC application?
>
> Parker,
>
> The hadoop command itself is just a shell script that sets up your classpath and some environment variables for a JVM.  Hadoop provides a Java API that you should be able to use to write your application without dealing with the command line.  That being said, there is no Map/Reduce C/C++ API.  There is libhdfs.so that will allow you to read/write HDFS files from a C/C++ program, but it actually launches a JVM behind the scenes to handle the actual requests.
>
> As for a way to avoid writing your input data into files, the data has to be distributed to the compute nodes somehow.  You could write a custom input format that does not use any input files, and then have it load the data a different way.  I believe that some people do this to load data from MySQL or some other DB for processing.  Similarly you could do something with the output format to put the data someplace else.
>
> It is hard to say if Hadoop is the right platform without more information about what you are doing.  Hadoop has been used for lots of embarrassingly parallel problems.  The processing is easy; the real question is where your data is coming from and where the results are going.  Map/Reduce is fast in part because it tries to reduce data movement and move the computation to the data, not the other way round.  Without knowing the expected size of your data or the amount of processing that it will do, it is hard to say.
>
> --Bobby Evans
>
> On 9/12/11 5:09 AM, "Parker Jones" <zo...@hotmail.com> wrote:
>
>
>
> Hello all,
>
> I have Hadoop up and running and an embarrassingly parallel problem but can't figure out how to arrange the problem.  My apologies in advance if this is obvious and I'm not getting it.
>
> My HPC application isn't a batch program, but runs in a continuous loop (like a server) *outside* of the Hadoop machines, and it should occasionally farm out a large computation to Hadoop and use the results.  However, all the examples I have come across interact with Hadoop via files and the command line.  (Perhaps I am looking at the wrong places?)
>
> So,
> * is Hadoop the right platform for this kind of problem?
> * is it possible to use Hadoop without going through the command line and writing all input data to files?
>
> If so, could someone point me to some examples and documentation.  I am coding in C/C++ in case that is relevant, but examples in any language should be helpful.
>
> Thanks for any suggestions,
> Parker
>
>
>



RE: Is Hadoop the right platform for my HPC application?

Posted by Parker Jones <zo...@hotmail.com>.
Thank you for the explanations, Bobby.  That helps significantly.

I also read the article below which gave me a better understanding of the relative merits of MapReduce/Hadoop vs MPI.  Alberto, you might find it useful too.
http://grids.ucs.indiana.edu/ptliupages/publications/CloudsandMR.pdf

There is even a MapReduce API built on top of MPI developed at Sandia.

So many options to choose from :-)

Cheers,
Parker

> From: evans@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Mon, 12 Sep 2011 14:02:44 -0700
> Subject: Re: Is Hadoop the right platform for my HPC application?
> 
> Parker,
> 
> The hadoop command itself is just a shell script that sets up your classpath and some environment variables for a JVM.  Hadoop provides a Java API that you should be able to use to write your application without dealing with the command line.  That being said, there is no Map/Reduce C/C++ API.  There is libhdfs.so that will allow you to read/write HDFS files from a C/C++ program, but it actually launches a JVM behind the scenes to handle the actual requests.
> 
> As for a way to avoid writing your input data into files, the data has to be distributed to the compute nodes somehow.  You could write a custom input format that does not use any input files, and then have it load the data a different way.  I believe that some people do this to load data from MySQL or some other DB for processing.  Similarly you could do something with the output format to put the data someplace else.
> 
> It is hard to say if Hadoop is the right platform without more information about what you are doing.  Hadoop has been used for lots of embarrassingly parallel problems.  The processing is easy; the real question is where your data is coming from and where the results are going.  Map/Reduce is fast in part because it tries to reduce data movement and move the computation to the data, not the other way round.  Without knowing the expected size of your data or the amount of processing that it will do, it is hard to say.
> 
> --Bobby Evans
> 
> On 9/12/11 5:09 AM, "Parker Jones" <zo...@hotmail.com> wrote:
> 
> 
> 
> Hello all,
> 
> I have Hadoop up and running and an embarrassingly parallel problem but can't figure out how to arrange the problem.  My apologies in advance if this is obvious and I'm not getting it.
> 
> My HPC application isn't a batch program, but runs in a continuous loop (like a server) *outside* of the Hadoop machines, and it should occasionally farm out a large computation to Hadoop and use the results.  However, all the examples I have come across interact with Hadoop via files and the command line.  (Perhaps I am looking at the wrong places?)
> 
> So,
> * is Hadoop the right platform for this kind of problem?
> * is it possible to use Hadoop without going through the command line and writing all input data to files?
> 
> If so, could someone point me to some examples and documentation.  I am coding in C/C++ in case that is relevant, but examples in any language should be helpful.
> 
> Thanks for any suggestions,
> Parker
> 
> 
> 

Re: Is Hadoop the right platform for my HPC application?

Posted by Robert Evans <ev...@yahoo-inc.com>.
Parker,

The hadoop command itself is just a shell script that sets up your classpath and some environment variables for a JVM.  Hadoop provides a Java API that you should be able to use to write your application without dealing with the command line.  That being said, there is no Map/Reduce C/C++ API.  There is libhdfs.so that will allow you to read/write HDFS files from a C/C++ program, but it actually launches a JVM behind the scenes to handle the actual requests.
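
For what it's worth, here is a rough sketch of what driving a job from the Java API (instead of the hadoop script) can look like.  The identity Mapper/Reducer, the job name, and the HDFS paths are placeholders, and the exact Job constructor varies a little between Hadoop versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitFromCode {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml etc. from the classpath
        Job job = new Job(conf, "farmed-out-step");   // Job.getInstance(conf, ...) in later releases
        job.setJarByClass(SubmitFromCode.class);
        job.setMapperClass(Mapper.class);             // identity map; substitute your own Mapper subclass
        job.setReducerClass(Reducer.class);           // identity reduce; substitute your own Reducer subclass
        job.setOutputKeyClass(LongWritable.class);    // default TextInputFormat keys/values pass straight through
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/user/parker/input"));     // placeholder HDFS paths
        FileOutputFormat.setOutputPath(job, new Path("/user/parker/output"));
        boolean ok = job.waitForCompletion(true);     // submits the job and blocks until it finishes
        System.exit(ok ? 0 : 1);
      }
    }

A driver like this can run anywhere the Hadoop jars and configuration are on the classpath, so a long-running server process could invoke the same API in-process whenever it wants to farm a computation out.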

As for a way to avoid writing your input data into files, the data has to be distributed to the compute nodes somehow.  You could write a custom input format that does not use any input files, and then have it load the data a different way.  I believe that some people do this to load data from MySQL or some other DB for processing.  Similarly you could do something with the output format to put the data someplace else.
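
To make the custom input format idea concrete, here is a minimal sketch of one that fabricates its splits in memory.  The class name, the "generated.num.tasks" property, and the one-record-per-split layout are all invented for illustration; nothing here is a built-in part of Hadoop:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    /** Hands each map task one synthetic record (a task id); no input files are read. */
    public class GeneratedInputFormat extends InputFormat<LongWritable, NullWritable> {

      public static class TaskSplit extends InputSplit implements Writable {
        private long taskId;
        public TaskSplit() { }                                    // no-arg constructor needed for deserialization
        public TaskSplit(long taskId) { this.taskId = taskId; }
        public long getTaskId() { return taskId; }
        @Override public long getLength() { return 1; }
        @Override public String[] getLocations() { return new String[0]; }   // no locality hints
        @Override public void write(DataOutput out) throws IOException { out.writeLong(taskId); }
        @Override public void readFields(DataInput in) throws IOException { taskId = in.readLong(); }
      }

      @Override
      public List<InputSplit> getSplits(JobContext context) {
        // One split per desired map task, generated in memory instead of from file blocks.
        int numTasks = context.getConfiguration().getInt("generated.num.tasks", 4);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (long i = 0; i < numTasks; i++) {
          splits.add(new TaskSplit(i));
        }
        return splits;
      }

      @Override
      public RecordReader<LongWritable, NullWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, NullWritable>() {
          private long taskId;
          private boolean consumed = false;
          @Override public void initialize(InputSplit s, TaskAttemptContext c) { taskId = ((TaskSplit) s).getTaskId(); }
          @Override public boolean nextKeyValue() { if (consumed) return false; consumed = true; return true; }
          @Override public LongWritable getCurrentKey() { return new LongWritable(taskId); }
          @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
          @Override public float getProgress() { return consumed ? 1.0f : 0.0f; }
          @Override public void close() { }
        };
      }
    }

The job would select it with job.setInputFormatClass(GeneratedInputFormat.class), and each map task would treat its key as a work-unit id, fetch whatever parameters it needs (from a database, a service, or the job configuration), and do the computation.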

It is hard to say if Hadoop is the right platform without more information about what you are doing.  Hadoop has been used for lots of embarrassingly parallel problems.  The processing is easy; the real question is where your data is coming from and where the results are going.  Map/Reduce is fast in part because it tries to reduce data movement and move the computation to the data, not the other way round.  Without knowing the expected size of your data or the amount of processing that it will do, it is hard to say.

--Bobby Evans

On 9/12/11 5:09 AM, "Parker Jones" <zo...@hotmail.com> wrote:



Hello all,

I have Hadoop up and running and an embarrassingly parallel problem but can't figure out how to arrange the problem.  My apologies in advance if this is obvious and I'm not getting it.

My HPC application isn't a batch program, but runs in a continuous loop (like a server) *outside* of the Hadoop machines, and it should occasionally farm out a large computation to Hadoop and use the results.  However, all the examples I have come across interact with Hadoop via files and the command line.  (Perhaps I am looking at the wrong places?)

So,
* is Hadoop the right platform for this kind of problem?
* is it possible to use Hadoop without going through the command line and writing all input data to files?

If so, could someone point me to some examples and documentation.  I am coding in C/C++ in case that is relevant, but examples in any language should be helpful.

Thanks for any suggestions,
Parker