Posted to common-user@hadoop.apache.org by Merto Mertek <ma...@gmail.com> on 2012/05/17 21:11:52 UTC

Hadoop-on-demand and torque

If I understand it right, HOD is intended mainly for integrating existing HPC
clusters with hadoop and for testing purposes.

I cannot find what the role of Torque is here (just initial node
allocation?) or which scheduler HOD uses by default. Is it the
scheduler from the hadoop distribution?

The docs mention the Maui scheduler, but if there were an
integration with hadoop, there would presumably be some documentation on it.

thanks..

Re: Hadoop-on-demand and torque

Posted by Ralph Castain <rh...@open-mpi.org>.
OMPI is a performance-focused community, so we always compare things :-)

We have some initial data against YARN, but not Mesos. Someone has been looking at porting OMPI to Mesos, but it turns out that Mesos isn't a particularly friendly MPI platform (a couple of us have been trying to provide advice on how to overcome the obstacles). I'm not sure what his plans are for completing that work - we haven't heard from him for a few weeks.

In terms of YARN, the OMPI-based "HOD" solution launches an MPI program about 1000x faster, and runs about 10x faster. The launch time difference grows with scale, as the YARN MPI solution wires up in quadratic time while the OMPI solution wires up in logarithmic time. The execution time difference depends upon the application (IO-bound vs compute-bound), but largely stems from a difference in available data transports.
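
The scaling claim is easy to see with a toy model (my own illustration, not OMPI code): all-to-all wireup needs a quadratic number of connections, while a doubling/tree exchange needs only a logarithmic number of rounds.

```python
import math

def all_to_all_connections(n):
    """Total pairwise connections if each of n processes contacts every other (quadratic)."""
    return n * (n - 1) // 2

def tree_wireup_rounds(n):
    """Rounds of a doubling/tree exchange: each round doubles the connected set (logarithmic)."""
    return math.ceil(math.log2(n))

# The gap widens rapidly with cluster size.
for n in (8, 64, 512, 4096):
    print(n, all_to_all_connections(n), tree_wireup_rounds(n))
```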

As a practical example, running a simple MPI "ring" program takes about 90 seconds on an 8-node system using YARN, and about 35 milliseconds using OMPI under SLURM. An MR word-count program that looked at 1000 files took about 6 minutes using YARN, and about 11 seconds using OMPI's MR+.

Non-MPI programs also tend to launch faster due to the difference in how YARN handles launch vs other RMs. Again, a non-MPI "hello" on an 8-node system can still take 20 seconds to run under YARN, depending on the heartbeat setting, versus about 25 milliseconds under SLURM. You don't get the wireup impact, of course, so the time difference remains fairly consistent with scale.

This is in line with what others have reported, so I think the results, although preliminary, are credible.

We'll have to wait to see about Mesos.

On May 21, 2012, at 8:45 AM, Charles Earl wrote:

> Ralph,
> Do you have any YARN or Mesos performance comparison against HOD? I suppose since it was customer requirement you might not have explored it. MPI support seems to be active issue for Mesos now.
> Charles
> 
> On May 21, 2012, at 10:36 AM, Ralph Castain <rh...@open-mpi.org> wrote:
> 
>> Not quite yet, though we are working on it (some descriptive stuff is around, but needs to be consolidated). Several of us started working together a couple of months ago to support the MapReduce programming model on HPC clusters using Open MPI as the platform. In working with our customers and OMPI's wide community of users, we found that people were interested in this capability, wanted to integrate MPI support into their MapReduce jobs, and didn't want to migrate their clusters to YARN for various reasons.
>> 
>> We have released initial versions of two new tools in the OMPI developer's trunk, scheduled for inclusion in the upcoming 1.7.0 release:
>> 
>> 1. "mr+" - executes the MapReduce programming paradigm. Currently, we only support streaming data, though we will extend that support shortly. All HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.) are supported. Both mappers and reducers can utilize MPI (independently or in combination) if they so choose. Mappers and reducers can be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java (note: OMPI now comes with Java MPI bindings).
>> 
>> 2. "hdfsalloc" - takes a list of files and obtains a resource allocation for the nodes upon which those files reside. SLURM and Moab/Maui are currently supported, with Gridengine coming soon.
>> 
>> There will be a public announcement of this in the near future, and we expect to integrate the Hadoop 1.0 and Hadoop 2.0 MR classes over the next couple of months. By the end of this summer, we should have a full-featured public release.
>> 
>> 
>> On May 20, 2012, at 2:10 PM, Brian Bockelman wrote:
>> 
>>> Hi Ralph,
>>> 
>>> I admit - I've only been half-following the OpenMPI progress.  Do you have a technical write-up of what has been done?
>>> 
>>> Thanks,
>>> 
>>> Brian
>>> 
>>> On May 20, 2012, at 9:31 AM, Ralph Castain wrote:
>>> 
>>>> FWIW: Open MPI now has an initial cut at "MR+" that runs map-reduce under any HPC environment. We don't have the Java integration yet to support the Hadoop MR class, but you can write a mapper/reducer and execute that programming paradigm. We plan to integrate the Hadoop MR class soon.
>>>> 
>>>> If you already have that integration, we'd love to help port it over. We already have the MPI support completed, so any mapper/reducer could use it.
>>>> 
>>>> 
>>>> On May 20, 2012, at 7:12 AM, Pierre Antoine DuBoDeNa wrote:
>>>> 
>>>>> We run similar infrastructure in a university project.. we plan to install
>>>>> hadoop.. and looking for "alternatives" based on hadoop in case the pure
>>>>> hadoop is not working as expected.
>>>>> 
>>>>> Keep us updated on the code release.
>>>>> 
>>>>> Best,
>>>>> PA
>>>>> 
>>>>> 2012/5/20 Stijn De Weirdt <st...@ugent.be>
>>>>> 
>>>>>> hi all,
>>>>>> 
>>>>>> i'm part of an HPC group at a university, and we have some users who are
>>>>>> interested in Hadoop and want to see if it can be useful in their research. we
>>>>>> also have researchers who are using hadoop already on their own
>>>>>> infrastructure, but that is not enough reason for us to start with
>>>>>> dedicated Hadoop infrastructure (we are now only running torque-based
>>>>>> clusters with and without shared storage; setting up and properly
>>>>>> maintaining Hadoop infrastructure requires quite some understanding of new
>>>>>> software)
>>>>>> 
>>>>>> to be able to support these needs we wanted to do just this: use current
>>>>>> HPC infrastructure to make private hadoop clusters so people can do some
>>>>>> work. if we attract enough interest, we will probably setup dedicated
>>>>>> infrastructure, but by that time we (the admins) will also have a better
>>>>>> understanding of what is required.
>>>>>> 
>>>>>> so we used to look at HOD for testing/running hadoop on existing
>>>>>> infrastructure (never really looked at myhadoop though).
>>>>>> but (imho) the current HOD code base is not in such a good state. we did
>>>>>> some work to get it working and added some features, to come to the
>>>>>> conclusion that it was not sufficient (and not maintainable).
>>>>>> 
>>>>>> so we wrote something from scratch with same functionality as HOD, and
>>>>>> much more (eg HBase is now possible, with or without MR1; some default
>>>>>> tuning; easy to add support for yarn instead of MR1).
>>>>>> it has some support for torque, but my laptop is also sufficient. (the
>>>>>> torque support is a wrapper to submit the job)
>>>>>> we gave a workshop on hadoop using it (25 people, and each with their own
>>>>>> 5 node hadoop cluster) and it went rather well.
>>>>>> 
>>>>>> it's not in a public repo yet, but we could do that. if interested, let me
>>>>>> know, and i see what can be done. (releasing the code is on our todo list,
>>>>>> but if there is some demand, we can do it sooner)
>>>>>> 
>>>>>> 
>>>>>> stijn
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 05/18/2012 05:07 PM, Pierre Antoine DuBoDeNa wrote:
>>>>>> 
>>>>>>> I am also interested to learn about myHadoop as I use a shared storage
>>>>>>> system and everything runs on VMs and not actual dedicated servers.
>>>>>>> 
>>>>>>> in an amazon-EC2-like environment where you just have VMs and huge central
>>>>>>> storage, is it helpful to use hadoop to distribute jobs and maybe
>>>>>>> parallelize algorithms, or is it better to go with other technologies?
>>>>>>> 
>>>>>>> 2012/5/18 Manu S<ma...@gmail.com>
>>>>>>> 
>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> Guess HOD could be useful for an existing HPC cluster with a Torque
>>>>>>>> scheduler that needs to run map-reduce jobs.
>>>>>>>> 
>>>>>>>> Also read about *myHadoop - Hadoop on demand on traditional HPC
>>>>>>>> resources*, which supports many HPC schedulers like SGE, PBS, etc. to
>>>>>>>> bridge the shared architecture (HPC) and shared-nothing
>>>>>>>> architecture (Hadoop).
>>>>>>>> 
>>>>>>>> Any real use case scenarios for integrating hadoop map/reduce in existing
>>>>>>>> HPC cluster and what are the advantages of using hadoop features in HPC
>>>>>>>> cluster?
>>>>>>>> 
>>>>>>>> Appreciate your comments on the same.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Manu S
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, May 18, 2012 at 12:41 AM, Merto Mertek<ma...@gmail.com>
>>>>>>>> wrote:


Re: Hadoop-on-demand and torque

Posted by Charles Earl <ch...@gmail.com>.
Ralph,
Do you have any YARN or Mesos performance comparisons against HOD? I suppose since it was a customer requirement you might not have explored it. MPI support seems to be an active issue for Mesos now.
Charles

On May 21, 2012, at 10:36 AM, Ralph Castain <rh...@open-mpi.org> wrote:


Re: Hadoop-on-demand and torque

Posted by Ralph Castain <rh...@open-mpi.org>.
Not quite yet, though we are working on it (some descriptive stuff is around, but needs to be consolidated). Several of us started working together a couple of months ago to support the MapReduce programming model on HPC clusters using Open MPI as the platform. In working with our customers and OMPI's wide community of users, we found that people were interested in this capability, wanted to integrate MPI support into their MapReduce jobs, and didn't want to migrate their clusters to YARN for various reasons.

We have released initial versions of two new tools in the OMPI developer's trunk, scheduled for inclusion in the upcoming 1.7.0 release:

1. "mr+" - executes the MapReduce programming paradigm. Currently, we only support streaming data, though we will extend that support shortly. All HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.) are supported. Both mappers and reducers can utilize MPI (independently or in combination) if they so choose. Mappers and reducers can be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java (note: OMPI now comes with Java MPI bindings).

2. "hdfsalloc" - takes a list of files and obtains a resource allocation for the nodes upon which those files reside. SLURM and Moab/Maui are currently supported, with Gridengine coming soon.
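
For readers unfamiliar with the streaming model mentioned in (1): streaming mappers and reducers are generally plain executables that read records on stdin and emit key/value lines on stdout, which is why they can be written in any language. A minimal word-count pair in Python (a generic sketch of the streaming convention, not mr+ code; the function names are mine):

```python
from collections import Counter

def map_stream(lines):
    """Mapper: emit one tab-separated 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_stream(lines):
    """Reducer: sum the counts per word."""
    counts = Counter()
    for line in lines:
        word, count = line.rsplit("\t", 1)
        counts[word] += int(count)
    return dict(counts)

if __name__ == "__main__":
    # In a real streaming job these would read sys.stdin and print to stdout.
    mapped = list(map_stream(["hello world", "hello mpi"]))
    print(reduce_stream(mapped))
```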

There will be a public announcement of this in the near future, and we expect to integrate the Hadoop 1.0 and Hadoop 2.0 MR classes over the next couple of months. By the end of this summer, we should have a full-featured public release.
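
The idea behind hdfsalloc, described in (2) above, can be sketched as follows (a hypothetical model of the locality logic, not the actual tool; the block map and the nodelist format are my assumptions):

```python
def hosts_for_files(block_locations, files):
    """Union of hosts holding at least one block of any requested file.

    block_locations: dict mapping file name -> list of hosts storing its
    blocks. In reality this mapping would come from the HDFS namenode.
    """
    hosts = set()
    for f in files:
        hosts.update(block_locations.get(f, []))
    return sorted(hosts)

# Hypothetical block map; a real tool would query HDFS for this.
blockmap = {
    "part-00000": ["node01", "node02"],
    "part-00001": ["node02", "node03"],
}
nodelist = hosts_for_files(blockmap, ["part-00000", "part-00001"])
# The resulting host list could feed a host-selection request to the
# resource manager (e.g. a nodelist argument to a SLURM allocation).
print(",".join(nodelist))
```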


On May 20, 2012, at 2:10 PM, Brian Bockelman wrote:



Re: Hadoop-on-demand and torque

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hi Ralph,

I admit - I've only been half-following the OpenMPI progress.  Do you have a technical write-up of what has been done?

Thanks,

Brian

On May 20, 2012, at 9:31 AM, Ralph Castain wrote:



Re: Hadoop-on-demand and torque

Posted by Ralph Castain <rh...@open-mpi.org>.
FWIW: Open MPI now has an initial cut at "MR+" that runs map-reduce under any HPC environment. We don't have the Java integration yet to support the Hadoop MR class, but you can write a mapper/reducer and execute that programming paradigm. We plan to integrate the Hadoop MR class soon.

If you already have that integration, we'd love to help port it over. We already have the MPI support completed, so any mapper/reducer could use it.


On May 20, 2012, at 7:12 AM, Pierre Antoine DuBoDeNa wrote:



Re: Hadoop-on-demand and torque

Posted by Pierre Antoine DuBoDeNa <pa...@gmail.com>.
We run similar infrastructure in a university project. We plan to install
hadoop, and we are looking for hadoop-based "alternatives" in case pure
hadoop does not work as expected.

Keep us updated on the code release.

Best,
PA

2012/5/20 Stijn De Weirdt <st...@ugent.be>


Re: Hadoop-on-demand and torque

Posted by Stijn De Weirdt <st...@ugent.be>.
hi all,

i'm part of an HPC group at a university. we have some users who are
interested in Hadoop and want to see if it can be useful in their research,
and we also have researchers who already use hadoop on their own
infrastructure, but that is not enough reason for us to start with
dedicated Hadoop infrastructure (we currently only run torque-based
clusters, with and without shared storage; setting up and properly
maintaining Hadoop infrastructure requires quite some understanding of
new software).

to be able to support these needs we wanted to do just this: use the
current HPC infrastructure to create private hadoop clusters so people can
do some work. if we attract enough interest, we will probably set up
dedicated infrastructure, but by that time we (the admins) will also have
a better understanding of what is required.

so we used to look at HOD for testing/running hadoop on existing
infrastructure (we never really looked at myhadoop, though).
but (imho) the current HOD code base is not in such a good state. we did
some work to get it working and added some features, only to conclude
that it was not sufficient (and not maintainable).

so we wrote something from scratch with the same functionality as HOD,
and much more (eg HBase is now possible, with or without MR1; some
default tuning; it is easy to add support for yarn instead of MR1).
it has some support for torque, but my laptop is also sufficient (the
torque support is a wrapper that submits the job).
we gave a workshop on hadoop using it (25 people, each with their own
5-node hadoop cluster) and it went rather well.

it's not in a public repo yet, but we could do that. if you are
interested, let me know, and i'll see what can be done. (releasing the
code is on our todo list, but if there is some demand, we can do it
sooner)


stijn
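[Editor's note: the core of a torque wrapper like the one described above is partitioning the batch allocation into Hadoop roles. The following is a hypothetical minimal sketch, not Stijn's actual code: Torque's $PBS_NODEFILE lists each host once per allocated core, so the wrapper must dedupe the list and then split it into a master and workers.]

```python
# Hypothetical sketch of the node-partitioning step a Torque wrapper
# (HOD-, myHadoop-, or similar) has to perform. Torque's $PBS_NODEFILE
# lists each host once per allocated core, so we dedupe while
# preserving order, then split master vs. workers.

def partition_nodes(nodefile_lines):
    """Return (master, workers) from PBS_NODEFILE contents."""
    seen, hosts = set(), []
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    if not hosts:
        raise ValueError("empty node file")
    # First unique host runs the NameNode/JobTracker; the remaining
    # hosts run DataNodes/TaskTrackers.
    return hosts[0], hosts[1:]

if __name__ == "__main__":
    # Two cores per node, three nodes allocated.
    sample = ["node01", "node01", "node02", "node02", "node03", "node03"]
    master, workers = partition_nodes(sample)
    print(master)   # node01
    print(workers)  # ['node02', 'node03']
```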


On 05/18/2012 05:07 PM, Pierre Antoine DuBoDeNa wrote:


Re: Hadoop-on-demand and torque

Posted by Pierre Antoine DuBoDeNa <pa...@gmail.com>.
I am also interested in learning about myHadoop, as I use a shared storage
system and everything runs on VMs rather than on dedicated servers.

In an Amazon EC2-like environment, where you only have VMs and a large
central storage, is it helpful to use hadoop to distribute jobs and perhaps
parallelize algorithms, or is it better to go with other technologies?

2012/5/18 Manu S <ma...@gmail.com>


Re: Hadoop-on-demand and torque

Posted by Manu S <ma...@gmail.com>.
Hi All,

I guess HOD could be useful for an existing HPC cluster with the Torque
scheduler that needs to run map-reduce jobs.

I also read about *myHadoop - Hadoop on Demand on traditional HPC
resources*, which supports many HPC schedulers like SGE, PBS, etc. to
overcome the mismatch between the shared architecture (HPC) and the
shared-nothing architecture (Hadoop).

Are there any real use case scenarios for integrating hadoop map/reduce
into an existing HPC cluster, and what are the advantages of using hadoop
features in an HPC cluster?

Appreciate your comments on the same.

Thanks,
Manu S
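[Editor's note: as a rough illustration of what myHadoop-style integration does under the hood, the sketch below renders a throwaway per-job hadoop-site.xml pointing HDFS and the JobTracker at the job's head node and node-local scratch space. This is a hypothetical sketch, not myHadoop's actual code; the paths and ports are made-up placeholders, though the property names are standard Hadoop 1.x keys.]

```python
# Hypothetical sketch of the per-job configuration step that
# HOD/myHadoop-style tools perform: render a minimal hadoop-site.xml
# whose HDFS and JobTracker endpoints live only for the lifetime of
# the batch job. Property names are standard Hadoop 1.x keys; the
# scratch path and port numbers are illustrative placeholders.

def render_site_xml(namenode, scratch_dir, port=9000):
    """Return hadoop-site.xml contents for a transient cluster."""
    props = {
        "fs.default.name": "hdfs://%s:%d" % (namenode, port),
        "dfs.name.dir": "%s/dfs/name" % scratch_dir,
        "dfs.data.dir": "%s/dfs/data" % scratch_dir,
        "mapred.job.tracker": "%s:%d" % (namenode, port + 1),
    }
    body = "\n".join(
        "  <property><name>%s</name><value>%s</value></property>" % kv
        for kv in sorted(props.items())
    )
    return "<configuration>\n%s\n</configuration>" % body

if __name__ == "__main__":
    # e.g. namenode picked from the job's node file, scratch on local disk
    print(render_site_xml("node01", "/tmp/job123"))
```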



On Fri, May 18, 2012 at 12:41 AM, Merto Mertek <ma...@gmail.com> wrote:
