You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@airavata.apache.org by Danushka Menikkumbura <da...@gmail.com> on 2013/02/25 22:48:01 UTC

Airavata/Hadoop Integration

Hi Devs,

I am looking into extending Big Data capabilities of Airavata as my M.Sc.
research work. I have identified certain possibilities and am going to
start with integrating Apache Hadoop (and Hadoop-like frameworks) with
Airavata.

According to what I have understood, the best approach would be to have a
new GFacProvider for Hadoop that takes care of handing Hadoop jobs. We can
have a new parameter in the ApplicationContext (say TargetApplication) to
define the target application type and resolve correct provider in the GFac
Scheduler based on that. I see that having this capability in the Scheduler
class is already a TODO. I have been able to do these changes locally and
invoke a simple Hadoop job using GFac. Thus, I can assure that this
approach is viable except for any other implication that I am missing.

I think we can store Hadoop job definitions in the Airavata Registry where
each definition would essentially include a unique identifier and other
attributes like mapper, reducer, sorter, formaters, etc that can be defined
using XBaya. Information about these building blocks could be loaded from
XML meta data files (of a known format) included in jar files. It should
also be possible to compose Hadoop job "chains" using XBaya. So, what we
specify in the application context would be the target application type
(say Hadoop), job/chain id, input file location and the output file
location. In addition I am thinking of having job monitoring support based
on constructs provided by the Hadoop API (that I have already looked into)
and data querying based on Apache Hive/Pig.

Furthermore, apart from Hadoop there are two other similar frameworks that
look quite promising.

1. Sector/Sphere

Sector/Sphere [1] is an open source software framework for high-performance
distributed data storage and processing. It is comparable with Apache
HDFS/Hadoop. Sector is a distributed file system and Sphere is the
programming framework that supports massive in-storage parallel data
processing on data stored in Sector. The key motive is that Sector/Sphere
is claimed to be about 2 - 4 times faster than Hadoop.

2. Hyracks

Hyracks [2] is another framework for data-intensive computing that is
roughly in the same space as Apache Hadoop. It has support for composing
and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].

I am yet to look into the API's of these two frameworks but they should
ideally work with the same GFac implementation that I have proposed for
Hadoop.

I would strongly appreciate your feedback on this approach. Also pros and
cons of using Sector/Sphere or Hyracks if you have any experience with them
already.

[1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
Many-Task Computing on Grids and Supercomputers, 2009, p. 3.

[2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks: A
flexible and extensible foundation for data-intensive computing,” in Data
Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011, pp.
1151–1162.

[3] http://asterix.ics.uci.edu/

Thanks,
Danushka

Re: Airavata/Hadoop Integration

Posted by Lahiru Gunathilake <gl...@gmail.com>.

Hi Danushka,

I'm on it right now, will fnish in couple of hours.

Lahiru

On Tue, Feb 26, 2013 at 10:23 AM, Suresh Marru <sm...@apache.org> wrote:

> On Feb 26, 2013, at 7:04 AM, Lahiru Gunathilake <gl...@gmail.com> wrote:
>
> > Hi Danushka,
> >
> > I think we already have a provider to handle Hadoop jobs which uses
> Apache
> > Whirr to setup the Hadoop cluster and submit the job.
>
> I think this Lahiru is referring to the GSOC projects -
> https://code.google.com/a/apache-extras.org/p/airavata-gsoc-sandbox/
>
> Suresh
>
> >
> > We still didn't port this code to Airavata, once I do will send an email
> to
> > the list.
> >
> > Regards
> > Lahiru
> >
> > On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <
> > danushka.menikkumbura@gmail.com> wrote:
> >
> >> Hi Devs,
> >>
> >> I am looking into extending Big Data capabilities of Airavata as my
> M.Sc.
> >> research work. I have identified certain possibilities and am going to
> >> start with integrating Apache Hadoop (and Hadoop-like frameworks) with
> >> Airavata.
> >>
> >> According to what I have understood, the best approach would be to have
> a
> >> new GFacProvider for Hadoop that takes care of handing Hadoop jobs. We
> can
> >> have a new parameter in the ApplicationContext (say TargetApplication)
> to
> >> define the target application type and resolve correct provider in the
> GFac
> >> Scheduler based on that. I see that having this capability in the
> Scheduler
> >> class is already a TODO. I have been able to do these changes locally
> and
> >> invoke a simple Hadoop job using GFac. Thus, I can assure that this
> >> approach is viable except for any other implication that I am missing.
> >>
> >> I think we can store Hadoop job definitions in the Airavata Registry
> where
> >> each definition would essentially include a unique identifier and other
> >> attributes like mapper, reducer, sorter, formaters, etc that can be
> defined
> >> using XBaya. Information about these building blocks could be loaded
> from
> >> XML meta data files (of a known format) included in jar files. It should
> >> also be possible to compose Hadoop job "chains" using XBaya. So, what we
> >> specify in the application context would be the target application type
> >> (say Hadoop), job/chain id, input file location and the output file
> >> location. In addition I am thinking of having job monitoring support
> based
> >> on constructs provided by the Hadoop API (that I have already looked
> into)
> >> and data querying based on Apache Hive/Pig.
> >>
> >> Furthermore, apart from Hadoop there are two other similar frameworks
> that
> >> look quite promising.
> >>
> >> 1. Sector/Sphere
> >>
> >> Sector/Sphere [1] is an open source software framework for
> high-performance
> >> distributed data storage and processing. It is comparable with Apache
> >> HDFS/Hadoop. Sector is a distributed file system and Sphere is the
> >> programming framework that supports massive in-storage parallel data
> >> processing on data stored in Sector. The key motive is that
> Sector/Sphere
> >> is claimed to be about 2 - 4 times faster than Hadoop.
> >>
> >> 2. Hyracks
> >>
> >> Hyracks [2] is another framework for data-intensive computing that is
> >> roughly in the same space as Apache Hadoop. It has support for composing
> >> and executing native Hyracks jobs plus running Hadoop jobs in the
> Hyracks
> >> runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].
> >>
> >> I am yet to look into the API's of these two frameworks but they should
> >> ideally work with the same GFac implementation that I have proposed for
> >> Hadoop.
> >>
> >> I would strongly appreciate your feedback on this approach. Also pros
> and
> >> cons of using Sector/Sphere or Hyracks if you have any experience with
> them
> >> already.
> >>
> >> [1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
> >> benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
> >> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
> >>
> >> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks:
> A
> >> flexible and extensible foundation for data-intensive computing,” in
> Data
> >> Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011,
> pp.
> >> 1151–1162.
> >>
> >> [3] http://asterix.ics.uci.edu/
> >>
> >> Thanks,
> >> Danushka
> >>
> >
> >
> >
> > --
> > System Analyst Programmer
> > PTI Lab
> > Indiana University
>
>


-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Airavata/Hadoop Integration

Posted by Suresh Marru <sm...@apache.org>.

On Feb 26, 2013, at 7:04 AM, Lahiru Gunathilake <gl...@gmail.com> wrote:

> Hi Danushka,
> 
> I think we already have a provider to handle Hadoop jobs which uses Apache
> Whirr to setup the Hadoop cluster and submit the job.

I think this Lahiru is referring to the GSOC projects - https://code.google.com/a/apache-extras.org/p/airavata-gsoc-sandbox/

Suresh

> 
> We still didn't port this code to Airavata, once I do will send an email to
> the list.
> 
> Regards
> Lahiru
> 
> On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <
> danushka.menikkumbura@gmail.com> wrote:
> 
>> Hi Devs,
>> 
>> I am looking into extending Big Data capabilities of Airavata as my M.Sc.
>> research work. I have identified certain possibilities and am going to
>> start with integrating Apache Hadoop (and Hadoop-like frameworks) with
>> Airavata.
>> 
>> According to what I have understood, the best approach would be to have a
>> new GFacProvider for Hadoop that takes care of handing Hadoop jobs. We can
>> have a new parameter in the ApplicationContext (say TargetApplication) to
>> define the target application type and resolve correct provider in the GFac
>> Scheduler based on that. I see that having this capability in the Scheduler
>> class is already a TODO. I have been able to do these changes locally and
>> invoke a simple Hadoop job using GFac. Thus, I can assure that this
>> approach is viable except for any other implication that I am missing.
>> 
>> I think we can store Hadoop job definitions in the Airavata Registry where
>> each definition would essentially include a unique identifier and other
>> attributes like mapper, reducer, sorter, formaters, etc that can be defined
>> using XBaya. Information about these building blocks could be loaded from
>> XML meta data files (of a known format) included in jar files. It should
>> also be possible to compose Hadoop job "chains" using XBaya. So, what we
>> specify in the application context would be the target application type
>> (say Hadoop), job/chain id, input file location and the output file
>> location. In addition I am thinking of having job monitoring support based
>> on constructs provided by the Hadoop API (that I have already looked into)
>> and data querying based on Apache Hive/Pig.
>> 
>> Furthermore, apart from Hadoop there are two other similar frameworks that
>> look quite promising.
>> 
>> 1. Sector/Sphere
>> 
>> Sector/Sphere [1] is an open source software framework for high-performance
>> distributed data storage and processing. It is comparable with Apache
>> HDFS/Hadoop. Sector is a distributed file system and Sphere is the
>> programming framework that supports massive in-storage parallel data
>> processing on data stored in Sector. The key motive is that Sector/Sphere
>> is claimed to be about 2 - 4 times faster than Hadoop.
>> 
>> 2. Hyracks
>> 
>> Hyracks [2] is another framework for data-intensive computing that is
>> roughly in the same space as Apache Hadoop. It has support for composing
>> and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
>> runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].
>> 
>> I am yet to look into the API's of these two frameworks but they should
>> ideally work with the same GFac implementation that I have proposed for
>> Hadoop.
>> 
>> I would strongly appreciate your feedback on this approach. Also pros and
>> cons of using Sector/Sphere or Hyracks if you have any experience with them
>> already.
>> 
>> [1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
>> benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
>> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
>> 
>> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks: A
>> flexible and extensible foundation for data-intensive computing,” in Data
>> Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011, pp.
>> 1151–1162.
>> 
>> [3] http://asterix.ics.uci.edu/
>> 
>> Thanks,
>> Danushka
>> 
> 
> 
> 
> -- 
> System Analyst Programmer
> PTI Lab
> Indiana University