Posted to dev@airavata.apache.org by Danushka Menikkumbura <da...@gmail.com> on 2013/02/25 22:48:01 UTC

Airavata/Hadoop Integration

Hi Devs,

I am looking into extending Big Data capabilities of Airavata as my M.Sc.
research work. I have identified certain possibilities and am going to
start with integrating Apache Hadoop (and Hadoop-like frameworks) with
Airavata.

From what I have understood, the best approach would be to have a new
GFacProvider for Hadoop that takes care of handling Hadoop jobs. We can
have a new parameter in the ApplicationContext (say TargetApplication) to
define the target application type and resolve the correct provider in the
GFac Scheduler based on that. I see that adding this capability to the
Scheduler class is already a TODO. I have made these changes locally and
invoked a simple Hadoop job through GFac, so the approach appears viable
unless there is some implication I am missing.
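
Roughly what I have in mind, as a purely illustrative sketch (none of this is
the actual GFac code; resolveProvider, getTargetApplication, HadoopProvider
and LocalProvider are placeholders for whatever the real Scheduler and
context interfaces expose):

// Illustrative sketch only -- not the real GFac Scheduler. All names here
// (getTargetApplication, HadoopProvider, LocalProvider) are placeholders.
public class Scheduler {

    public static GFacProvider resolveProvider(JobExecutionContext context) {
        // Assumed accessor for the proposed TargetApplication parameter,
        // e.g. "HADOOP", carried in the application context.
        String target = context.getApplicationContext().getTargetApplication();

        if ("HADOOP".equalsIgnoreCase(target)) {
            return new HadoopProvider();   // the new provider proposed above
        }
        // Fall back to whatever resolution logic the Scheduler already has
        // (this is where the existing TODO would be filled in).
        return new LocalProvider();
    }
}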

I think we can store Hadoop job definitions in the Airavata Registry, where
each definition would essentially include a unique identifier and other
attributes like the mapper, reducer, sorter, formatters, etc. that can be
defined using XBaya. Information about these building blocks could be loaded
from XML metadata files (of a known format) included in the jar files. It
should also be possible to compose Hadoop job "chains" using XBaya. What we
specify in the application context would then be the target application type
(say Hadoop), the job/chain id, the input file location and the output file
location. In addition, I am thinking of adding job monitoring support based
on constructs provided by the Hadoop API (which I have already looked into)
and data querying based on Apache Hive/Pig.
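
As a minimal sketch of how a registry definition could be turned into a
running, monitorable job (JobDefinition is a hypothetical stand-in for
whatever the Registry would return; the org.apache.hadoop.mapreduce calls
are the standard Hadoop submission and progress API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical stand-in for a job definition loaded from the Airavata Registry.
interface JobDefinition {
    String getName();
    Class<? extends Mapper> getMapperClass();
    Class<? extends Reducer> getReducerClass();
    Class<?> getOutputKeyClass();
    Class<?> getOutputValueClass();
}

public class HadoopJobRunner {

    // Build a Job from the registry definition plus the input/output locations
    // passed in through the application context, and submit it asynchronously.
    public Job submit(JobDefinition def, String input, String output)
            throws Exception {
        Job job = new Job(new Configuration(), def.getName());
        job.setJarByClass(def.getMapperClass());
        job.setMapperClass(def.getMapperClass());
        job.setReducerClass(def.getReducerClass());
        job.setOutputKeyClass(def.getOutputKeyClass());
        job.setOutputValueClass(def.getOutputValueClass());
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        job.submit();   // non-blocking; the job runs on the cluster
        return job;
    }

    // Monitoring hook: progress and completion are exposed directly by the Job API.
    public void report(Job job) throws Exception {
        System.out.printf("map %.0f%%, reduce %.0f%%, complete=%b%n",
                job.mapProgress() * 100, job.reduceProgress() * 100,
                job.isComplete());
    }
}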

Furthermore, apart from Hadoop there are two other similar frameworks that
look quite promising.

1. Sector/Sphere

Sector/Sphere [1] is an open source software framework for high-performance
distributed data storage and processing. It is comparable with Apache
HDFS/Hadoop. Sector is a distributed file system and Sphere is the
programming framework that supports massive in-storage parallel data
processing on data stored in Sector. The key motive is that Sector/Sphere
is claimed to be about 2 - 4 times faster than Hadoop.

2. Hyracks

Hyracks [2] is another framework for data-intensive computing that is
roughly in the same space as Apache Hadoop. It has support for composing
and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].

I am yet to look into the APIs of these two frameworks, but they should
ideally work with the same GFac implementation that I have proposed for
Hadoop.

I would greatly appreciate your feedback on this approach, as well as the
pros and cons of using Sector/Sphere or Hyracks if you already have
experience with them.

[1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
Many-Task Computing on Grids and Supercomputers, 2009, p. 3.

[2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks: A
flexible and extensible foundation for data-intensive computing,” in
Proceedings of the 2011 IEEE 27th International Conference on Data
Engineering (ICDE), 2011, pp. 1151–1162.

[3] http://asterix.ics.uci.edu/

Thanks,
Danushka

Re: Airavata/Hadoop Integration

Posted by Lahiru Gunathilake <gl...@gmail.com>.
Hi Danushka,

I'm on it right now, will finish in a couple of hours.

Lahiru

On Tue, Feb 26, 2013 at 10:23 AM, Suresh Marru <sm...@apache.org> wrote:

> On Feb 26, 2013, at 7:04 AM, Lahiru Gunathilake <gl...@gmail.com> wrote:
>
> > Hi Danushka,
> >
> > I think we already have a provider to handle Hadoop jobs which uses Apache
> > Whirr to set up the Hadoop cluster and submit the job.
>
> I think Lahiru is referring here to the GSOC projects -
> https://code.google.com/a/apache-extras.org/p/airavata-gsoc-sandbox/
>
> Suresh
>
> >
> > We still haven't ported this code to Airavata; once I do, I will send an
> > email to the list.
> >
> > Regards
> > Lahiru
> >
> > [original message quoted in full; trimmed]
>


-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Airavata/Hadoop Integration

Posted by Suresh Marru <sm...@apache.org>.
On Feb 26, 2013, at 7:04 AM, Lahiru Gunathilake <gl...@gmail.com> wrote:

> Hi Danushka,
> 
> I think we already have a provider to handle Hadoop jobs which uses Apache
> Whirr to set up the Hadoop cluster and submit the job.

I think Lahiru is referring here to the GSOC projects - https://code.google.com/a/apache-extras.org/p/airavata-gsoc-sandbox/

Suresh

> 
> We still haven't ported this code to Airavata; once I do, I will send an
> email to the list.
> 
> Regards
> Lahiru
> 
> On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <
> danushka.menikkumbura@gmail.com> wrote:
>
>> [original message quoted in full; trimmed]


Re: Airavata/Hadoop Integration

Posted by Lahiru Gunathilake <gl...@gmail.com>.
Hi Danushka,

I think we already have a provider to handle Hadoop jobs which uses Apache
Whirr to set up the Hadoop cluster and submit the job.

We still haven't ported this code to Airavata; once I do, I will send an email
to the list.

Regards
Lahiru

On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <
danushka.menikkumbura@gmail.com> wrote:

> [original message quoted in full; trimmed]
>



-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Airavata/Hadoop Integration

Posted by Danushka Menikkumbura <da...@gmail.com>.
And going forward,

Do you think having a plug-in architecture is beneficial? In this case we
can just live with what is there at the moment (by adding support for the
Hadoop provider), but taking the overall Airavata architecture into
consideration, do you see a need for such a framework?

I think having Hadoop job configuration support in XBaya (with the aid of
the Registry), together with the Hadoop provider, would significantly expand
the capabilities of Airavata.

What do you think?

Thanks,
Danushka


On Wed, Feb 27, 2013 at 8:35 AM, Danushka Menikkumbura <
danushka.menikkumbura@gmail.com> wrote:

> [previous message quoted in full; trimmed]
>

Re: Airavata/Hadoop Integration

Posted by Danushka Menikkumbura <da...@gmail.com>.
> And I hope the plugin architecture is something which allows users not to
> touch the Scheduler class and to plug their providers into gfac-core by
> dropping their jar into the classpath. If that is the case, this will be
> very useful for gateway developers.
>

Exactly.

> Danushka, please add more detailed information on how you are going to
> implement this.


We can have a plug-in architecture that is not limited to providers, though I
am not quite sure where else it would be applicable.

1. We will have a thread-safe singleton plugin manager that loads plugins
from a known location (say $AIRAVATA_HOME/plugins). The plugin manager is
initialized when the Airavata server is launched.

2. Plugins implement a predefined interface.

3. Each plugin has an identifier (say the fully qualified class name) that
is used to register itself with the plugin manager.

4. The plugin manager has a method to create new instances given this
identifier.

5. This method is used in the Scheduler (for example) to create a new
provider instance.

6. We define the provider identifier (i.e. the fully qualified class name)
in the JobExecutionContext (ApplicationContext?), as sketched below.
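
A rough sketch of such a plugin manager, just to make the idea concrete.
GFacProvider stands for the existing provider interface; the PluginManager
class itself, its method names, the $AIRAVATA_HOME/plugins layout and the use
of java.util.ServiceLoader for discovery are all assumptions on my part, not
existing code:

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical plugin manager sketch; GFacProvider is assumed to be the
// provider interface that plugins implement.
public final class PluginManager {

    private static final PluginManager INSTANCE = new PluginManager();

    // identifier (fully qualified class name) -> provider implementation class
    private final Map<String, Class<? extends GFacProvider>> providers =
            new ConcurrentHashMap<String, Class<? extends GFacProvider>>();

    private PluginManager() {}

    public static PluginManager getInstance() {
        return INSTANCE;
    }

    // Called once at server start-up: scan the plugin directory and register
    // every GFacProvider implementation advertised via META-INF/services.
    public void loadPlugins(File pluginDir) throws Exception {
        File[] jars = pluginDir.listFiles();
        if (jars == null) {
            return;
        }
        URL[] urls = new URL[jars.length];
        for (int i = 0; i < jars.length; i++) {
            urls[i] = jars[i].toURI().toURL();
        }
        ClassLoader loader = new URLClassLoader(urls, getClass().getClassLoader());
        for (GFacProvider provider : ServiceLoader.load(GFacProvider.class, loader)) {
            providers.put(provider.getClass().getName(), provider.getClass());
        }
    }

    // Used by the Scheduler: create a fresh provider for the identifier that
    // was placed in the JobExecutionContext.
    public GFacProvider newProvider(String identifier) throws Exception {
        Class<? extends GFacProvider> cls = providers.get(identifier);
        if (cls == null) {
            throw new IllegalArgumentException("Unknown provider: " + identifier);
        }
        return cls.newInstance();
    }
}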

Thanks,
Danushka

Re: Airavata/Hadoop Integration

Posted by Lahiru Gunathilake <gl...@gmail.com>.
Hi Danushka,


On Tue, Feb 26, 2013 at 10:05 PM, Danushka Menikkumbura <
danushka.menikkumbura@gmail.com> wrote:

> Hi Lahiru,
>
> > I think we have pretty much this functionality done in a similar way to
> > what you are explaining. I have added the code into trunk and will provide
> > some test classes, and will update the scheduler to return the HadoopProvider.
> >
>
> Yes. The Hadoop provider that you have committed more or less does the same
> thing that I was planning to do :-). I believe we can make the following two
> important improvements on top of it.
>
> 1. Adding support for handling chains of jobs. This is different from
> having individual jobs orchestrated at the workflow level.
>
> 2. Support for asynchronous job execution, which I believe is a must-have
> for long-running, data-intensive MapReduce jobs.
>
+1

And I hope the plugin architecture is something which allows users not to
touch the Scheduler class and to plug their providers into gfac-core by
dropping their jar into the classpath. If that is the case, this will be very
useful for gateway developers.

Danushka, please add more detailed information on how you are going to
implement this.

Regards
Lahiru

>
> [rest of quoted message trimmed]
>



-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Airavata/Hadoop Integration

Posted by Danushka Menikkumbura <da...@gmail.com>.
Hi Lahiru,

> I think we have pretty much this functionality done in a similar way to what
> you are explaining. I have added the code into trunk and will provide some
> test classes, and will update the scheduler to return the HadoopProvider.
>

Yes. The Hadoop provider that you have committed more or less does the same
thing that I was planning to do :-). I believe we can make the following two
important improvements on top of it.

1. Adding support for handling chains of jobs. This is different from
having individual jobs orchestrated at the workflow level.

2. Support for asynchronous job execution, which I believe is a must-have
for long-running, data-intensive MapReduce jobs (a rough sketch of both
ideas is below).
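
To illustrate both points, a sketch built on Hadoop's standard
JobControl/ControlledJob classes, which already model job dependencies (the
two Job instances are assumed to be configured elsewhere, e.g. by the Hadoop
provider):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Sketch: run a two-stage chain asynchronously so that GFac is not blocked
// while a long-running MapReduce chain executes.
public class HadoopChainRunner {

    public JobControl runChain(Job first, Job second) throws Exception {
        ControlledJob stage1 = new ControlledJob(first, null);
        ControlledJob stage2 = new ControlledJob(second, null);
        stage2.addDependingJob(stage1);   // stage2 starts only after stage1

        JobControl chain = new JobControl("hadoop-chain");
        chain.addJob(stage1);
        chain.addJob(stage2);

        // JobControl.run() blocks, so drive it from its own thread and let the
        // caller poll chain.allFinished() / chain.getFailedJobList() instead.
        new Thread(chain, "hadoop-chain-runner").start();
        return chain;
    }
}

The provider could then poll the returned JobControl (or the individual jobs)
for monitoring instead of blocking on waitForCompletion().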

> I am +1 to enabling the API to use other components, but do you think actual
> users would have a concern about the underlying library we use for MapReduce
> jobs? I am not quite sure how people are using these. But anyhow, it is nice
> to have support for them.
>

They are not MapReduce frameworks as such. Sector/Sphere is a completely
different execution framework for data-intensive computing, and Hyracks is
another data-intensive computing framework that also supports MapReduce. The
idea is to compare their performance and see which is better.

Thanks,
Danushka

Re: Airavata/Hadoop Integration

Posted by Lahiru Gunathilake <gl...@gmail.com>.
Hi Danushka,

On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <
danushka.menikkumbura@gmail.com> wrote:

> Hi Devs,
>
> I am looking into extending Big Data capabilities of Airavata as my M.Sc.
> research work. I have identified certain possibilities and am going to
> start with integrating Apache Hadoop (and Hadoop-like frameworks) with
> Airavata.
>
> From what I have understood, the best approach would be to have a new
> GFacProvider for Hadoop that takes care of handling Hadoop jobs. We can
> have a new parameter in the ApplicationContext (say TargetApplication) to
> define the target application type and resolve the correct provider in the
> GFac Scheduler based on that. I see that adding this capability to the
> Scheduler class is already a TODO. I have made these changes locally and
> invoked a simple Hadoop job through GFac, so the approach appears viable
> unless there is some implication I am missing.
>
> I think we can store Hadoop job definitions in the Airavata Registry, where
> each definition would essentially include a unique identifier and other
> attributes like the mapper, reducer, sorter, formatters, etc. that can be
> defined using XBaya. Information about these building blocks could be loaded
> from XML metadata files (of a known format) included in the jar files. It
> should also be possible to compose Hadoop job "chains" using XBaya. What we
> specify in the application context would then be the target application type
> (say Hadoop), the job/chain id, the input file location and the output file
> location. In addition, I am thinking of adding job monitoring support based
> on constructs provided by the Hadoop API (which I have already looked into)
> and data querying based on Apache Hive/Pig.
>
I think we have pretty much this functionality done in a similar way to what
you are explaining. I have added the code into trunk and will provide some
test classes, and will update the scheduler to return the HadoopProvider.

>
> Furthermore, apart from Hadoop there are two other similar frameworks that
> look quite promising.
>
> 1. Sector/Sphere
>
> Sector/Sphere [1] is an open source software framework for high-performance
> distributed data storage and processing. It is comparable with Apache
> HDFS/Hadoop. Sector is a distributed file system and Sphere is the
> programming framework that supports massive in-storage parallel data
> processing on data stored in Sector. The key motive is that Sector/Sphere
> is claimed to be about 2 - 4 times faster than Hadoop.
>
> 2. Hyracks
>
> Hyracks [2] is another framework for data-intensive computing that is
> roughly in the same space as Apache Hadoop. It has support for composing
> and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
> runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].
>
I am +1 to enabling the API to use other components, but do you think actual
users would have a concern about the underlying library we use for MapReduce
jobs? I am not quite sure how people are using these. But anyhow, it is nice
to have support for them.

Regards
Lahiru

> [rest of original message trimmed]
>



-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Airavata/Hadoop Integration

Posted by Danushka Menikkumbura <da...@gmail.com>.
Sounds great!

Thanks Amila.


On Tue, Feb 26, 2013 at 8:46 AM, Amila Jayasekara
<th...@gmail.com> wrote:

> On Mon, Feb 25, 2013 at 9:59 PM, Danushka Menikkumbura
> <da...@gmail.com> wrote:
> > Also, I suggest we have a simple plug-in architecture for providers that
> > would make having custom providers possible.
>
> Hi Danushka,
>
> I guess the plugin mechanism for providers is already in place with the
> new GFac architecture. Lahiru will be able to give more information
> about this.
>
> Thanks
> Amila
>
> >
> > Thanks,
> > Danushka
> >
> >
> > On Tue, Feb 26, 2013 at 3:18 AM, Danushka Menikkumbura <
> > danushka.menikkumbura@gmail.com> wrote:
> >
> >> [original message quoted in full; trimmed]
>

Re: Airavata/Hadoop Integration

Posted by Amila Jayasekara <th...@gmail.com>.
On Mon, Feb 25, 2013 at 9:59 PM, Danushka Menikkumbura
<da...@gmail.com> wrote:
> Also, I suggest we have a simple plug-in architecture for providers that
> would make having custom providers possible.

Hi Danushka,

I guess the plugin mechanism for providers is already in place with the
new GFac architecture. Lahiru will be able to give more information
about this.

Thanks
Amila

>
> Thanks,
> Danushka
>
>
> On Tue, Feb 26, 2013 at 3:18 AM, Danushka Menikkumbura <
> danushka.menikkumbura@gmail.com> wrote:
>
>> [original message quoted in full; trimmed]

Re: Airavata/Hadoop Integration

Posted by Danushka Menikkumbura <da...@gmail.com>.
Also, I suggest we have a simple plug-in architecture for providers that
would make having custom providers possible.

Thanks,
Danushka


On Tue, Feb 26, 2013 at 3:18 AM, Danushka Menikkumbura <
danushka.menikkumbura@gmail.com> wrote:

> [original message quoted in full; trimmed]
>