Posted to dev@gora.apache.org by Furkan KAMACI <fu...@gmail.com> on 2015/07/01 09:30:34 UTC

Spark Backend Support for Gora (GORA-386) Midterm Report

Hi,

First of all, I would like to thank you all. As you know, I was accepted to
GSoC 2015 with my proposal for developing Spark Backend Support for Gora
(GORA-386), and it is now time for the midterm evaluations. I want to share
my current progress on the project, along with my midterm report.

During my GSoC period, I've blogged at my personal website
(http://furkankamaci.com/) and created a fork of Apache Gora's master
branch to work on: https://github.com/kamaci/gora

During the community bonding period, I read the Apache Gora documentation
and source code to become more familiar with the project. I analyzed
related projects, including Apache Flink and Apache Crunch, to learn how to
implement a Spark backend for Apache Gora. I also picked up an issue from
Jira (https://issues.apache.org/jira/browse/GORA-262) and fixed it.

During the coding period, since implementing this project requires a solid
background in Apache Spark, I started by analyzing Spark's first papers:
"Spark: Cluster Computing with Working Sets"
(http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf) and
"Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing"
(https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I
published two posts on my personal blog, one about Spark and cluster
computing (http://furkankamaci.com/spark-and-cluster-computing/) and one
about Resilient Distributed Datasets
(http://furkankamaci.com/resilient-distributed-datasets-rdds/). I also
followed the Apache Spark documentation and developed examples to analyze
RDDs.

I've analyzed Apache Gora's GoraInputFormat class and Spark's
newAPIHadoopRDD method, and implemented an example application that reads
data from HBase.

Apache Gora supports reading and writing data through Hadoop's input/output
formats, and Spark has a method for generating an RDD from a Hadoop
InputFormat. So an architecture was designed that bridges GoraInputFormat
and RDDs, since both of them support the Hadoop file APIs.
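The bridge can be pictured with Spark's generic Hadoop entry point. The
following is only a sketch of the idea: `sc` is assumed to be an existing
JavaSparkContext, the Hadoop Configuration is assumed to already carry the
Gora data store settings, and the Long/Pageview key and value classes are
placeholders, not code taken from the project:

```java
import org.apache.gora.mapreduce.GoraInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: sc and the populated hadoopConf are assumed to exist, and
// Long/Pageview stand in for the data store's real key/value classes.
// GoraInputFormat is a Hadoop InputFormat, so Spark's newAPIHadoopRDD
// can turn a Gora-backed table into an RDD of (key, persistent) pairs.
Configuration hadoopConf = new Configuration(); // would carry Gora settings
JavaPairRDD<Long, Pageview> rdd = sc.newAPIHadoopRDD(
    hadoopConf,
    GoraInputFormat.class, // Gora's Hadoop InputFormat
    Long.class,            // placeholder key class
    Pageview.class);       // placeholder persistent value class
```

Once this pair RDD exists, the rest of the Spark API applies to it
unchanged, which is what makes the bridge attractive.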

I've created a base class for the Apache Gora and Spark integration, named
GoraSparkEngine. It has initialize methods that take a Spark context, a
data store, and an optional Hadoop configuration, and return an RDD.
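Based on that description, using the engine could look roughly like the
sketch below. The exact signatures, the Pageview type, and the data store
setup via DataStoreFactory are assumptions to be checked against the fork:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: DataStore, DataStoreFactory, Pageview and the GoraSparkEngine
// signatures below are assumed, not confirmed against the code base.
SparkConf conf = new SparkConf().setAppName("gora-spark-example")
                                .setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);

Configuration hadoopConf = new Configuration();
DataStore<Long, Pageview> store =
    DataStoreFactory.getDataStore(Long.class, Pageview.class, hadoopConf);

// The engine hides the GoraInputFormat plumbing and hands back an RDD.
GoraSparkEngine<Long, Pageview> engine =
    new GoraSparkEngine<>(Long.class, Pageview.class);
JavaPairRDD<Long, Pageview> rdd = engine.initialize(sc, store);

System.out.println("records: " + rdd.count()); // behaves like any other RDD
sc.stop();
```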

After implementing the base of the GoraSpark engine, I developed a new
example, modeled on LogAnalytics, named LogAnalyticsSpark. I developed its
map and reduce parts (except for writing the results back into the
database), which do the same thing as LogAnalytics and a bit more, e.g.
printing the number of lines in the tables.
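Stripped of the Spark plumbing, the map and reduce parts amount to a count
per key. A minimal plain-Java sketch of that logic (the whitespace-separated
log format here is a made-up placeholder, not Gora's actual Pageview
schema):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LogCount {
  // Map each log line to its URL, then reduce by summing counts per URL,
  // mirroring the map/reduce split described above.
  static Map<String, Long> countByUrl(List<String> lines) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (String line : lines) {
      String url = line.split("\\s+")[0]; // map step: extract the key
      counts.merge(url, 1L, Long::sum);   // reduce step: sum per key
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = List.of("/index 200", "/about 200", "/index 404");
    Map<String, Long> counts = countByUrl(lines);
    System.out.println(counts.get("/index")); // 2
    System.out.println(counts.size());        // 2 distinct URLs
    System.out.println(lines.size());         // 3 lines in the "table"
  }
}
```

In the Spark version the same two steps become a map over the RDD followed
by a reduceByKey, with the counting distributed across partitions.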

When we get an RDD from the GoraSpark engine, we can run operations on it
just as on any other RDD not created over Apache Gora. The whole code can
be checked in the code base: https://github.com/kamaci/gora

Project progress is ahead of the proposed timeline so far. The
GoraInputFormat-to-RDD transformation is done, and it has been shown that
map, reduce, and other methods work properly on such RDDs.

Before the next steps, I am planning to design the overall architecture
according to feedback from the community (there are some prerequisites when
designing the architecture, e.g. the configuration of a Spark context
cannot be changed after the context has been initialized).
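That Spark constraint means every setting has to be decided before the
context exists. A minimal illustration of the configuration pattern (the
property values here are placeholders, not the project's actual settings):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// All settings must go onto SparkConf up front; once the context is
// constructed, its configuration can no longer be changed.
SparkConf conf = new SparkConf()
    .setAppName("gora-spark-example")   // placeholder name
    .setMaster("local[2]")              // placeholder master
    .set("spark.serializer",
         "org.apache.spark.serializer.KryoSerializer");

JavaSparkContext sc = new JavaSparkContext(conf);
// Any Gora- or Hadoop-related settings the job needs must therefore be
// collected before this point in the overall architecture.
```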

When the necessary functionality is implemented, examples, tests, and
documentation will follow. After that, if I have extra time, I'm planning
to run a performance benchmark comparing Apache Gora with Hadoop MapReduce,
plain Hadoop MapReduce, plain Apache Spark, and Apache Gora with Spark.

Special thanks to Lewis and Talat. I should also mention that it is a real
advantage to be able to talk with your mentor face to face. Talat and I met
many times, and he helped me a lot in understanding how Hadoop and Apache
Gora work.

PS: I've attached my midterm report; my previous reports can be found here:
https://cwiki.apache.org/confluence/display/GORA/Spark+Backend+Support+for+Gora+%28GORA-386%29+Reports

Kind Regards,
Furkan KAMACI

Re: Spark Backend Support for Gora (GORA-386) Midterm Report

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi All,

It's been announced that I've passed the midterm evaluation! Besides my
mentors Lewis and Talat, I look forward to your comments and suggestions
about my project during the second half of GSoC. Thank you all again!

Kind Regards,
Furkan KAMACI

Re: Spark Backend Support for Gora (GORA-386) Midterm Report

Posted by Lewis John Mcgibbney <le...@gmail.com>.
This is fantastic.
Needless to say, the project will be progressing through the midterm.
Your blogging is very positive for the dissemination of your work.
I'd also like to extend a personal thank you to Talat. Excellent job, and
on behalf of the community here, an excellent effort to drive this GSoC
project so far, only halfway through :).
Looking forward to committing the initial patches into the master branch
and also your LogManagerSpark, which will lower the barrier to adopting the
module.
Thanks
Lewis


-- 
*Lewis*