Posted to user@hama.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2015/07/29 03:12:29 UTC

Re: Hama vs Spark

I found a research paper somewhat related to this topic.

"Both the disk based method, i.e., MR, and the memory based method,
i.e., BSP and Spark, need to load the data into main memory and
conduct the expensive computation. However, when processing topk
joins, BSP is clearly the best method as it is the only one that is
able to perform top-k joins on large datasets. This is because BSP
supports the frequent synchronizations between workers when performing
the joining procedure, which quickly lowers the joining threshold for
a given k. The winner between the MR and the Spark algorithms change
from datasets to datasets: Spark is beaten by MR on A and B while
beats MR on C." -
http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf
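
To make the mechanism in the quoted passage concrete, here is a minimal single-process sketch of how frequent synchronization can lower a top-k threshold. It is not the paper's algorithm and not Hama code: the function name `topk_join_bsp`, the `(lower_bound, distance)` candidate pairs, and the batch size are all assumptions made for illustration.

```python
import heapq

def topk_join_bsp(partitions, k, batch_size):
    """Return the k smallest distances and the number of pairs fully evaluated.

    Each partition is one (simulated) worker's list of
    (lower_bound, distance) candidate pairs.
    """
    best = []                  # max-heap (negated values) of the k smallest distances
    threshold = float("inf")   # shared threshold, refreshed only at a barrier
    evaluated = 0
    cursors = [0] * len(partitions)

    while any(c < len(p) for c, p in zip(cursors, partitions)):
        found = []
        # Compute phase: workers only know the threshold from the last barrier.
        for w, part in enumerate(partitions):
            batch = part[cursors[w]:cursors[w] + batch_size]
            cursors[w] += len(batch)
            for lower_bound, distance in batch:
                if lower_bound >= threshold:
                    continue   # pruned without the expensive computation
                evaluated += 1
                found.append(distance)
        # Barrier: merge every worker's results and tighten the threshold.
        for d in found:
            if len(best) < k:
                heapq.heappush(best, -d)
            elif d < -best[0]:
                heapq.heapreplace(best, -d)
        if len(best) == k:
            threshold = -best[0]
    return sorted(-d for d in best), evaluated
```

With a small batch size the workers synchronize often, so the threshold drops early and later candidates are pruned cheaply; with one large batch nothing is pruned because no threshold has been shared yet.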

On Thu, Jun 25, 2015 at 9:02 PM, Behroz Sikander <be...@gmail.com> wrote:
> Hi all,
> *>>Apache Spark is definitely more suited for ML (iterative algorithms)
> than legacy Hadoop due to its preservation of state and optimized
> execution strategy (RDDs). However, its approach still follows a
> synchronous iterative communication pattern.*
> So, Hama has a better communication model. That is a good point.
>
> *>>Moreover, BSP can have virtual shared memory and many more benefits.*
> I read somewhere that Spark has shared variables. Is BSP virtual shared
> memory something else, or is it like shared variables in Spark?
>
> *>>In addition, another convincing point, I think, is the ability to
> utilize modern acceleration accessories such as InfiniBand and GPUs*
> Yes, it is a good point, but I found the following link. Apparently, Spark
> can also do processing on GPUs.
> https://spark-summit.org/east-2015/talk/heterospark-a-heterogeneous-cpugpu-spark-platform-for-deep-learning-algorithms-2
>
> *>>I'm sure that this feature will bring a completely new wave of big
> data. The only problem we face is a lack of interest in the BSP
> programming model. :-)*
> My knowledge is quite limited, but I think you are right. With the rise of
> IoT and stream processing, GPUs will become vital. Yes, I do not
> understand why BSP is not the programming model of choice nowadays.
> It has a strong theoretical background, proposed decades ago, and
> still the MapReduce/Spark models are used.
>
>
> *>>Just FYI, one of my friends said after reading this thread, "if
> AmazonEC2 = MR or BSP, Google App Engine = Spark". Maybe usability side.*
> I have not written a Spark job before, but I have seen the code. BSP looks
> more intuitive to me somehow.
>
> *>>Hama = GraphX (Library of Spark (Pregel model) [1])*
> The graph module of Hama is definitely equal to GraphX of Spark.
>
> Regards,
> Behroz
>
> On Thu, Jun 25, 2015 at 1:46 AM, Edward J. Yoon <ed...@samsung.com>
> wrote:
>
>> Hi, here are a few thoughts.
>>
>> Apache Spark is definitely more suited for ML (iterative algorithms) than
>> legacy Hadoop due to its preservation of state and optimized execution
>> strategy (RDDs). However, its approach still follows a synchronous
>> iterative communication pattern.
>>
>> Apache Hama, in contrast, is a general-purpose pure BSP framework. While I
>> admit that synchronization costs are high, the communication can be more
>> efficiently realized with the message-passing BSP model. Moreover, BSP can
>> have virtual shared memory and many more benefits. In addition, another
>> convincing point, I think, is the ability to utilize modern acceleration
>> accessories such as InfiniBand and GPUs. I'm sure that this feature will
>> bring a completely new wave of big data. The only problem we face is a
>> lack of interest in the BSP programming model. :-)
>>
>> > 2) Do we have any recent benchmarks between the 2 systems?
>>
>> It's on my to-do list.
>>
>> --
>> Best Regards, Edward J. Yoon
>>
>> -----Original Message-----
>> From: Behroz Sikander [mailto:behroz89@gmail.com]
>> Sent: Thursday, June 25, 2015 12:57 AM
>> To: user@hama.apache.org
>> Subject: Hama vs Spark
>>
>> Hi,
>> A few days back, I started reading about Apache Spark. It is a pretty good
>> big data platform. But a question comes to mind: where does Hama stand in
>> comparison with Spark if we have to implement an iterative algorithm that
>> is compute-intensive (machine learning or optimization)?
>>
>> I found some resources online, but none answers my questions.
>>
>> 1) BSP vs MapReduce paper <http://arxiv.org/pdf/1203.2081v2.pdf>
>> 2) https://people.apache.org/~edwardyoon/documents/Hama_BSP_for_Advanced_Analytics.pdf
>> 3) I actually found the following benchmark, but it is quite old:
>>    http://markmail.org/message/vyjsdpv355kua7rm#query:+page:1+mid:vstgda4fhmz52pdw+state:results
>>
>> Questions:
>> 1) Is there any specific advantage when we choose the BSP model instead of
>> the Spark paradigm?
>> 2) Do we have any recent benchmarks between the 2 systems?
>> 3) What is the main convincing point to use Hama over Spark?
>> 4) Is there any scientific paper that compares both systems? (I was not
>> able to find any.)
>>
>> Regards,
>> Behroz Sikander
>>
>>
>>



-- 
Best Regards, Edward J. Yoon

RE: Hama vs Spark

Posted by "Edward J. Yoon" <ed...@samsung.com>.
Hi,

I don't fully understand how GraphLab works, but I'm sure that there are pros
and cons either way. At the moment I have no plan. :-)

However, I noticed that the region barrier synchronization feature within a
single BSP job (the default is global barrier synchronization) is quite
useful. It can be used for performing asynchronous mini-batches.
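
As a rough illustration of the idea (not Hama's actual API, which is not shown in this thread), the sketch below gives each group of workers its own barrier, so a region can move on to its next mini-batch without waiting for the other regions. The names `run_regions` and `region_sizes` are hypothetical, and Python threads stand in for BSP peers.

```python
import threading

def run_regions(region_sizes, steps):
    """Run `steps` mini-batches where workers wait only for peers in their region.

    region_sizes maps a (hypothetical) region name to its number of workers.
    Returns, per region, the (step, worker_id) pairs in the order they finished.
    """
    log = {region: [] for region in region_sizes}
    lock = threading.Lock()

    def worker(region, barrier, worker_id):
        for step in range(steps):
            # A real worker would process its mini-batch here.
            with lock:
                log[region].append((step, worker_id))
            # Regional barrier: blocks only until this region's workers arrive.
            barrier.wait()

    threads = []
    for region, size in region_sizes.items():
        barrier = threading.Barrier(size)  # one barrier per region, not a global one
        for worker_id in range(size):
            threads.append(
                threading.Thread(target=worker, args=(region, barrier, worker_id))
            )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Within a region the steps stay ordered, because nobody starts step n+1 before every peer in that region has finished step n. Across regions there is no ordering at all, which is where the asynchrony comes from.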

--
Best Regards, Edward J. Yoon

-----Original Message-----
From: Behroz Sikander [mailto:behroz89@gmail.com]
Sent: Monday, August 03, 2015 7:38 PM
To: user@hama.apache.org
Subject: Re: Hama vs Spark

I think I wrote it wrong. It should be Asynchronous Iterations. I found the
following a few months back. It was a thesis description:

*SUPPORT FOR ASYNCHRONOUS ITERATIONS IN FLINK (IN COLLABORATION WITH KTH
ROYAL INSTITUTE FOR TECHNOLOGY, SWE)*

*Context:* Currently, most of the large scale graph processing systems
adopt the bulk synchronous parallel (BSP) model. According to this model,
iterative computations happen in well-defined supersteps, which are marked
by a global barrier. BSP simplifies application development and ensures
determinism. However, it has been shown that asynchronous execution often
leads to faster convergence, for several algorithms [LBG+12]. The main goal
of this thesis is to add support for asynchronous iterative execution, in
Apache Flink, a general-purpose, distributed data processing system.

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf




Re: Hama vs Spark

Posted by Behroz Sikander <be...@gmail.com>.
I think I wrote it wrong. It should be Asynchronous Iterations. I found the
following a few months back. It was a thesis description:

*SUPPORT FOR ASYNCHRONOUS ITERATIONS IN FLINK (IN COLLABORATION WITH KTH
ROYAL INSTITUTE FOR TECHNOLOGY, SWE)*

*Context:* Currently, most of the large scale graph processing systems
adopt the bulk synchronous parallel (BSP) model. According to this model,
iterative computations happen in well-defined supersteps, which are marked
by a global barrier. BSP simplifies application development and ensures
determinism. However, it has been shown that asynchronous execution often
leads to faster convergence, for several algorithms [LBG+12]. The main goal
of this thesis is to add support for asynchronous iterative execution, in
Apache Flink, a general-purpose, distributed data processing system.

http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
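
For readers unfamiliar with the superstep model the thesis description refers to, here is a minimal single-process sketch: messages sent during a superstep only become visible after the global barrier. The function `bsp_max` and the max-propagation task are illustrative assumptions, not code from Hama, Flink, or the linked paper.

```python
def bsp_max(values, neighbors):
    """Propagate the maximum value using synchronous supersteps.

    values[i] is worker i's value; neighbors[i] lists the workers it sends to.
    Returns the final values and the number of supersteps executed.
    """
    values = list(values)
    active = set(range(len(values)))  # workers whose value changed last superstep
    supersteps = 0
    while active:
        # Compute and communicate: messages are buffered, not delivered yet.
        outbox = [[] for _ in values]
        for i in active:
            for j in neighbors[i]:
                outbox[j].append(values[i])
        # Global barrier: everything sent in this superstep is delivered now.
        supersteps += 1
        active = set()
        for i, msgs in enumerate(outbox):
            if msgs and max(msgs) > values[i]:
                values[i] = max(msgs)
                active.add(i)
    return values, supersteps
```

On a four-worker ring, `bsp_max([3, 1, 4, 2], [[1], [2], [3], [0]])` needs four supersteps, because a value can travel only one hop per barrier. An asynchronous scheme would let a worker forward a new value as soon as it arrives, which is the faster convergence the description mentions.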

On Mon, Aug 3, 2015 at 3:16 AM, Edward J. Yoon <ed...@apache.org>
wrote:

> I'm not sure how that would be possible. However, I think the user can
> find the slowest machine in each superstep and re-balance the loads. This
> can be handled from the client (user) side.

Re: Hama vs Spark

Posted by "Edward J. Yoon" <ed...@apache.org>.
I'm not sure how that would be possible. However, I think the user can
find the slowest machine in each superstep and re-balance the loads. This
can be handled from the client (user) side.
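
A minimal sketch of what such client-side re-balancing could look like, assuming the client can measure how many items each worker processed and how long it took in the last superstep. The function `rebalance` and its inputs are hypothetical; nothing here is an existing Hama facility.

```python
def rebalance(items, seconds):
    """Find the slowest worker and propose a new split of the same total work.

    items[i] is how many items worker i processed in the last superstep and
    seconds[i] how long that took (assumed > 0). Both are hypothetical
    measurements that the client would have to collect itself.
    """
    slowest = seconds.index(max(seconds))
    throughput = [n / t for n, t in zip(items, seconds)]  # items per second
    total_items = sum(items)
    total_throughput = sum(throughput)
    # Give each worker a share proportional to its measured throughput.
    shares = [int(total_items * tp / total_throughput) for tp in throughput]
    # Rounding down can leave a few items over; hand them to the fastest worker.
    shares[throughput.index(max(throughput))] += total_items - sum(shares)
    return slowest, shares
```

For two workers that each handled 100 items in 1 s and 4 s, this flags the second worker as slowest and proposes a 160/40 split, so that both would finish the next superstep at roughly the same time if their speeds stay the same.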

On Sat, Aug 1, 2015 at 4:17 AM, Behroz Sikander <be...@gmail.com> wrote:
> +1. This is great.
>
> Btw, our current implementation of Hama is synchronous BSP, i.e., we have
> to wait for the slowest machine to sync in order to move to the next
> superstep. Is there anything like asynchronous BSP out yet? If yes, do you
> have plans to add it to this framework?
>
> Regards,
> Behroz



-- 
Best Regards, Edward J. Yoon

Re: Hama vs Spark

Posted by Behroz Sikander <be...@gmail.com>.
+1. This is great.

Btw, our current implementation of Hama is synchronous BSP, i.e., we have
to wait for the slowest machine to sync in order to move to the next
superstep. Is there anything like asynchronous BSP out yet? If yes, do you
have plans to add it to this framework?
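
The cost of waiting for the slowest machine can be read off the standard BSP cost model (Valiant's formulation, which is not stated in this thread). As a sketch, with $w_i$ the local compute time of worker $i$, $h$ the largest number of messages any worker sends or receives, $g$ the per-message communication cost, and $l$ the barrier latency:

```latex
T_{\text{superstep}} = \max_{i} w_i + h \cdot g + l
```

For example, with hypothetical compute times of 2 s, 3 s, and 9 s on three workers, the superstep cannot finish in less than 9 s plus communication and barrier overhead, even though two of the workers sit idle for most of it.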

Regards,
Behroz

On Wed, Jul 29, 2015 at 3:12 AM, Edward J. Yoon <ed...@apache.org>
wrote:

> I found a research paper somewhat related to this topic.
>
> "Both the disk based method, i.e., MR, and the memory based method,
> i.e., BSP and Spark, need to load the data into main memory and
> conduct the expensive computation. However, when processing topk
> joins, BSP is clearly the best method as it is the only one that is
> able to perform top-k joins on large datasets. This is because BSP
> supports the frequent synchronizations between workers when performing
> the joining procedure, which quickly lowers the joining threshold for
> a given k. The winner between the MR and the Spark algorithms change
> from datasets to datasets: Spark is beaten by MR on A and B while
> beats MR on C." -
> http://www.ruizhang.info/publications/TPDS2015-Heads_Join.pdf