Posted to dev@airavata.apache.org by Apoorv Palkar <ap...@aol.com> on 2017/05/18 01:13:19 UTC

[GSoC Plan of Attack] Choosing Apache Spark

Hey Dev,


I have started my GSoC work here at Indiana University. I have chosen to investigate Spark over Storm/Flink for our distributed model, because Storm and Flink are generally better suited for live event streaming. We are analyzing the batch-processing case first and may consider live streaming later. Spark fits this plan well: it supports batch processing through its core engine and live processing through the Spark Streaming library. Over the past four days I configured the Spark standalone cluster manager to work with worker-node virtual machines on AWS EC2. Since EC2 is a paid service, we have decided to switch to JetStream via the OpenStack API. For now, I am using Spark Standalone as the cluster manager between the core engine and the workers. In addition, I'm investigating Mesos and YARN (via Hadoop) as possible cluster managers for Airavata in the future.
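To make the batch case concrete, here is a minimal Scala sketch of the kind of job this setup is meant to run against the standalone master. The master URL, host name, and input path below are placeholders, not details from the actual cluster:

import org.apache.spark.sql.SparkSession

object BatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("airavata-batch-sketch")
      // Standalone cluster manager; "master-host" is a placeholder.
      // In practice the master is usually supplied via spark-submit --master.
      .master("spark://master-host:7077")
      .getOrCreate()

    // Plain batch processing through the core engine: count words in a text file.
    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt")   // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(20).foreach(println)
    spark.stop()
  }
}

The same job can later be pointed at YARN or Mesos just by changing the master URL passed to spark-submit, which is why the cluster-manager question stays open for now.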


Any suggestions would be good.


Apoorv Palkar

Re: [GSoC Plan of Attack] Choosing Apache Spark

Posted by "Christie, Marcus Aaron" <ma...@iu.edu>.
Apoorv,

Looks like you are making progress, which is great. However, I’m not quite sure what problem you are trying to solve. Is there a writeup or something describing it?

Thanks,

Marcus

On May 17, 2017, at 9:13 PM, Apoorv Palkar <ap...@aol.com> wrote:

Hey Dev,

I have started my GSoC work here at Indiana University. I have chosen to investigate Spark over Storm/Flink for our distributed model, because Storm and Flink are generally better suited for live event streaming. We are analyzing the batch-processing case first and may consider live streaming later. Spark fits this plan well: it supports batch processing through its core engine and live processing through the Spark Streaming library. Over the past four days I configured the Spark standalone cluster manager to work with worker-node virtual machines on AWS EC2. Since EC2 is a paid service, we have decided to switch to JetStream via the OpenStack API. For now, I am using Spark Standalone as the cluster manager between the core engine and the workers. In addition, I'm investigating Mesos and YARN (via Hadoop) as possible cluster managers for Airavata in the future.

Any suggestions would be good.

Apoorv Palkar