Posted to dev@airavata.apache.org by "Shenoy, Gourav Ganesh" <go...@indiana.edu> on 2017/06/07 15:12:34 UTC

Re: Apache Flink Execution

Hi dev,

I did some literature reading on Storm vs. Flink, with an emphasis on our use case of distributed task execution. My initial impressions are as follows (I will also update the Google doc accordingly):


1.  Although the Storm and Flink engines appear similar in supporting pipeline processing, Storm can only handle data streams, whereas Flink supports both stream and batch processing. This allows Flink to transfer data between parallel tasks – we have no such support today, but we can definitely think about parallel task execution.

2.  Storm supports at-least-once and at-most-once data processing, whereas Flink guarantees exactly-once processing. Storm also supports exactly-once via its Trident API. From what I read, Flink claims to be more efficient in terms of processing semantics, as it uses a lighter-weight algorithm for checkpointing data transfers.

3.  Flink offers high-level APIs that simplify the data collection process, which is a little tedious in Storm. In Storm one must manually implement readers and collectors, whereas Flink provides functions such as Map, GroupBy, Window, and Join.

4.  A major positive of Flink is the ability to maintain custom state in operators/executors. This custom state can also be used in checkpointing for fault tolerance.
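To make point 3 more concrete, the Map/GroupBy style of transformation Flink exposes can be illustrated with a classic word count. The sketch below is plain Java (no Flink dependency); the class and method names are mine, and it only mimics the programming pattern, not the actual Flink API:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Mimics the Map -> GroupBy -> Count pipeline that Flink's high-level
    // API expresses: split lines into words (map), group identical words,
    // then count each group.
    static Map<String, Long> wordCount(String... lines) {
        return Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount("to be or not to be");
        System.out.println(counts.get("to")); // 2
        System.out.println(counts.get("be")); // 2
    }
}
```

In Storm, the equivalent would require hand-writing a spout to emit lines and separate bolts to split and count, which is the "manual readers and collectors" tedium mentioned above.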

I think Flink is an improvement over Storm, but this is just my understanding from initial readings; I haven't yet tried coding any examples in Flink. Again, most of the features/differences mentioned above, offered by both Storm and Flink, target stream processing with a focus on executing a large number of small tasks (in parallel?) over continuously streaming data, so the competition is about offering low-latency processing. These might not be that important for the Airavata use case, where tasks may take time to complete.

Thanks and Regards,
Gourav Shenoy

From: "Pierce, Marlon" <ma...@iu.edu>
Reply-To: <de...@airavata.apache.org>
Date: Wednesday, May 24, 2017 at 11:36 AM
To: "dev@airavata.apache.org" <de...@airavata.apache.org>
Subject: Re: Apache Flink Execution

Thanks, Apoorv.  Note for everyone else: request access if you’d like to leave a comment or make a suggestion.

Marlon

From: Apoorv Palkar <ap...@aol.com>
Reply-To: "dev@airavata.apache.org" <de...@airavata.apache.org>
Date: Wednesday, May 24, 2017 at 11:32 AM
To: "dev@airavata.apache.org" <de...@airavata.apache.org>
Subject: Apache Flink Execution

https://docs.google.com/document/d/1GDh8kEbAXVY9Gv1mmFvq__zLN_JP6m2_KbfN-9C0uO0/edit?usp=sharing

Link to the Flink use-case/fundamentals doc.

Re: Apache Flink Execution

Posted by Apoorv Palkar <ap...@aol.com>.
I've found the same deal with Spark. Spark performs many kinds of generic data processing, such as map, reduce, and filter. If we want to make it work for us, we would need to add our own implementations, which could potentially be a problem.
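The generic operations mentioned here (map, reduce, filter) compose value-level transformations and know nothing about domain tasks. A plain-Java-streams sketch of that style (method name is illustrative, not from any of these frameworks):

```java
import java.util.List;

public class GenericOpsSketch {
    // The three generic operations Spark/Flink pipelines center on,
    // expressed with plain Java streams:
    // filter keeps matching elements, map transforms each, reduce folds
    // everything into a single value.
    static int sumOfEvenSquares(List<Integer> xs) {
        return xs.stream()
                .filter(x -> x % 2 == 0)   // keep even numbers
                .map(x -> x * x)           // square each survivor
                .reduce(0, Integer::sum);  // fold into a sum
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares(List.of(1, 2, 3, 4))); // 4 + 16 = 20
    }
}
```

Anything that is not expressible as such a value transformation, e.g. submitting and monitoring a remote job, falls outside this model and would need custom implementation work.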



-----Original Message-----
From: Shenoy, Gourav Ganesh <go...@indiana.edu>
To: dev <de...@airavata.apache.org>
Sent: Mon, Jun 12, 2017 4:16 pm
Subject: Re: Apache Flink Execution



Hi Dev,
 
After doing some more reading and playing around with Storm & Flink code examples, I am now of the opinion that – although Flink provides us certain benefits over Storm (see previous email) – integrating Flink to suit the Airavata use case might not work. The reasons are as follows:
 
1.      Implementing custom functions/task executors in Flink is not as straightforward as in Storm (bolts). Flink uses the concept of datasets and transformations: we define the data (bounded/unbounded) and apply transformations to it, i.e. we define operators that transform input data into output data. The problem is that the transformations Flink accepts are limited to generic data processing, such as Map, Reduce, Join, GroupBy, KeyBy, Aggregate, etc. The only flexibility is that we can define our own implementations of these generic transformation APIs.

In contrast, for Airavata we need much more complicated task-executor implementations. These generic transformations are of no use in Airavata, as they only target stream-processing use cases. E.g., if you have a dataset of calls made between two people along with each call's duration, you can override the Map and GroupBy functions to produce a transformed dataset of <call, totalduration>; similarly for the word-count example.
 
2.      Although Flink claims to support bounded datasets (as opposed to Storm, which expects unbounded data – it can be tweaked to handle bounded data, but there is no native support), the datasets need to be a Collection/Tuple in most cases.
 
3.      The thing that troubles me the most is that there is NO way to define custom executors and invoke them in the manner we anticipate. E.g., we would ideally want to deploy/enable task executors – job submission, data staging, monitoring, etc. – on workers, and then create a DAG to invoke them. This capability is available in Storm via topologies (DAG), spouts (data sources), and bolts (executors). In Flink, by contrast, everything is about applying some transformation to an incoming dataset to generate a new dataset – aggregating records, breaking sentences into words and grouping identical words to count them, and so on.
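For contrast, what we want from a task-execution framework looks less like a data transformation and more like the following plain-Java sketch: a DAG of arbitrary custom executors invoked in dependency order. All names here are illustrative (not actual Airavata or Storm classes), and the scheduler is a deliberately naive one that assumes an acyclic graph:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaskDagSketch {
    // A task executor is arbitrary domain logic (submit a job, stage data,
    // poll a monitor), not a dataset-to-dataset transformation.
    interface TaskExecutor { String execute(); }

    private final Map<String, TaskExecutor> tasks = new LinkedHashMap<>();
    private final Map<String, List<String>> deps = new LinkedHashMap<>();

    void addTask(String name, TaskExecutor executor, String... dependsOn) {
        tasks.put(name, executor);
        deps.put(name, List.of(dependsOn));
    }

    // Runs each task only after all of its dependencies have run,
    // returning the execution order. Assumes the graph has no cycles.
    List<String> run() {
        List<String> order = new ArrayList<>();
        while (order.size() < tasks.size()) {
            for (String name : tasks.keySet()) {
                if (!order.contains(name) && order.containsAll(deps.get(name))) {
                    tasks.get(name).execute();
                    order.add(name);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        TaskDagSketch dag = new TaskDagSketch();
        dag.addTask("env-setup", () -> "env ready");
        dag.addTask("data-staging", () -> "inputs staged", "env-setup");
        dag.addTask("job-submission", () -> "job submitted", "data-staging");
        dag.addTask("monitoring", () -> "job monitored", "job-submission");
        System.out.println(dag.run());
        // [env-setup, data-staging, job-submission, monitoring]
    }
}
```

Storm's topology/spout/bolt model maps naturally onto this shape; Flink's transformation model does not, which is the core of the mismatch described above.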
 
The only positive I observed was the ability to run a Storm topology on Flink – but this is more of a backward-compatibility feature for migrating user applications written in Storm over to Flink. I am not an expert in Flink, so what I've pointed out above is my understanding after reading the literature and running the code examples. Anyone who has worked with Flink, please feel free to provide your input.
 
Thanks and Regards,
Gourav Shenoy
 



