Posted to user@flink.apache.org by Samarth Mailinglist <ma...@gmail.com> on 2014/12/23 18:19:27 UTC

Flink and Spark

Hey folks, I have a noob question.

I already looked up the archives and saw a couple of discussions
<http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
about Spark and Flink.

I am familiar with Spark (the Python API, especially MLlib), and I see many
similarities between Flink and Spark.

How does Flink distinguish itself from Spark?

Re: Flink and Spark

Posted by Till Rohrmann <tr...@apache.org>.
Hi Samarth,

You can also find several implementations of ALS on Flink here:
https://github.com/project-flink/flink-perf/tree/master/flink-jobs/src/main/scala/com/github/projectflink/als
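The linked repository contains full Flink-based ALS implementations. As a rough sketch of the alternating least squares idea itself (this is not the linked code, just an illustration): fix one factor, solve for the other in closed form, and alternate. A minimal rank-1 version in plain Python:

```python
# Minimal rank-1 Alternating Least Squares (ALS) sketch in plain Python.
# Illustration only -- the Flink implementations linked above are distributed
# and handle sparse ratings; this factors a tiny, fully observed matrix.

def als_rank1(ratings, iterations=50):
    """Factor `ratings` (list of lists) as an outer product u * v^T by
    alternating closed-form least-squares updates."""
    n_users = len(ratings)
    n_items = len(ratings[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix v, solve for each u_i minimizing sum_j (r_ij - u_i * v_j)^2.
        vv = sum(x * x for x in v)
        u = [sum(ratings[i][j] * v[j] for j in range(n_items)) / vv
             for i in range(n_users)]
        # Fix u, solve for each v_j symmetrically.
        uu = sum(x * x for x in u)
        v = [sum(ratings[i][j] * u[i] for i in range(n_users)) / uu
             for j in range(n_items)]
    return u, v

if __name__ == "__main__":
    # A rank-1 matrix (outer product of [1, 2, 3] and [2, 4]) is recovered.
    r = [[2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
    u, v = als_rank1(r)
    print([[u[i] * v[j] for j in range(2)] for i in range(3)])
```

In the real setting the ratings matrix is sparse and each update solves a small regularized least-squares problem per user/item, which is what makes the alternation parallelize well.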

On Fri, Dec 26, 2014 at 4:12 PM, Robert Metzger <rm...@apache.org> wrote:

> For the Python API, there is a pending pull request:
> https://github.com/apache/incubator-flink/pull/202 It is still work in
> progress, but feedback is, as always, appreciated.
>
>
>
> On Fri, Dec 26, 2014 at 3:41 PM, Samarth Mailinglist <
> mailinglistsamarth@gmail.com> wrote:
>
>> Thanks a lot Márton and Gyula!
>>
>> On Fri, Dec 26, 2014 at 2:42 PM, Márton Balassi <balassi.marton@gmail.com
>> > wrote:
>>
>>> Hey,
>>>
>>> You can find some ml examples like LinerRegression [1, 2] or KMeans [3,
>>> 4] in the examples package in both java and scala as a quickstart.
>>>
>>> [1]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java
>>> [2]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/ml/LinearRegression.scala
>>> [3]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
>>> [4]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/clustering/KMeans.scala
>>>
>>> On Fri, Dec 26, 2014 at 7:31 AM, Samarth Mailinglist <
>>> mailinglistsamarth@gmail.com> wrote:
>>>
>>>> Thank you the answers, folks.
>>>> Can anyone provide me a link for any implementation of an ML algorithm
>>>> on Flink?
>>>>
>>>> On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <gy...@apache.org> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> 1-2. As for failure recovery, there is a difference how the Flink
>>>>> batch and streaming programs handle failures. The failed parts of the batch
>>>>> jobs currently restart upon failures but there is an ongoing effort on fine
>>>>> grained fault tolerance which is somewhat similar to sparks lineage
>>>>> tracking. (so technically this is exactly once semantic but that is
>>>>> somewhat meaningless for batch jobs)
>>>>>
>>>>> For streaming programs we are currently working on fault tolerance, we
>>>>> are hoping to support at least once processing guarantees in the 0.9
>>>>> release. After that we will focus our research efforts on an high
>>>>> performance implementation of exactly once processing semantics, which is
>>>>> still a hard topic in streaming systems. Storm's trident's exaclty once
>>>>> semantics can only provide very low throughput while we are trying hard to
>>>>> avoid this issue, as our streaming system is capable of much higher
>>>>> throughput than storm in general as you can see on some perf measurements.
>>>>>
>>>>> 3. There are already many ml algorithms implemented for Flink but they
>>>>> are scattered all around. We are planning to collect them in a machine
>>>>> learning library soon. We are also implementing an adapter for Samoa which
>>>>> will provide some streaming machine learning algorithms as well. Samoa
>>>>> integration should be ready in January.
>>>>>
>>>>> 4. Flink carefully manages its memory use to avoid heap errors, and
>>>>> utilizing memory space as effectively as it can. The optimizer for batch
>>>>> programs also takes care of a lot of optimization steps that the user would
>>>>> manually have to do in other system, like optimizing the order of
>>>>> transformations etc. There are of course parts of the program that still
>>>>> needs to modified for maximal performance, for example parallelism settings
>>>>> for some operators in some cases.
>>>>>
>>>>> 5. As for the status of the Python API I personally cannot say very
>>>>> much, maybe someone can jump in and help me with that question :)
>>>>>
>>>>> Regards,
>>>>> Gyula
>>>>>
>>>>> On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <
>>>>> mailinglistsamarth@gmail.com> wrote:
>>>>>
>>>>>> Thank you for your answer. I have a couple of follow-up questions:
>>>>>> 1. Does it support the 'exactly once' semantics that Spark and Storm
>>>>>> support?
>>>>>> 2. (Related to 1) What happens when an error occurs during
>>>>>> processing?
>>>>>> 3. Is there a plan for adding machine learning support on top of
>>>>>> Flink? Say Alternating Least Squares, basic Naive Bayes?
>>>>>> 4. When you say Flink manages itself, does it mean I don't have to
>>>>>> fiddle with the number of partitions (Spark) or the number of
>>>>>> reducers/mappers (Hadoop) to optimize performance? (In some cases this
>>>>>> might be needed.)
>>>>>> 5. How far along is the Python API? I don't see the specs on the
>>>>>> website.
>>>>>>
>>>>>> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <mb...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear Samarth,
>>>>>>>
>>>>>>> Besides the discussions you have mentioned [1], I can recommend one
>>>>>>> of our recent presentations [2], especially the section on what
>>>>>>> distinguishes Flink (from slide 16).
>>>>>>>
>>>>>>> It is generally a difficult question, as both systems are rapidly
>>>>>>> evolving, so the answer can become outdated quite fast. However, there are
>>>>>>> fundamental design features that are highly unlikely to change. For
>>>>>>> example, Spark uses "true" batch processing, meaning that intermediate
>>>>>>> results are materialized (mostly in memory) as RDDs. Flink's engine is
>>>>>>> internally more like a streaming engine, forwarding results to the next
>>>>>>> operator as soon as possible. The latter can yield performance benefits
>>>>>>> for more complex jobs. Flink also gives you a query optimizer, spills
>>>>>>> gracefully to disk when the system runs out of memory, and has some cool
>>>>>>> features around serialization. For performance numbers and some more
>>>>>>> insight, please check out the presentation [2], and do not hesitate to
>>>>>>> post a follow-up mail here if you come across something unclear.
>>>>>>>
>>>>>>> [1]
>>>>>>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>>>>>>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Marton
>>>>>>>
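Márton's distinction between materialized batch execution and pipelined, streaming-style execution can be illustrated very loosely (this is only an analogy in plain Python, not either system's API) with lists versus generators:

```python
# Loose analogy (not Flink or Spark code): "true" batch execution materializes
# each intermediate result before the next stage starts, while pipelined
# execution forwards each record to the next operator as soon as possible.

def materialized(data):
    # Batch style: each stage completes fully, producing an intermediate list.
    stage1 = [x * 2 for x in data]    # fully materialized before stage2 runs
    stage2 = [x + 1 for x in stage1]  # starts only after stage1 is done
    return stage2

def pipelined(data):
    # Streaming style: generators move one record at a time through the chain,
    # so no full intermediate collection ever exists.
    stage1 = (x * 2 for x in data)    # lazy; nothing materialized
    stage2 = (x + 1 for x in stage1)  # consumes stage1 record by record
    return list(stage2)

print(materialized([1, 2, 3]))  # [3, 5, 7]
print(pipelined([1, 2, 3]))     # [3, 5, 7]
```

Both produce the same result; the difference is when and where intermediate data lives, which is what can matter for the performance of longer operator chains.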

Re: Flink and Spark

Posted by Robert Metzger <rm...@apache.org>.
For the Python API, there is a pending pull request:
https://github.com/apache/incubator-flink/pull/202. It is still a work in
progress, but feedback is, as always, appreciated.




Re: Flink and Spark

Posted by Samarth Mailinglist <ma...@gmail.com>.
Thanks a lot Márton and Gyula!


Re: Flink and Spark

Posted by Márton Balassi <ba...@gmail.com>.
Hey,

You can find some ML examples like LinearRegression [1, 2] or KMeans [3, 4]
in the examples package, in both Java and Scala, as a quickstart.

[1]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java
[2]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/ml/LinearRegression.scala
[3]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
[4]
https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/clustering/KMeans.scala
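The linked examples are full Flink programs; as a plain-Python sketch of the k-means loop they implement in distributed form (assignment step, then centroid update; this is not the Flink code itself):

```python
# Plain-Python sketch of the k-means iteration that the linked Flink examples
# run as a distributed bulk iteration. 2-D points as (x, y) tuples.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                     for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster is empty).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

In the Flink versions the assignment step is a map over the point data set with the centroids broadcast to it, and the update step is a grouped reduce, with the loop expressed as a bulk iteration.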


Re: Flink and Spark

Posted by Samarth Mailinglist <ma...@gmail.com>.
Thank you for the answers, folks.
Can anyone point me to an implementation of an ML algorithm on
Flink?


Re: Flink and Spark

Posted by Gyula Fóra <gy...@apache.org>.
Hey,

1-2. As for failure recovery, there is a difference in how Flink batch and
streaming programs handle failures. The failed parts of batch jobs currently
restart upon failure, but there is an ongoing effort on fine-grained fault
tolerance, somewhat similar to Spark's lineage tracking. (So technically this
is exactly-once semantics, but that is somewhat meaningless for batch jobs.)

For streaming programs we are currently working on fault tolerance, we are
hoping to support at least once processing guarantees in the 0.9 release.
After that we will focus our research efforts on an high performance
implementation of exactly once processing semantics, which is still a hard
topic in streaming systems. Storm's trident's exaclty once semantics can
only provide very low throughput while we are trying hard to avoid this
issue, as our streaming system is capable of much higher throughput than
storm in general as you can see on some perf measurements.
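The lineage idea mentioned above can be sketched in a few lines of plain
Java. This is a toy illustration of the general technique, not Flink or
Spark code; all names here are hypothetical. A partition keeps the recipe
that derived it, so after a loss it can recompute itself instead of
relying on a checkpoint:

```java
import java.util.function.Supplier;

// Toy sketch of lineage-based recovery (illustrative names, not a real API):
// a partition remembers how it was derived and recomputes itself after loss.
public class LineagePartition<T> {
    private T cached;                  // in-memory result; may be lost on failure
    private final Supplier<T> lineage; // recipe to rebuild from upstream data

    public LineagePartition(Supplier<T> lineage) {
        this.lineage = lineage;
    }

    public T get() {
        if (cached == null) {
            cached = lineage.get();    // recompute on demand after a loss
        }
        return cached;
    }

    public void simulateNodeFailure() {
        cached = null;                 // the materialized result disappears
    }

    public static void main(String[] args) {
        LineagePartition<Integer> part =
            new LineagePartition<>(() -> 1 + 2 + 3); // lineage: derive from inputs
        System.out.println(part.get());              // computes the value: 6
        part.simulateNodeFailure();
        System.out.println(part.get());              // rebuilt from lineage: 6
    }
}
```

The trade-off is the same one Gyula describes: nothing has to be written
to stable storage, but recovery pays the cost of recomputation.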

3. There are already many ML algorithms implemented for Flink, but they
are scattered all around. We are planning to collect them in a machine
learning library soon. We are also implementing an adapter for Samoa,
which will provide some streaming machine learning algorithms as well.
The Samoa integration should be ready in January.

4. Flink carefully manages its memory use to avoid heap errors and to
utilize memory as effectively as it can. The optimizer for batch programs
also takes care of a lot of optimization steps that the user would have to
do manually in other systems, like choosing the order of transformations.
There are, of course, parts of a program that still need to be tuned by
hand for maximal performance, for example the parallelism settings of some
operators in some cases.
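To make the "parallelism setting" knob concrete, here is a generic JVM
sketch using only java.util.concurrent (deliberately not Flink's API,
which differs): the caller picks an explicit degree of parallelism for a
data-parallel computation, which is the kind of decision Flink's runtime
mostly makes for you:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class ParallelismKnob {
    // Run a data-parallel sum with an explicitly chosen degree of parallelism.
    public static long sumWithParallelism(int parallelism, int n) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(parallelism); // the manual knob
        try {
            // A parallel stream submitted inside a ForkJoinPool uses that pool.
            return pool.submit(() ->
                IntStream.rangeClosed(1, n).parallel().asLongStream().sum()
            ).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumWithParallelism(4, 100)); // 5050
    }
}
```

The result is the same for any parallelism value; only resource usage
changes, which is exactly why it is pleasant to leave such settings to an
optimizer except in the cases Gyula mentions.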

5. As for the status of the Python API, I personally cannot say very much;
maybe someone can jump in and help me with that question :)

Regards,
Gyula

On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <
mailinglistsamarth@gmail.com> wrote:

> Thank you for your answer. I have a couple of follow up questions:
> 1. Does it support 'exactly once semantics' that Spark and Storm support?
> 2. (Related to 1) What happens when an error occurs during processing?
> 3. Is there a plan for adding machine learning support on top of Flink?
> Say Alternating Least Squares, or basic Naive Bayes?
> 4. When you say Flink manages itself, does it mean I don't have to fiddle
> with the number of partitions (Spark) or the number of reducers/mappers
> (Hadoop) to optimize performance? (In some cases this might be needed.)
> 5. How far along is the Python API? I don't see the specs in the Website.
>
> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <mb...@apache.org>
> wrote:
>
>> Dear Samarth,
>>
>> Besides the discussions you have mentioned [1] I can recommend one of our
>> recent presentations [2], especially the distinguishing Flink section (from
>> slide 16).
>>
>> It is generally a difficult question as both the systems are rapidly
>> evolving, so the answer can become outdated quite fast. However there are
>> fundamental design features that are highly unlikely to change, for example
>> Spark uses "true" batch processing, meaning that intermediate results are
>> materialized (mostly in memory) as RDDs. Flink's engine is internally more
>> like streaming, forwarding the results to the next operator asap. The
>> latter can yield performance benefits for more complex jobs. Flink also
>> gives you a query optimizer, spills gracefully to disk when the system runs
>> out of memory and has some cool features around serialization. For
>> performance numbers and some more insight please check out the presentation
>> [2] and do not hesitate to post a follow-up mail here if you come across
>> something unclear or extraordinary.
>>
>> [1]
>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>
>> Best,
>>
>> Marton
>>
>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
>> mailinglistsamarth@gmail.com> wrote:
>>
>>> Hey folks, I have a noob question.
>>>
>>> I already looked up the archives and saw a couple of discussions
>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>> about Spark and Flink.
>>>
>>> I am familiar with spark (the python API, esp MLLib), and I see many
>>> similarities between Flink and Spark.
>>>
>>> How does Flink distinguish itself from Spark?
>>>
>>
>>
>

Re: Flink and Spark

Posted by Samarth Mailinglist <ma...@gmail.com>.
Thank you for your answer. I have a couple of follow-up questions:
1. Does it support the 'exactly-once semantics' that Spark and Storm support?
2. (Related to 1) What happens when an error occurs during processing?
3. Is there a plan for adding machine learning support on top of Flink? Say
Alternating Least Squares, or basic Naive Bayes?
4. When you say Flink manages itself, does it mean I don't have to fiddle
with the number of partitions (Spark) or the number of reducers/mappers
(Hadoop) to optimize performance? (In some cases this might be needed.)
5. How far along is the Python API? I don't see the specs on the website.

On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <mb...@apache.org> wrote:

> Dear Samarth,
>
> Besides the discussions you have mentioned [1] I can recommend one of our
> recent presentations [2], especially the distinguishing Flink section (from
> slide 16).
>
> It is generally a difficult question as both the systems are rapidly
> evolving, so the answer can become outdated quite fast. However there are
> fundamental design features that are highly unlikely to change, for example
> Spark uses "true" batch processing, meaning that intermediate results are
> materialized (mostly in memory) as RDDs. Flink's engine is internally more
> like streaming, forwarding the results to the next operator asap. The
> latter can yield performance benefits for more complex jobs. Flink also
> gives you a query optimizer, spills gracefully to disk when the system runs
> out of memory and has some cool features around serialization. For
> performance numbers and some more insight please check out the presentation
> [2] and do not hesitate to post a follow-up mail here if you come across
> something unclear or extraordinary.
>
> [1]
> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>
> Best,
>
> Marton
>
> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
> mailinglistsamarth@gmail.com> wrote:
>
>> Hey folks, I have a noob question.
>>
>> I already looked up the archives and saw a couple of discussions
>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>> about Spark and Flink.
>>
>> I am familiar with spark (the python API, esp MLLib), and I see many
>> similarities between Flink and Spark.
>>
>> How does Flink distinguish itself from Spark?
>>
>
>

Re: Flink and Spark

Posted by Márton Balassi <mb...@apache.org>.
Dear Samarth,

Besides the discussions you have mentioned [1] I can recommend one of our
recent presentations [2], especially the distinguishing Flink section (from
slide 16).

It is generally a difficult question, as both systems are rapidly
evolving, so any answer can become outdated quite fast. However, there are
fundamental design features that are highly unlikely to change. For example,
Spark uses "true" batch processing, meaning that intermediate results are
materialized (mostly in memory) as RDDs. Flink's engine is internally more
like streaming, forwarding the results to the next operator asap. The
latter can yield performance benefits for more complex jobs. Flink also
gives you a query optimizer, spills gracefully to disk when the system runs
out of memory and has some cool features around serialization. For
performance numbers and some more insight please check out the presentation
[2] and do not hesitate to post a follow-up mail here if you come across
something unclear or extraordinary.

[1]
http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
[2] http://www.slideshare.net/GyulaFra/flink-apachecon
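The materialized-vs-pipelined contrast above can be sketched in plain Java
(a toy illustration, not Spark or Flink code): the "batch" variant finishes
each stage and stores its full result before the next stage starts, while
the "pipelined" variant pushes every record through all operators as soon
as it is produced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {
    // "Batch" style: each intermediate result is fully materialized
    // (conceptually like an RDD) before the next stage begins.
    static List<Integer> batched(List<Integer> input) {
        List<Integer> doubled = new ArrayList<>();
        for (int x : input) doubled.add(x * 2);           // stage 1, materialized
        List<Integer> filtered = new ArrayList<>();
        for (int x : doubled) if (x > 4) filtered.add(x); // stage 2 starts only now
        return filtered;
    }

    // "Pipelined" style: each record flows through all operators immediately,
    // so downstream operators see results as soon as they exist.
    static List<Integer> pipelined(List<Integer> input) {
        List<Integer> out = new ArrayList<>();
        for (int x : input) {
            int y = x * 2;          // operator 1
            if (y > 4) out.add(y);  // operator 2 sees the record right away
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);
        System.out.println(batched(data));   // [6, 8]
        System.out.println(pipelined(data)); // [6, 8]
    }
}
```

Both variants compute the same answer; the difference is that the
pipelined one never holds a full intermediate dataset, which is the design
property the paragraph above attributes to Flink's engine.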

Best,

Marton

On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
mailinglistsamarth@gmail.com> wrote:

> Hey folks, I have a noob question.
>
> I already looked up the archives and saw a couple of discussions
> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
> about Spark and Flink.
>
> I am familiar with spark (the python API, esp MLLib), and I see many
> similarities between Flink and Spark.
>
> How does Flink distinguish itself from Spark?
>