Posted to user@flink.apache.org by Sourav Mazumder <so...@gmail.com> on 2015/12/24 16:48:53 UTC

What is the equivalent of Spark RDD in Flink

Hi,

I am new to Flink. Trying to understand some of the basics of Flink.

What is the equivalent of Spark's RDD in Flink? In my understanding the
closest thing is the DataSet API, but I wanted to reconfirm.

Also, using the DataSet API, if I ingest a large volume of data (val lines :
DataSet[String] = env.readTextFile(<some file path and name>)) which may
not fit in a single slave node, will that data get automatically distributed
in the memory of the other slave nodes?

Regards,
Sourav

Re: What is the equivalent of Spark RDD in Flink

Posted by Chiwan Park <ch...@apache.org>.
About question 1:

Scheduling the job only once for an iterative computation is one of the factors behind the performance difference. Dongwon’s slides [1] are helpful for understanding the other performance factors.

[1] http://flink-forward.org/?session=a-comparative-performance-evaluation-of-flink


Regards,
Chiwan Park


Re: What is the equivalent of Spark RDD in Flink

Posted by Stephan Ewen <se...@apache.org>.
Concerning question (2):

DataSets in Flink are in most cases not materialized at all; they
represent in-flight data as it is being streamed from one operation to the
next (remember, Flink is streaming at its core). So even in a MapReduce-style
program, the DataSet produced by the MapFunction never exists as
a whole, but is continuously produced and streamed to the ReduceFunction.

The operator that executes the ReduceFunction materializes the data as part
of its sorting operation. All materializing batch operations (sort / hash /
cache / ...) can go out of core very reliably.
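
As a rough illustration of such a pipelined program, here is a minimal Scala sketch against the DataSet API (the input path and the word-count logic are illustrative placeholders, not taken from this thread):

    import org.apache.flink.api.scala._

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // The intermediate DataSet built by flatMap/map never exists as a
        // whole: its records are streamed into the sort of the grouping
        // operator, which materializes (and, if needed, spills) the data.
        val counts = env.readTextFile("hdfs:///path/to/input") // placeholder path
          .flatMap(_.toLowerCase.split("\\W+"))
          .filter(_.nonEmpty)
          .map(word => (word, 1))
          .groupBy(0)
          .sum(1)

        counts.print() // acts as a sink and triggers execution
      }
    }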

Greetings,
Stephan




Re: What is the equivalent of Spark RDD in Flink

Posted by Sourav Mazumder <so...@gmail.com>.
Hi Aljoscha and Chiwan,

Firstly thanks for the inputs.

Couple of follow ups -

1. Based on Chiwan's explanation and the links, my understanding is that a
potential performance difference between Spark and Flink (during iterative
computation, like building a model using a machine learning algorithm) comes
from the overhead of starting a new set of tasks/operators for each
iteration. The other overheads would be the same, as both store the
intermediate results in memory. Is this understanding correct?

2. In the case of Flink, what happens if a DataSet needs to hold more data
than the total memory available across all the slave nodes? Will it spill
the data to the disks of the respective slave nodes by default?

Regards,
Sourav



Re: What is the equivalent of Spark RDD in Flink

Posted by Chiwan Park <ch...@apache.org>.
Hi Filip,

Spark also executes jobs lazily, but it is slightly different from Flink. Flink can lazily execute a whole job in cases where Spark cannot. One example is an iterative job.

In Spark, each stage of the iteration is submitted, scheduled, and executed as a separate job, because an action is called at the end of each iteration. In Flink, even though the job contains an iteration, the user submits only one job; the Flink cluster schedules and runs that job once.

Because of this difference, in Spark the user must make additional decisions, such as “Which RDDs should be cached or uncached?”.

Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian’s answer on SO [2] are helpful for understanding this difference. :)

[1]: http://www.slideshare.net/GyulaFra/flink-apachecon
[2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
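
As a rough sketch of what this looks like in code (a toy example assuming the Scala DataSet API's iterate operator; the step function is a placeholder), the whole loop below belongs to one submitted job that the cluster schedules once:

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment

    // A bulk iteration: all 10 steps are part of a single Flink job that is
    // scheduled once, rather than one job per iteration as in Spark.
    val initial: DataSet[Double] = env.fromElements(1.0)
    val result = initial.iterate(10) { current =>
      current.map(_ / 2.0) // placeholder step function
    }
    result.print()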


Regards,
Chiwan Park




Re: What is the equivalent of Spark RDD in Flink

Posted by Filip Łęczycki <fi...@gmail.com>.
Hi Aljoscha,

Sorry for going a little off-topic, but I wanted to clarify whether my
understanding is right. You said that "Contrary to Spark, a Flink job is
executed lazily", however, as I read in available sources, for example
http://spark.apache.org/docs/latest/programming-guide.html, chapter "RDD
Operations": "The transformations are only computed when an action
requires a result to be returned to the driver program." To my
understanding, Spark implements the same lazy execution principle as Flink:
the job is only executed when a data sink/action/execute is called, and
before that only an execution plan is built. Is that correct, or are
there other significant differences between Spark's and Flink's lazy
execution approaches that I failed to grasp?

Best regards,
Filip Łęczycki



Re: What is the equivalent of Spark RDD in Flink

Posted by Aljoscha Krettek <al...@apache.org>.
Hi Sourav,
you are right, in Flink the equivalent to an RDD would be a DataSet (or a
DataStream if you are working with the streaming API).

Contrary to Spark, a Flink job is executed lazily when
ExecutionEnvironment.execute() is called. Only then does Flink build an
executable program from the graph of transformations that was built by
calling the transformation methods on DataSet. That’s why I called it lazy.
The operations will also be automatically parallelized. The parallelism of
operations can either be configured in the cluster configuration
(conf/flink-conf.yaml), on a per job basis
(ExecutionEnvironment.setParallelism(int)) or per operation, by calling
setParallelism(int) on a DataSet.

(Above you can always replace DataSet by DataStream, the same explanations
hold.)
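
For illustration, here is a minimal Scala sketch of the per-job and per-operation settings (the parallelism values and the path are placeholders):

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(8) // per-job setting, overrides the cluster default

    // The read is split across the parallel instances of the source.
    val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/big/file")

    val lengths = lines
      .map(_.length)
      .setParallelism(4) // per-operation override for this map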

So, to get back to your question, yes, the operation of reading the file
(or files in a directory) will be parallelized to several worker nodes
based on the previously mentioned settings.

Let us know if you need more information.

Cheers,
Aljoscha
