Posted to user@spark.apache.org by Ovidiu-Cristian MARCU <ov...@inria.fr> on 2016/05/16 12:18:20 UTC

What / Where / When / How questions in Spark 2.0 ?

Hi,

We can see in [2] many interesting (and expected!) improvements (promises), such as extended SQL support, a unified API (DataFrames, Datasets), an improved engine (Tungsten draws on ideas from modern compilers and MPP databases, similar to Flink [3]), Structured Streaming, etc. It seems we are witnessing a smart unification of Big Data analytics (Spark and Flink: the best of both worlds)!

How does Spark respond to the What/Where/When/How questions, i.e. the capabilities it is still missing, highlighted in the unified Beam model [1]?
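To make the question concrete, here is a minimal, untested sketch of what the "What" (a grouped count) and "Where" (an event-time window) parts could look like against the Structured Streaming API of the Spark 2.0 technical preview; the input path, schema and column names are assumptions of mine, and as far as I can tell the preview exposes no Beam-style watermark or accumulation-mode setting:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.window
  import org.apache.spark.sql.types.{StructType, TimestampType, StringType}

  val spark = SparkSession.builder().appName("wwwh-sketch").getOrCreate()
  import spark.implicits._

  // Assumed input: JSON records carrying their own event-time column.
  val schema = new StructType()
    .add("eventTime", TimestampType)
    .add("word", StringType)
  val events = spark.readStream.schema(schema).json("/tmp/events")

  // "What": a count per word; "Where": 10-minute windows on the record's
  // own eventTime column rather than on arrival (processing) time.
  val counts = events
    .groupBy(window($"eventTime", "10 minutes"), $"word")
    .count()

  // "When"/"How": in the 2.0 preview the closest knobs seem to be triggers
  // and output modes; "complete" re-emits the full accumulated result on
  // each trigger.
  val query = counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()

  query.awaitTermination()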

Best,
Ovidiu

[1] https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
[2] https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
[3] http://stratosphere.eu/project/publications/



Re: What / Where / When / How questions in Spark 2.0 ?

Posted by Amit Sela <am...@gmail.com>.
I need to update this ;)
To start with, you could just take a look at branch-2.0.

On Sun, May 22, 2016, 01:23 Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Thank you, Amit! I was looking for this kind of information.
>
> I have not fully read your paper yet, but I see in it a TODO with basically
> the same question(s) [1]; maybe someone from the Spark team (including
> Databricks) will be so kind as to send some feedback.
>
> Best,
> Ovidiu
>
> [1] Integrate “Structured Streaming”: //TODO - What (and how) will Spark
> 2.0 support (out-of-order data, event-time windows, watermarks, triggers,
> accumulation modes), and how straightforward will it be to integrate with
> the Beam Model?
>
>
> On 21 May 2016, at 23:00, Sela, Amit <AN...@paypal.com> wrote:
>
> It seems I forgot to add the link to the “Technical Vision” paper, so here
> it is -
> https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing
>
> From: "Sela, Amit" <AN...@paypal.com>
> Date: Saturday, May 21, 2016 at 11:52 PM
> To: Ovidiu-Cristian MARCU <ov...@inria.fr>, "user @spark"
> <us...@spark.apache.org>
> Cc: Ovidiu Cristian Marcu <ov...@gmail.com>
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
>
> This is a “Technical Vision” paper for the Spark runner, which provides
> general guidelines for the future development of Spark’s Beam support as
> part of the Apache Beam (incubating) project.
> This is our JIRA -
> https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>
> Generally, I’m currently working on Datasets integration for batch (to
> replace RDDs) against Spark 1.6, and will then move towards enhancing
> stream processing capabilities with Structured Streaming (2.0).
>
> And you’re welcome to ask those questions on the Apache Beam (incubating)
> mailing list as well ;)
> http://beam.incubator.apache.org/mailing_lists/
>
> Thanks,
> Amit
>
> From: Ovidiu-Cristian MARCU <ov...@inria.fr>
> Date: Tuesday, May 17, 2016 at 12:11 AM
> To: "user @spark" <us...@spark.apache.org>
> Cc: Ovidiu Cristian Marcu <ov...@gmail.com>
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
>
> Could you please consider a short answer regarding the Apache Beam
> Capability Matrix TODOs for the upcoming Spark 2.0 release [4]? (Some
> related references below: [5][6].)
>
> Thanks
>
> [4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
> [5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
> [6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU <
> ovidiu-cristian.marcu@inria.fr> wrote:
>
> Hi,
>
> We can see in [2] many interesting (and expected!) improvements (promises),
> such as extended SQL support, a unified API (DataFrames, Datasets), an
> improved engine (Tungsten draws on ideas from modern compilers and MPP
> databases, similar to Flink [3]), Structured Streaming, etc. It seems we
> are witnessing a smart unification of Big Data analytics (Spark and Flink:
> the best of both worlds)!
>
> *How does Spark respond to the What/Where/When/How questions, i.e. the
> capabilities it is still missing, highlighted in the unified Beam model [1]?*
>
> Best,
> Ovidiu
>
> [1]
> https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
> [2]
> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
> [3] http://stratosphere.eu/project/publications/
>
>
>
>
>

Re: What / Where / When / How questions in Spark 2.0 ?

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
Thank you, Amit! I was looking for this kind of information.

I have not fully read your paper yet, but I see in it a TODO with basically the same question(s) [1]; maybe someone from the Spark team (including Databricks) will be so kind as to send some feedback.

Best,
Ovidiu

[1] Integrate “Structured Streaming”: //TODO - What (and how) will Spark 2.0 support (out-of-order data, event-time windows, watermarks, triggers, accumulation modes), and how straightforward will it be to integrate with the Beam Model?
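For reference, an untested sketch of the closest knobs I can find in the 2.0 technical preview for the “When”/“How” items in that list; it assumes a streaming aggregation named `counts`, like the windowed count sketched earlier in this thread, and I do not see any Beam-style watermark or accumulation-mode setting in the preview:

  import org.apache.spark.sql.streaming.ProcessingTime

  val query = counts.writeStream
    .trigger(ProcessingTime("30 seconds")) // "When": fire every 30 seconds of processing time
    .outputMode("complete")                // "How": re-emit the full accumulated result on each firing
    .format("console")
    .start()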


> On 21 May 2016, at 23:00, Sela, Amit <AN...@paypal.com> wrote:
> 
> It seems I forgot to add the link to the “Technical Vision” paper, so here it is - https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing
> 
> From: "Sela, Amit" <ANSELA@paypal.com <ma...@paypal.com>>
> Date: Saturday, May 21, 2016 at 11:52 PM
> To: Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr <ma...@inria.fr>>, "user @spark" <user@spark.apache.org <ma...@spark.apache.org>>
> Cc: Ovidiu Cristian Marcu <ovidiu21marcu@gmail.com <ma...@gmail.com>>
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
> 
> This is a “Technical Vision” paper for the Spark runner, which provides general guidelines for the future development of Spark’s Beam support as part of the Apache Beam (incubating) project.
> This is our JIRA - https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
> 
> Generally, I’m currently working on Datasets integration for batch (to replace RDDs) against Spark 1.6, and will then move towards enhancing stream processing capabilities with Structured Streaming (2.0).
> 
> And you’re welcome to ask those questions on the Apache Beam (incubating) mailing list as well ;)
> http://beam.incubator.apache.org/mailing_lists/
> 
> Thanks,
> Amit
> 
> From: Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr>
> Date: Tuesday, May 17, 2016 at 12:11 AM
> To: "user @spark" <user@spark.apache.org>
> Cc: Ovidiu Cristian Marcu <ovidiu21marcu@gmail.com>
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
> 
> Could you please consider a short answer regarding the Apache Beam Capability Matrix TODOs for the upcoming Spark 2.0 release [4]? (Some related references below: [5][6].)
> 
> Thanks
> 
> [4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
> [5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
> [6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
> 
>> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr> wrote:
>> 
>> Hi,
>> 
>> We can see in [2] many interesting (and expected!) improvements (promises), such as extended SQL support, a unified API (DataFrames, Datasets), an improved engine (Tungsten draws on ideas from modern compilers and MPP databases, similar to Flink [3]), Structured Streaming, etc. It seems we are witnessing a smart unification of Big Data analytics (Spark and Flink: the best of both worlds)!
>> 
>> How does Spark respond to the What/Where/When/How questions, i.e. the capabilities it is still missing, highlighted in the unified Beam model [1]?
>> 
>> Best,
>> Ovidiu
>> 
>> [1] https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
>> [2] https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
>> [3] http://stratosphere.eu/project/publications/
>> 
>> 
> 


Re: What / Where / When / How questions in Spark 2.0 ?

Posted by "Sela, Amit" <AN...@paypal.com.INVALID>.
It seems I forgot to add the link to the “Technical Vision” paper, so here it is - https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing

From: "Sela, Amit" <AN...@paypal.com>>
Date: Saturday, May 21, 2016 at 11:52 PM
To: Ovidiu-Cristian MARCU <ov...@inria.fr>>, "user @spark" <us...@spark.apache.org>>
Cc: Ovidiu Cristian Marcu <ov...@gmail.com>>
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

This is a “Technical Vision” paper for the Spark runner, which provides general guidelines for the future development of Spark’s Beam support as part of the Apache Beam (incubating) project.
This is our JIRA - https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Generally, I’m currently working on Datasets integration for batch (to replace RDDs) against Spark 1.6, and will then move towards enhancing stream processing capabilities with Structured Streaming (2.0).

And you’re welcome to ask those questions on the Apache Beam (incubating) mailing list as well ;)
http://beam.incubator.apache.org/mailing_lists/

Thanks,
Amit

From: Ovidiu-Cristian MARCU <ov...@inria.fr>
Date: Tuesday, May 17, 2016 at 12:11 AM
To: "user @spark" <us...@spark.apache.org>
Cc: Ovidiu Cristian Marcu <ov...@gmail.com>
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

Could you please consider a short answer regarding the Apache Beam Capability Matrix TODOs for the upcoming Spark 2.0 release [4]? (Some related references below: [5][6].)

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:

Hi,

We can see in [2] many interesting (and expected!) improvements (promises), such as extended SQL support, a unified API (DataFrames, Datasets), an improved engine (Tungsten draws on ideas from modern compilers and MPP databases, similar to Flink [3]), Structured Streaming, etc. It seems we are witnessing a smart unification of Big Data analytics (Spark and Flink: the best of both worlds)!

How does Spark respond to the What/Where/When/How questions, i.e. the capabilities it is still missing, highlighted in the unified Beam model [1]?

Best,
Ovidiu

[1] https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
[2] https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
[3] http://stratosphere.eu/project/publications/




Re: What / Where / When / How questions in Spark 2.0 ?

Posted by "Sela, Amit" <AN...@paypal.com.INVALID>.
This is a “Technical Vision” paper for the Spark runner, which provides general guidelines for the future development of Spark’s Beam support as part of the Apache Beam (incubating) project.
This is our JIRA - https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Generally, I’m currently working on Datasets integration for batch (to replace RDDs) against Spark 1.6, and will then move towards enhancing stream processing capabilities with Structured Streaming (2.0).
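Not the runner’s actual code, but to give a rough idea of what the RDD-to-Dataset move means in Spark 1.6 terms, here is a small, self-contained sketch; the class and value names are made up for illustration:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  case class Event(user: String, bytes: Long)

  val sc = new SparkContext(new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val events = Seq(Event("a", 10L), Event("b", 20L), Event("a", 5L))

  // RDD style: purely functional, but opaque to the Catalyst/Tungsten optimizer.
  val byUserRdd = sc.parallelize(events)
    .map(e => (e.user, e.bytes))
    .reduceByKey(_ + _)

  // Dataset style (experimental in 1.6): the same typed logic, but planned by Catalyst.
  val byUserDs = sqlContext.createDataset(events)
    .groupBy(_.user)
    .mapGroups { (user, evs) => (user, evs.map(_.bytes).sum) }

  byUserRdd.collect().foreach(println)
  byUserDs.collect().foreach(println)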

And you’re welcome to ask those questions on the Apache Beam (incubating) mailing list as well ;)
http://beam.incubator.apache.org/mailing_lists/

Thanks,
Amit

From: Ovidiu-Cristian MARCU <ov...@inria.fr>
Date: Tuesday, May 17, 2016 at 12:11 AM
To: "user @spark" <us...@spark.apache.org>
Cc: Ovidiu Cristian Marcu <ov...@gmail.com>
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

Could you please consider a short answer regarding the Apache Beam Capability Matrix TODOs for the upcoming Spark 2.0 release [4]? (Some related references below: [5][6].)

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:

Hi,

We can see in [2] many interesting (and expected!) improvements (promises), such as extended SQL support, a unified API (DataFrames, Datasets), an improved engine (Tungsten draws on ideas from modern compilers and MPP databases, similar to Flink [3]), Structured Streaming, etc. It seems we are witnessing a smart unification of Big Data analytics (Spark and Flink: the best of both worlds)!

How does Spark respond to the What/Where/When/How questions, i.e. the capabilities it is still missing, highlighted in the unified Beam model [1]?

Best,
Ovidiu

[1] https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
[2] https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
[3] http://stratosphere.eu/project/publications/




Re: What / Where / When / How questions in Spark 2.0 ?

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
Could you please consider a short answer regarding the Apache Beam Capability Matrix TODOs for the upcoming Spark 2.0 release [4]? (Some related references below: [5][6].)

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:
> 
> Hi,
> 
> We can see in [2] many interesting (and expected!) improvements (promises), such as extended SQL support, a unified API (DataFrames, Datasets), an improved engine (Tungsten draws on ideas from modern compilers and MPP databases, similar to Flink [3]), Structured Streaming, etc. It seems we are witnessing a smart unification of Big Data analytics (Spark and Flink: the best of both worlds)!
> 
> How does Spark respond to the What/Where/When/How questions, i.e. the capabilities it is still missing, highlighted in the unified Beam model [1]?
> 
> Best,
> Ovidiu
> 
> [1] https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
> [2] https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
> [3] http://stratosphere.eu/project/publications/
> 
>