Posted to user@spark.apache.org by "Yuval.Itzchakov" <yu...@gmail.com> on 2016/05/15 10:52:24 UTC

Structured Streaming in Spark 2.0 and DStreams

I've been reading and watching videos about the upcoming Spark 2.0 release, which
brings us Structured Streaming. One thing I've yet to understand is how it
relates to the current way of working with streaming in Spark via the
DStream abstraction.

All the examples I can find, in the Spark repository and in various videos, show
someone streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
the source, SparkSession seems to be defined inside org.apache.spark.sql, which
gives me a hunch that this is all somehow related to SQL and the like,
and not really to DStreams.

What I'm failing to understand is: will this feature impact how we do
streaming today? Will I be able to consume a Kafka source in a streaming
fashion (like we do today when we open a stream using KafkaUtils)? Will we
be able to do stateful operations on a Dataset[T] like we do today using
MapWithStateRDD? Or will there only be a subset of operations that the
Catalyst optimizer can understand, such as aggregations?

I'd be happy if anyone could shed some light on this.
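
To make the question concrete, here's roughly what we do today: a minimal
sketch assuming Spark 1.6 and the spark-streaming-kafka artifact (broker
address, topic and state logic are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("dstream-example"), Seconds(10))
ssc.checkpoint("/tmp/checkpoint") // required for stateful operations

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// Running count per key; the state lives in MapWithStateRDDs under the hood.
val updateCount = (key: String, value: Option[String], state: State[Long]) => {
  val count = state.getOption.getOrElse(0L) + 1
  state.update(count)
  (key, count)
}
stream.mapWithState(StateSpec.function(updateCount)).print()

ssc.start()
ssc.awaitTermination()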



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Benjamin Kim <bb...@gmail.com>.
Ofir,

Thanks for the clarification. I was confused for a moment. The links will be very helpful.




Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Ofir Manor <of...@equalum.io>.
Ben,
I'm just a Spark user, but at least at the March Spark Summit, that was the
main term used.
Taking a step back from the details, maybe this new post from Reynold is a
better intro to the Spark 2.0 highlights:
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

If you want to drill down, go to SPARK-8360 "Structured Streaming (aka
Streaming DataFrames)". The design doc (written by Reynold in March) is
very readable:
 https://issues.apache.org/jira/browse/SPARK-8360

Regarding directly querying (SQL) the state managed by a streaming process
- I don't know if that will land in 2.0 or only later.

Hope that helps,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io


Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Benjamin Kim <bb...@gmail.com>.
Hi Ofir,

I just recently saw the webinar with Reynold Xin. He mentioned the SparkSession unification efforts, but I don’t remember him mentioning the Dataset for Structured Streaming, aka Continuous Applications as he put it. He did mention streaming, or unlimited, DataFrames for Structured Streaming, so one can directly query the data from them. Has something changed since then?

Thanks,
Ben




Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Do...@ODDO, od...@gmail.com.
On 5/16/2016 9:53 AM, Yuval Itzchakov wrote:
> AFAIK, the underlying data represented by the Dataset[T] abstraction
> will be stored in Tachyon under the hood, but as with RDDs, it will be
> spilled to local disk on the worker if needed.

There is another option in the case of RDDs: the Apache Ignite project, a
memory grid/distributed cache that supports Spark RDDs. The nice thing
about Ignite is that everything is done automatically for you. You can
also duplicate caches for resiliency, load caches from disk, partition
them, etc., and you get automatic spillover to SQL (and NoSQL) capable
backends via read/write-through capabilities. I think there is also an
effort to support DataFrames. Ignite supports standard SQL for querying
the caches too.
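
For illustration, a rough sketch of what that looks like with the
ignite-spark module (the cache name and configuration are made up, and
exact signatures vary across Ignite versions):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ignite-demo"))

// IgniteContext wires Ignite nodes into the Spark workers; IgniteRDD is a
// live, mutable view over the distributed "wordCounts" cache.
val ic = new IgniteContext[String, Int](sc, () => new IgniteConfiguration())
val sharedRdd = ic.fromCache("wordCounts")

// Writes outlive the Spark job; other jobs (or plain Ignite clients) see them.
sharedRdd.savePairs(sc.parallelize(Seq(("spark", 1), ("ignite", 2))))

// Standard SQL over the cache (assumes the cache is configured with indexed types).
val big = sharedRdd.sql("select _key, _val from Integer where _val > 1")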




Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Yuval Itzchakov <yu...@gmail.com>.
AFAIK, the underlying data represented by the Dataset[T] abstraction
will be stored in Tachyon under the hood, but as with RDDs, it will be
spilled to local disk on the worker if needed.


Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Benjamin Kim <bb...@gmail.com>.
I have a curiosity question. These forever/unlimited DataFrames/Datasets will persist and be queryable. I’m still foggy about how this data will be stored. As far as I know, memory is finite. Will the data be spilled to disk and be retrievable if a query spans data not in memory? Is Tachyon (Alluxio), HDFS (Parquet), NoSQL (HBase, Cassandra), an RDBMS (PostgreSQL, MySQL), an object store (S3, Swift), or anything else I can’t think of going to be the underlying near-real-time storage system?

Thanks,
Ben



Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Yuval Itzchakov <yu...@gmail.com>.
Oh, that looks neat! Thx, will read up on that.


Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Ofir Manor <of...@equalum.io>.
Yuval,
Not sure what is in scope to land in 2.0, but there is another new infra bit
for managing state more efficiently, called the State Store, whose initial
version is already committed:
   SPARK-13809 - State Store: A new framework for state management for
computing Streaming Aggregates
https://issues.apache.org/jira/browse/SPARK-13809
The pull request eventually links to the design doc, which discusses the
limits of updateStateByKey and mapWithState and how those will be
handled...

At a quick glance at the code, it seems to be used already in streaming
aggregations.
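
For example, a streaming aggregation along these lines (a sketch in the
2.0-preview API shape; the source path, schema and checkpoint location are
made up) would keep its running state in the State Store between
micro-batches:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("state-store-demo").getOrCreate()

val schema = new StructType().add("user", StringType).add("amount", LongType)
val events = spark.readStream.schema(schema).json("/data/incoming")

// The running per-user sum keeps its intermediate state across micro-batches.
val totals = events.groupBy("user").sum("amount")

totals.writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/state-demo") // state and offsets live here
  .format("console")
  .start()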

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io


Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Yuval Itzchakov <yu...@gmail.com>.
Also, re-reading the relevant part from the Structured Streaming
documentation (
https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x
):

Discretized streams (aka dstream)

Unlike Storm, dstream exposes a higher level API similar to RDDs. There are
two main challenges with dstream:

1. Similar to Storm, it exposes a monotonic system (processing) time
   metric, and makes support for event time difficult.
2. Its APIs are tied to the underlying microbatch execution model, and as a
   result lead to inflexibilities such as changing the underlying batch
   interval would require changing the window size.


RQ addresses the above:

1. RQ operations support both system time and event time.
2. RQ APIs are decoupled from the underlying execution model. As a matter
   of fact, it is possible to implement an alternative engine that is not
   microbatch-based for RQ.
3. In addition, due to the declarative specification of operations, RQ
   leverages a relational query optimizer and can often generate more
   efficient query plans.


This doesn't seem to address the actual underlying implementation of how
things like "mapWithState" are going to be translated into RQ, and I think
that's the hole that's causing my misunderstanding.
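
Point 1 at least is easy to picture. A hypothetical sketch (source and
schema are made up, and I'm assuming the timestamp column parses as a
timestamp type): the window is expressed over an event-time column as a
plain relational grouping, not as a DStream window tied to the batch
interval.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("event-time-windows").getOrCreate()
import spark.implicits._

// events carries an event-time column named "timestamp" plus a "word" column.
val events = spark.read.json("/data/events")

// Windows are computed from event time in the data, not from arrival time,
// and the window size is independent of any batch interval.
val counts = events
  .groupBy(window($"timestamp", "10 minutes"), $"word")
  .count()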


Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Yuval Itzchakov <yu...@gmail.com>.
Hi Ofir,
Thanks for the elaborate answer. I have read both documents; they only touch
lightly on infinite DataFrames/Datasets. However, they do not go into depth
on how existing transformations on DStreams, for example, will be translated
into the Dataset APIs. I've been browsing the 2.0 branch and have not yet
been able to understand how they correlate.

Also, placing SparkSession in the sql package seems like a peculiar choice,
since this is going to be the global abstraction over
SparkContext/StreamingContext from now on.


Re: Structured Streaming in Spark 2.0 and DStreams

Posted by Ofir Manor <of...@equalum.io>.
Hi Yuval,
let me share my understanding based on similar questions I had.
First, Spark 2.x aims to replace a whole bunch of its APIs with just two
main ones: SparkSession (replacing the Hive/SQL/Spark contexts) and Dataset
(a merger of Dataset and DataFrame, which is why it inherits all the
Spark SQL goodness), while RDD seems to remain a low-level API only for
special cases. The new Dataset should also support both batch and streaming,
eventually replacing DStream as well. See the design docs in SPARK-13485
(unified API) and SPARK-8360 (Structured Streaming) for a good intro.
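
To illustrate (a minimal sketch, assuming the API keeps the preview's
shape; the path and columns are made up), the same entry point and the same
Dataset operations are meant to cover both batch and streaming:

import org.apache.spark.sql.SparkSession

// One entry point instead of SparkContext/SQLContext/HiveContext.
val spark = SparkSession.builder().appName("unified-api").getOrCreate()

// Batch: a bounded Dataset/DataFrame...
val batch = spark.read.json("/data/events")
val batchCounts = batch.groupBy("status").count()

// ...and streaming: the same source read as an unbounded DataFrame,
// with the same groupBy/count expressed on it.
val stream = spark.readStream.schema(batch.schema).json("/data/events")
val query = stream.groupBy("status").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
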
However, as you noted, not all of this will be fully delivered in 2.0. For
example, it seems that streaming from / to Kafka using Structured Streaming
didn't make it (so far?) into 2.0 (which is a showstopper for me).
Anyway, as far as I understand, you should be able to apply stateful
operators (non-RDD) on Datasets (for example, the new event-time window
processing, SPARK-8360). The gap I see is mostly the limited set of
streaming sources / sinks migrated to the new (richer) API and semantics.
Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples
will align with the current offering...


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
