Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2016/06/28 19:59:11 UTC

Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

Looking at the Sink in 2.0 there is a warning (added in SPARK-16020 without
a lot of details) that says "Note: You cannot apply any operators on `data`
except consuming it (e.g., `collect/foreach`)." but I'm wondering if this
restriction is perhaps too broadly worded? Provided that we consume the
data in a blocking fashion could we apply some other transformation
beforehand? Or is there a better way to get equivalent foreachRDD
functionality with the structured streaming API?

On somewhat of a tangent - would it maybe make sense to mark transformations
on Datasets which are not supported for Streaming use (e.g. toJson etc.)?

Cheers,

Holden :)
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

Posted by Michael Armbrust <mi...@databricks.com>.
Yeah, turning it into an RDD should preserve the incremental planning.
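A hypothetical sketch of that workaround (class name and the write logic are invented for illustration; `queryExecution.toRdd` is a real but internal API under org.apache.spark.sql.execution, subject to change between releases):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Sketch only: instead of applying Dataset operators to `data`, drop down to
// the RDD that backs the already-planned batch via `queryExecution.toRdd`.
class RddWorkaroundSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // `toRdd` exposes the incrementally planned query as an RDD[InternalRow]
    // without constructing a new logical plan, so prior streaming state is
    // preserved.
    val rdd = data.queryExecution.toRdd
    // RDD transformations are now safe; consume in a blocking fashion.
    // InternalRow objects may be reused by the engine, hence the defensive
    // copy() before doing anything with them.
    rdd.map(_.copy()).foreachPartition { iter =>
      iter.foreach { row =>
        // ... write the row to the external system ...
      }
    }
  }
}
```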

On Tue, Jun 28, 2016 at 6:30 PM, Holden Karau <ho...@pigscanfly.ca> wrote:

> Ok, that makes sense (the JIRA where the restriction note was added didn't
> have a lot of details). So for now, would converting to an RDD inside of a
> custom Sink and then doing your operations on that be a reasonable work
> around?
>
>
> On Tuesday, June 28, 2016, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> This is not too broadly worded, and in general I would caution that any
>> interface in org.apache.spark.sql.catalyst or
>> org.apache.spark.sql.execution is considered internal and likely to change
>> in between releases.  We do plan to open a stable source/sink API in a
>> future release.
>>
>> The problem here is that the DataFrame is constructed using an
>> incrementalized physical query plan.  If you call any operations on the
>> DataFrame that change the logical plan, you will lose prior state and the
>> DataFrame will return an incorrect result.  Since this was discovered late
>> in the release process we decided it was better to document the current
>> behavior, rather than do a large refactoring.
>>
>> On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> Looking at the Sink in 2.0 there is a warning (added in SPARK-16020
>>> without a lot of details) that says "Note: You cannot apply any operators
>>> on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering
>>> if this restriction is perhaps too broadly worded? Provided that we consume
>>> the data in a blocking fashion could we apply some other transformation
>>> beforehand? Or is there a better way to get equivalent foreachRDD
>>> functionality with the structured streaming API?
>>>
>>> On somewhat of a tangent - would it maybe make sense to mark
>>> transformations on Datasets which are not supported for Streaming use (e.g.
>>> toJson etc.)?
>>>
>>> Cheers,
>>>
>>> Holden :)
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

Posted by Holden Karau <ho...@pigscanfly.ca>.
Ok, that makes sense (the JIRA where the restriction note was added didn't
have a lot of details). So for now, would converting to an RDD inside of a
custom Sink and then doing your operations on that be a reasonable work
around?

On Tuesday, June 28, 2016, Michael Armbrust <mi...@databricks.com> wrote:

> This is not too broadly worded, and in general I would caution that any
> interface in org.apache.spark.sql.catalyst or
> org.apache.spark.sql.execution is considered internal and likely to change
> in between releases.  We do plan to open a stable source/sink API in a
> future release.
>
> The problem here is that the DataFrame is constructed using an
> incrementalized physical query plan.  If you call any operations on the
> DataFrame that change the logical plan, you will lose prior state and the
> DataFrame will return an incorrect result.  Since this was discovered late
> in the release process we decided it was better to document the current
> behavior, rather than do a large refactoring.
>
> On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <holden@pigscanfly.ca> wrote:
>
>> Looking at the Sink in 2.0 there is a warning (added in SPARK-16020
>> without a lot of details) that says "Note: You cannot apply any operators
>> on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering
>> if this restriction is perhaps too broadly worded? Provided that we consume
>> the data in a blocking fashion could we apply some other transformation
>> beforehand? Or is there a better way to get equivalent foreachRDD
>> functionality with the structured streaming API?
>>
>> On somewhat of a tangent - would it maybe make sense to mark
>> transformations on Datasets which are not supported for Streaming use (e.g.
>> toJson etc.)?
>>
>> Cheers,
>>
>> Holden :)
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

Posted by Michael Armbrust <mi...@databricks.com>.
This is not too broadly worded, and in general I would caution that any
interface in org.apache.spark.sql.catalyst or
org.apache.spark.sql.execution is considered internal and likely to change
in between releases.  We do plan to open a stable source/sink API in a
future release.

The problem here is that the DataFrame is constructed using an
incrementalized physical query plan.  If you call any operations on the
DataFrame that change the logical plan, you will lose prior state and the
DataFrame will return an incorrect result.  Since this was discovered late
in the release process we decided it was better to document the current
behavior, rather than do a large refactoring.
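For context, the interface in question (as of Spark 2.0, in org.apache.spark.sql.execution.streaming; reproduced here from memory, and internal per the caveat above) is roughly:

```scala
import org.apache.spark.sql.DataFrame

// Internal trait implemented by custom sinks in Spark 2.0. `data` is backed
// by the incrementalized physical plan described above, so operators that
// change its logical plan (select, filter, toJSON, ...) would trigger
// replanning and lose accumulated state -- hence the "only consume it
// (collect/foreach)" warning in the docs.
trait Sink {
  def addBatch(batchId: Long, data: DataFrame): Unit
}
```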

On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <ho...@pigscanfly.ca> wrote:

> Looking at the Sink in 2.0 there is a warning (added in SPARK-16020
> without a lot of details) that says "Note: You cannot apply any operators
> on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering
> if this restriction is perhaps too broadly worded? Provided that we consume
> the data in a blocking fashion could we apply some other transformation
> beforehand? Or is there a better way to get equivalent foreachRDD
> functionality with the structured streaming API?
>
> On somewhat of a tangent - would it maybe make sense to mark transformations
> on Datasets which are not supported for Streaming use (e.g. toJson etc.)?
>
> Cheers,
>
> Holden :)
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>