Posted to user@spark.apache.org by Lars Francke <la...@gmail.com> on 2019/06/25 20:02:00 UTC

Spark Structured Streaming Custom Sources confusion

Hi,

I'm a bit confused about the current state and the future plans of custom
data sources in Structured Streaming.

So for DStreams we could write a Receiver as documented. Can this be used
with Structured Streaming?
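
(For reference, this is the kind of Receiver I mean; a minimal sketch, with a made-up class name and payload:)

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // DStream-style receiver: emits a counter value once per second.
    // Attached via ssc.receiverStream(new CounterReceiver) on a StreamingContext.
    class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

      override def onStart(): Unit = {
        // onStart() must not block, so feed data from a separate thread via store().
        new Thread("counter-receiver") {
          override def run(): Unit = {
            var i = 0L
            while (!isStopped()) {
              store(s"event-$i")
              i += 1
              Thread.sleep(1000)
            }
          }
        }.start()
      }

      // Nothing to clean up: the thread above exits once isStopped() is true.
      override def onStop(): Unit = {}
    }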

Then we had the DataSource API with DefaultSource et al., which was (in my
opinion) never properly documented.
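
(My understanding of the v1 streaming entry point, pieced together from the
Spark 2.x sources; the class and short name here are made up:)

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.execution.streaming.Source
    import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // v1: Spark instantiates a class named DefaultSource found in the package
    // handed to .format(), or resolves the short name registered below.
    class DefaultSource extends StreamSourceProvider with DataSourceRegister {

      override def shortName(): String = "my-source"

      private val mySchema = StructType(StructField("value", StringType) :: Nil)

      override def sourceSchema(
          sqlContext: SQLContext,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): (String, StructType) =
        (shortName(), mySchema)

      override def createSource(
          sqlContext: SQLContext,
          metadataPath: String,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): Source =
        ??? // a Source implements schema, getOffset, getBatch(start, end) and stop()
    }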

With Spark 2.3 we got a new DataSourceV2 (which was also just a marker
interface), again not properly documented.
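
(Again just my reading of it; the 2.3/2.4 entry point looked roughly like
this, class name made up:)

    import java.util.Optional

    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
    import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader
    import org.apache.spark.sql.types.StructType

    // DataSourceV2 itself declares nothing; capabilities are mixed in.
    class MyStreamingSource extends DataSourceV2 with MicroBatchReadSupport {

      override def createMicroBatchReader(
          schema: Optional[StructType],
          checkpointLocation: String,
          options: DataSourceOptions): MicroBatchReader =
        ??? // the reader tracks offsets and plans the input partitions
    }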

Now, with Spark 3, this seems to change again
(https://issues.apache.org/jira/browse/SPARK-25390): the DataSourceV2 marker
interface is gone, there is still no documentation, but somehow it is still
called v2?
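
(From what I can tell, the entry point in the 3.x line becomes TableProvider
under org.apache.spark.sql.connector; a sketch based only on my reading of the
new API, class name made up:)

    import java.util

    import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // No DataSourceV2 marker any more: a source hands Spark a Table, and the
    // Table's capabilities declare batch / micro-batch / continuous support.
    class MyTableProvider extends TableProvider {

      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        ??? // schema discovery

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table =
        ??? // a Table whose ScanBuilder eventually yields the actual scan
    }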

Can anyone shed some light on the current state of data sources & sinks for
batch & streaming in Spark 2.4 and 3.x?

Thank you!

Cheers,
Lars

Re: Spark Structured Streaming Custom Sources confusion

Posted by Lars Francke <la...@gmail.com>.
Hi Gabor,

Sure, but DSv2 seems to be undergoing backward-incompatible changes from
Spark 2 -> 3, right? That, combined with the fact that the API is still pretty
new, doesn't instill confidence in its stability (API-wise, I mean).

Cheers,
Lars


Re: Spark Structured Streaming Custom Sources confusion

Posted by Gabor Somogyi <ga...@gmail.com>.
Hi Lars,

DSv2 is already used in production.

As for documentation: since Spark is evolving fast, I would take a look at how
the built-in connectors are implemented.
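
For instance, the built-in "rate" source is small enough to read in one
sitting, and you can poke at it from the user side (assuming an existing
SparkSession named spark):

    // the "rate" source generates (timestamp, value) rows at a fixed pace;
    // its implementation is a compact reference for a streaming source.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()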

BR,
G



Re: Spark Structured Streaming Custom Sources confusion

Posted by Lars Francke <la...@gmail.com>.
Gabor,

thank you. That is immensely helpful. DataSource v1 it is, then. Does that
mean DSv2 is not really ready for production use yet?

Any idea what the best documentation would be? I'd probably start by
looking at existing code.

Cheers,
Lars


Re: Spark Structured Streaming Custom Sources confusion

Posted by Gabor Somogyi <ga...@gmail.com>.
Hi Lars,

Structured Streaming doesn't support receivers at all, so that source/sink
can't be used there.

Data Source v2 is under development and therefore a moving target, so I
suggest implementing against v1 (unless you need features that only v2
provides).
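
On the user side, a v1 source is wired in via format(); a minimal sketch (the
package name here is made up):

    // Spark resolves either a short name registered via DataSourceRegister
    // or a class named DefaultSource inside the package given to format().
    val df = spark.readStream
      .format("com.example.mysource")
      .load()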
Additionally, since I've just migrated the Kafka batch source/sink, I can say
it's doable to move from v1 to v2 when the time comes.
(Please see https://github.com/apache/spark/pull/24738. Worth mentioning: that
one covers batch, not streaming, but there is a similar PR.)
Dropping v1 won't happen any time soon, though...

BR,
G

