Posted to user@spark.apache.org by S <sh...@gmail.com> on 2021/05/31 18:31:06 UTC

Spark Structured Streaming

Hi,

I am using Structured Streaming on Azure HDInsight. The Spark version is 2.4.6.

I am trying to understand the micro-batch modes: the default (unspecified)
trigger and the fixed-interval trigger. Does the fixed-interval micro-batch
follow something similar to the receiver-based model, where records keep
getting pulled and stored into blocks for the duration of the interval, at
the end of which a job is kicked off? Or does the job just process the
current micro-batch, sleep for the rest of the interval, and pull records
only at the end of the interval?
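
For concreteness, this is roughly what I mean by the two modes (a minimal
sketch, assuming the built-in rate source and placeholder checkpoint paths;
my real job uses a different source):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("trigger-modes").getOrCreate()

    // Placeholder source for illustration only.
    val df = spark.readStream.format("rate").load()

    // Default mode: no trigger given; a new micro-batch starts as soon as
    // the previous one finishes, provided new data is available.
    val defaultQuery = df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/ck-default")
      .start()

    // Fixed-interval mode: micro-batches are kicked off every 30 seconds.
    val fixedQuery = df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/ck-fixed")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()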

I am fully aware of the two DStream models: receiver-based and direct. I am
just trying to figure out whether either of these two models was reused in
Structured Streaming.

Regards,
Sheel

Re: Spark Structured Streaming

Posted by S <sh...@gmail.com>.
Hi Mich,

I agree with you; Spark Streaming will become defunct in favor of
Structured Streaming. And I have gone over the document in detail. I am
aware of the unbounded datasets, running aggregates, etc.

Nevertheless, I wouldn't say it's a moot point, as it provides good
intuition for the evolution of the Spark Streaming model into the
Structured Streaming model. The micro-batch model is a commonality between
the DStream model and the Structured Streaming model, but it is a bit
unsettling to think that Structured Streaming would use the same old
receiver-based approach to achieve fixed-interval micro-batches. It does,
however, fit the direct-based approach. I just wanted to get this
conclusion that I arrived at verified by the broader contributing community.
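
For instance, with the Kafka source (a sketch only; brokers and topic are
placeholders), no receiver is started; as I understand it, each trigger
resolves an offset range and the batch reads exactly that range, much as
the direct DStream did:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("kafka-direct-style").getOrCreate()

    // No long-running receiver here; the source only tracks Kafka offsets.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "trades")                     // placeholder topic
      .load()

    // At each 60-second trigger the engine asks the source for the latest
    // offsets, plans a batch over (lastCommitted, latest], and the executors
    // pull exactly that range: direct-style rather than receiver-style.
    val query = df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/ck-kafka") // committed offsets live here
      .trigger(Trigger.ProcessingTime("60 seconds"))
      .start()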

Regards,
Sheel


Re: Spark Structured Streaming

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

I guess whether Structured Streaming (SS) inherited anything from Spark
Streaming is a moot point now, although it is a concept built on Spark
Streaming, which will be defunct soon.

Going forward, it all depends on what problem you are trying to address.

These are explained in the following doc
<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>

However, within SS micro-batching you have the concept of working out
running aggregates within a given timeframe, akin to Spark Streaming's
sliding window with a window length and sliding interval.
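
As an illustration (a sketch only; the source DataFrame and column names
are assumptions), a running aggregate over a sliding event-time window
looks like this:

    import org.apache.spark.sql.functions.{avg, col, window}

    // Assume a streaming DataFrame 'prices' with columns:
    // timestamp (event time), security, price.
    val windowedAvg = prices
      .withWatermark("timestamp", "10 minutes") // bound late data and state
      .groupBy(
        window(col("timestamp"), "10 minutes", "5 minutes"), // length, slide
        col("security"))
      .agg(avg("price").alias("avg_price"))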

The other approach relies on a fixed triggering mechanism to invoke a
function that performs some specific tasks (processing CDC, writing the
result set to a database, working out average prices per security, etc.) on
the streaming data that arrived in that triggering period.
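
A sketch of that pattern, reusing the windowedAvg example above (the JDBC
details are placeholders; foreachBatch is available from Spark 2.4):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.Trigger

    // Every 5 minutes the trigger fires and the function below runs once
    // over the micro-batch that accumulated in that period.
    val query = windowedAvg.writeStream
      .outputMode("update")
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF
          .selectExpr("window.start as window_start", "security", "avg_price")
          .write
          .mode("append")
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost/prices") // placeholder
          .option("dbtable", "avg_prices")                  // placeholder
          .save()
      }
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .option("checkpointLocation", "/tmp/ck-foreachbatch")
      .start()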

To see a discussion on running aggregates, please look for the following
thread in this forum:

"Calculate average from Spark stream"

And for the triggering mechanism you can see an example in my LinkedIn
article below:

https://www.linkedin.com/pulse/processing-change-data-capture-spark-structured-talebzadeh-ph-d-/


HTH



   View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



