Posted to dev@spark.apache.org by Matthias Niehoff <ma...@codecentric.de> on 2016/07/13 20:35:13 UTC

Structured Streaming and Microbatches

Hi everybody,

as far as I understand, with the new Structured Streaming API the output
will no longer be processed every x seconds. Instead, the data will be
processed as soon as it arrives, but there might be a delay due to the
processing time of the data.
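
To make that concrete, a minimal Spark 2.0 sketch (assuming a socket
source; host and port are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("StructuredStreamingSketch")
      .getOrCreate()

    // Read lines from a socket source (illustrative; host/port are placeholders).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // With no trigger specified, a new micro-batch starts as soon as the
    // previous one finishes, so latency is bounded by the processing time.
    val query = lines.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()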

A small example:
Data comes in and its processing takes 1 second (quite long).
During this 1 second, a lot of new data comes in, which will be processed
after the processing of the first data has finished.

My questions are:
Is the data for each processing run, i.e. all the data collected during
that 1 second, still processed as a micro-batch (including reprocessing on
another worker in case of failure, etc.)? Or is the bulk of data processed
record by record?

With regard to the processing time: is the behavior for high processing
times the same as in Spark 1.x? Meaning we get a scheduling delay, data is
buffered by a receiver, etc. (Is there even a concept of a receiver in
Spark 2? Is a source in Structured Streaming basically a receiver?)
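
From a quick look at the Spark 2.0 sources, the contract looks pull-based
rather than push-based like a receiver. A paraphrased sketch of the
internal Source trait (not user-facing API, so details may differ):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    // Illustrative stand-in for Spark's internal Offset type.
    trait Offset

    // Paraphrase of org.apache.spark.sql.execution.streaming.Source:
    // the engine pulls batches by offset range; nothing is pushed.
    trait Source {
      // Schema of the data produced by this source.
      def schema: StructType
      // Latest offset available, if any data has arrived yet.
      def getOffset: Option[Offset]
      // The data for the offset range (start, end], pulled once per batch.
      def getBatch(start: Option[Offset], end: Offset): DataFrame
      // Release any resources.
      def stop(): Unit
    }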

Hope those questions aren't too confusing :-)

Thank you!

Re: Structured Streaming and Microbatches

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,

It's still a micro-batching architecture, with triggers playing the role of
batch intervals. It's just faster by default, and the API is more pleasant,
i.e. Dataset-driven.
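
For example (a rough Spark 2.0 sketch; the source and path are
placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.ProcessingTime

    val spark = SparkSession.builder
      .appName("TriggerSketch")
      .getOrCreate()

    // Illustrative file source; the path is a placeholder.
    val events = spark.readStream.format("text").load("/tmp/events")

    // The trigger plays the role of the old DStream batchInterval: a
    // micro-batch is planned every second. If a batch runs longer than the
    // interval, the next one starts as soon as it finishes.
    val query = events.writeStream
      .format("console")
      .trigger(ProcessingTime("1 second"))
      .start()

    query.awaitTermination()

So conceptually it is still micro-batches, just planned by a trigger
instead of a fixed batchInterval on the StreamingContext.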

Jacek
