Posted to user@spark.apache.org by DandyDev <de...@gmail.com> on 2016/09/13 13:25:20 UTC

Spark Streaming - dividing DStream into mini batches

Hi all!

When reading about Spark Streaming and its execution model, I see diagrams
like this a lot:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg> 

It does a fine job explaining how DStreams consist of micro batches that are
basically RDDs. There are, however, some things I don't understand:

- RDDs are distributed by design, but micro batches are conceptually small.
How/why are these micro batches distributed so that they need to be
implemented as RDDs?
- The above image doesn't explain how Spark Streaming parallelizes data.
According to the image, a stream of events gets broken into micro batches
along the axis of time (time 0 to 1 is a micro batch, time 1 to 2 is a micro
batch, etc.). How does parallelism come into play here? Is it that even
within a "time slot" (e.g. time 0 to 1) there can be so many events that
multiple micro batches for that time slot will be created and distributed
across the executors?

Clarification would be helpful!

Daan





Re: Spark Streaming - dividing DStream into mini batches

Posted by Cody Koeninger <co...@koeninger.org>.
Yeah. If you're looking to reduce over more than one micro batch/RDD,
there's also reduceByKeyAndWindow.
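
For illustration, a rough sketch of a windowed reduction (untested; the
socket source, host/port, and window sizes are made-up assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 1-second micro batches
    val ssc = new StreamingContext(
      new SparkConf().setAppName("WindowedCounts"), Seconds(1))

    // Illustrative source; any DStream of text lines would do
    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Reduces over every micro batch that falls inside a 30-second window,
    // recomputed every 10 seconds -- i.e. across multiple RDDs, unlike a
    // plain reduceByKey, which works within a single micro batch
    val windowed = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()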

On Thu, Sep 15, 2016 at 4:27 AM, Daan Debie <de...@gmail.com> wrote:
> I have another (semi-related) question: I see in the documentation that
> DStream has a transformation reduceByKey. Does this work on _all_ elements
> in the stream, as they're coming in, or is this a transformation per
> RDD/micro batch? I assume the latter, otherwise it would be more akin to
> updateStateByKey, right?



Re: Spark Streaming - dividing DStream into mini batches

Posted by Daan Debie <de...@gmail.com>.
I have another (semi-related) question: I see in the documentation that
DStream has a transformation reduceByKey. Does this work on _all_ elements
in the stream, as they're coming in, or is this a transformation per
RDD/micro batch? I assume the latter, otherwise it would be more akin to
updateStateByKey, right?
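
To make the question concrete, here is the contrast I have in mind, as a
rough, untested sketch (assuming `pairs` is some DStream[(String, Int)]
and `ssc` is the StreamingContext):

    // Per micro batch: the reduction starts fresh with every RDD
    val perBatch = pairs.reduceByKey(_ + _)

    // Across micro batches: state is carried forward from batch to batch
    // (stateful operations need a checkpoint directory)
    ssc.checkpoint("/tmp/streaming-checkpoint")
    val running = pairs.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
    }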

On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger <co...@koeninger.org> wrote:

> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method)
>
> The RDD implementation decides how to partition things (this is the
> getPartitions method)
>
> You can look at those methods in DirectKafkaInputDStream and KafkaRDD
> respectively if you want to see an example

Re: Spark Streaming - dividing DStream into mini batches

Posted by Daan Debie <de...@gmail.com>.
Thanks for the awesome explanation! It's super clear to me now :)

On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger <co...@koeninger.org> wrote:

> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method)
>
> The RDD implementation decides how to partition things (this is the
> getPartitions method)
>
> You can look at those methods in DirectKafkaInputDStream and KafkaRDD
> respectively if you want to see an example

Re: Spark Streaming - dividing DStream into mini batches

Posted by Cody Koeninger <co...@koeninger.org>.
The DStream implementation decides how to produce an RDD for a given time
(this is the compute method).

The RDD implementation decides how to partition things (this is the
getPartitions method).

You can look at those methods in DirectKafkaInputDStream and KafkaRDD
respectively if you want to see an example.
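
Paraphrasing the shape of those two hooks (a simplified sketch, not the
real class hierarchy; exact signatures vary by Spark version):

    import org.apache.spark.Partition
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    trait DStreamContract[T] {
      // Called once per batch interval: produce the micro batch (one RDD)
      // for that time slot
      def compute(validTime: Time): Option[RDD[T]]
    }

    trait RddContract[T] {
      // Decide the parallelism: each Partition becomes one task
      def getPartitions: Array[Partition]
    }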

On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie <de...@gmail.com> wrote:
> Ah, that makes it much clearer, thanks!
>
> It also brings up an additional question: who/what decides on the
> partitioning? Does Spark Streaming decide to divide a micro batch/RDD into
> more than 1 partition based on size? Or is it something that the "source"
> (SocketStream, KafkaStream etc.) decides?



Re: Spark Streaming - dividing DStream into mini batches

Posted by Daan Debie <de...@gmail.com>.
Ah, that makes it much clearer, thanks!

It also brings up an additional question: who/what decides on the
partitioning? Does Spark Streaming decide to divide a micro batch/RDD into
more than one partition based on size? Or is it something that the "source"
(SocketStream, KafkaStream, etc.) decides?

On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <co...@koeninger.org> wrote:

> A micro batch is an RDD.
>
> An RDD has partitions, so different executors can work on different
> partitions concurrently.
>
> Don't think of that as multiple micro-batches within a time slot.
> It's one RDD within a time slot, with multiple partitions.

Re: Spark Streaming - dividing DStream into mini batches

Posted by Cody Koeninger <co...@koeninger.org>.
A micro batch is an RDD.

An RDD has partitions, so different executors can work on different
partitions concurrently.

Don't think of that as multiple micro-batches within a time slot.
It's one RDD within a time slot, with multiple partitions.
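
A quick way to see that for yourself (rough sketch; the socket source and
batch interval are made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("OneRddPerSlot"), Seconds(1))
    val stream = ssc.socketTextStream("localhost", 9999)

    // Exactly one RDD per 1-second time slot; its partition count is what
    // spreads the work across executors
    stream.foreachRDD { (rdd, time) =>
      println(s"batch $time -> ${rdd.partitions.length} partitions")
    }

    ssc.start()
    ssc.awaitTermination()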

On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <de...@gmail.com> wrote:
> Thanks, but that thread does not answer my questions, which are about the
> distributed nature of RDDs vs the small nature of "micro batches" and on how
> Spark Streaming distributes work.



Re: Spark Streaming - dividing DStream into mini batches

Posted by Daan Debie <de...@gmail.com>.
Thanks, but that thread does not answer my questions, which are about the
distributed nature of RDDs vs the small nature of "micro batches", and
about how Spark Streaming distributes work.

On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi Daan,
>
> You may find this link Re: Is "spark streaming" streaming or mini-batch?
> <https://www.mail-archive.com/user@spark.apache.org/msg55914.html>
> helpful. This was a thread in this forum not long ago.
>
> HTH
>
> Dr Mich Talebzadeh

Re: Spark Streaming - dividing DStream into mini batches

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Daan,

You may find this earlier thread from this forum helpful: Re: Is "spark
streaming" streaming or mini-batch?
<https://www.mail-archive.com/user@spark.apache.org/msg55914.html>

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


