Posted to user@spark.apache.org by david <da...@free.fr> on 2014/12/05 14:26:10 UTC

spark streaming kafka best practices?

hi,

  What is the best way to process a batch window in Spark Streaming:

    kafkaStream.foreachRDD(rdd => {
      rdd.collect().foreach(event => {
        // process the event
        process(event)
      })
    })


Or 

    kafkaStream.foreachRDD(rdd => {
      rdd.map(event => {
        // process the event
        process(event)
      }).collect()
    })


thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafa-best-practices-tp20470.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: spark streaming kafka best practices?

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell <pw...@gmail.com> wrote:
>
> On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas <ge...@gmail.com> wrote:
> > I was wondering why one would choose rdd.map vs rdd.foreach to
> > execute a side-effecting function on an RDD.
>

Personally, I like to get the count of processed items, so I do something
like
  rdd.map(item => processItem(item)).count()
instead of
  rdd.foreach(item => processItem(item))
but I would be happy to learn about a better way.

Tobias
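
Tobias's counting idiom can be sketched as below; plain Scala collections stand in for the RDD so the snippet runs without a Spark cluster, and the process function is a hypothetical placeholder for real work. On an actual RDD, map is lazy and count() is the action that triggers the side effects.

```scala
object CountProcessed {
  // hypothetical side-effecting handler standing in for real processing
  def process(event: String): Unit = println(s"processed: $event")

  def main(args: Array[String]): Unit = {
    val events = List("a", "b", "c")

    // map + size: run the side effect and keep the number of items handled
    // (on an RDD this would be rdd.map(e => process(e)).count())
    val n = events.map { e => process(e); e }.size
    println(s"count = $n") // count = 3
  }
}
```

The count comes back essentially for free, which is why some prefer this over a bare foreach that discards all information about how much work was done.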

Re: spark streaming kafka best practices?

Posted by Patrick Wendell <pw...@gmail.com>.
Foreach is slightly more efficient because Spark doesn't bother to try
and collect results from each task since it's understood there will be
no return type. I think the difference is very marginal though - it's
mostly stylistic... typically you use foreach for something that is
intended to produce a side effect and map for something that will
return a new dataset.
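
The stylistic distinction Patrick describes can be illustrated with plain Scala collections, which expose the same two methods with the same contract (foreach returns Unit; map returns a new collection); the same split applies to RDDs:

```scala
object ForeachVsMap {
  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3)

    // foreach: side effect only; the return type is Unit, so there is
    // nothing for the caller (or, on Spark, the driver) to collect back
    xs.foreach(x => println(s"saw $x"))

    // map: no side effect intended; it builds and returns a new dataset
    val doubled = xs.map(_ * 2)
    println(doubled) // List(2, 4, 6)
  }
}
```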

On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas <ge...@gmail.com> wrote:
> Patrick,
>
> I was wondering why one would choose rdd.map vs rdd.foreach to execute a
> side-effecting function on an RDD.
>
> -kr, Gerard.
>


Re: spark streaming kafka best practices?

Posted by Gerard Maas <ge...@gmail.com>.
Patrick,

I was wondering why one would choose rdd.map vs rdd.foreach to execute
a side-effecting function on an RDD.

-kr, Gerard.

On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell <pw...@gmail.com> wrote:
>
> The second choice is better. Once you call collect() you are pulling
> all of the data onto a single node, you want to do most of the
> processing  in parallel on the cluster, which is what map() will do.
> Ideally you'd try to summarize the data or reduce it before calling
> collect().
>

Re: spark streaming kafka best practices?

Posted by Patrick Wendell <pw...@gmail.com>.
The second choice is better. Once you call collect() you are pulling
all of the data onto a single node; you want to do most of the
processing in parallel on the cluster, which is what map() will do.
Ideally you'd try to summarize or reduce the data before calling
collect().
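
The "summarize before collect()" advice can be sketched as below. A local collection stands in for the RDD so the snippet runs on its own, but the shape is the same: aggregate down to a small value in parallel, and ship only that result back to the driver.

```scala
object SummarizeFirst {
  def main(args: Array[String]): Unit = {
    // stand-in for one streaming batch of events
    val events = List(3, 1, 4, 1, 5, 9)

    // Anti-pattern: collect() everything to one node, then process there.
    // Better: reduce to a small summary first; only that crosses the wire.
    // (on an RDD: rdd.map(_.toLong).reduce(_ + _))
    val total = events.map(_.toLong).reduce(_ + _)
    println(s"total = $total") // total = 23
  }
}
```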
