Posted to user@spark.apache.org by dgoldenberg <dg...@gmail.com> on 2015/07/08 15:42:58 UTC

foreachRDD vs. foreachPartition?

Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?

Is it generally true that using foreachPartition avoids some of the
over-network data shuffling overhead?

When would I definitely want to use one method vs. the other?

Thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

RE: foreachRDD vs. foreachPartition?

Posted by Evo Eftimov <ev...@isecc.com>.

Using foreachPartition results in one instance of the lambda/closure per
partition, e.g. when publishing to output systems like message brokers,
databases and file systems. That increases the level of parallelism of your
output processing.

-----Original Message-----
From: dgoldenberg [mailto:dgoldenberg123@gmail.com] 
Sent: Wednesday, July 8, 2015 2:43 PM
To: user@spark.apache.org
Subject: foreachRDD vs. foreachPartition?

Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?

Is it generally true that using foreachPartition avoids some of the
over-network data shuffling overhead?

When would I definitely want to use one method vs. the other?

Thanks.


Re: foreachRDD vs. foreachPartition?

Posted by Dmitry Goldenberg <dg...@gmail.com>.

Thanks, Cody. The "good boy" comment wasn't from me :)  I was the one
asking for help.



On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger <co...@koeninger.org> wrote:

> Sean already answered your question.  foreachRDD and foreachPartition are
> completely different, there's nothing fuzzy or insufficient about that
> answer.  The fact that you can call foreachPartition on an rdd within the
> scope of foreachRDD should tell you that they aren't in any way comparable.
>
> I'm not sure if your rudeness ("be a good boy"...really?) is intentional
> or not.  If you're asking for help from people that are in most cases
> donating their time, I'd suggest that you'll have more success with a
> little more politeness.

Re: foreachRDD vs. foreachPartition?

Posted by Cody Koeninger <co...@koeninger.org>.

Sean already answered your question.  foreachRDD and foreachPartition are
completely different, there's nothing fuzzy or insufficient about that
answer.  The fact that you can call foreachPartition on an rdd within the
scope of foreachRDD should tell you that they aren't in any way comparable.

I'm not sure if your rudeness ("be a good boy"...really?) is intentional or
not.  If you're asking for help from people that are in most cases donating
their time, I'd suggest that you'll have more success with a little more
politeness.

On Wed, Jul 8, 2015 at 9:05 AM, Evo Eftimov <ev...@isecc.com> wrote:

> That was a) fuzzy and b) insufficient – one can certainly use foreach (only)
> on DStream RDDs – it works, as an empirical observation.
>
> As another empirical observation:
>
> Using foreachPartition results in one instance of the lambda/closure per
> partition, e.g. when publishing to output systems like message brokers,
> databases and file systems. That increases the level of parallelism of
> your output processing.
>
> As an architect I deal with gazillions of products and don’t have time to
> read the source code of all of them to make up for documentation
> deficiencies. On the other hand, I believe you have been involved in writing
> some of the code, so be a good boy and either answer this question properly
> or enhance the product documentation of that area of the system.

RE: foreachRDD vs. foreachPartition?

Posted by Evo Eftimov <ev...@isecc.com>.

That was a) fuzzy and b) insufficient – one can certainly use foreach (only) on DStream RDDs – it works, as an empirical observation.

As another empirical observation:

Using foreachPartition results in one instance of the lambda/closure per partition, e.g. when publishing to output systems like message brokers, databases and file systems. That increases the level of parallelism of your output processing.

As an architect I deal with gazillions of products and don’t have time to read the source code of all of them to make up for documentation deficiencies. On the other hand, I believe you have been involved in writing some of the code, so be a good boy and either answer this question properly or enhance the product documentation of that area of the system.

 

From: Sean Owen [mailto:sowen@cloudera.com] 
Sent: Wednesday, July 8, 2015 2:52 PM
To: dgoldenberg; user@spark.apache.org
Subject: Re: foreachRDD vs. foreachPartition?

 

These are quite different operations. One operates on RDDs in  DStream and one operates on partitions of an RDD. They are not alternatives. 

 


Re: foreachRDD vs. foreachPartition?

Posted by Tathagata Das <td...@databricks.com>.

This is also discussed in the programming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg <dg...@gmail.com>
wrote:

> Thanks, Sean.
>
> "are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing"
>
> This is basically what I was looking for.

Re: foreachRDD vs. foreachPartition?

Posted by Dmitry Goldenberg <dg...@gmail.com>.

Thanks, Sean.

"are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do the processing"

This is basically what I was looking for.


On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen <so...@cloudera.com> wrote:

> @Evo There is no foreachRDD operation on RDDs; it is a method of
> DStream. It gives each RDD in the stream. RDD has a foreach, and
> foreachPartition. These give elements of an RDD. What do you mean it
> 'works' to call foreachRDD on an RDD?
>
> @Dmitry are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing.

Re: foreachRDD vs. foreachPartition?

Posted by Sean Owen <so...@cloudera.com>.

@Evo There is no foreachRDD operation on RDDs; it is a method of
DStream. It gives each RDD in the stream. RDD has a foreach, and
foreachPartition. These give elements of an RDD. What do you mean it
'works' to call foreachRDD on an RDD?

@Dmitry are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do the processing.
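
The point about allocating an expensive resource can be illustrated with a minimal plain-Python sketch. It does not use Spark: an RDD is modeled as a list of partitions, and `FakeConnection`, `foreach`, `foreach_partition`, `write_one` and `write_partition` are hypothetical stand-ins, named after the RDD methods only for readability. The sketch just counts how often the resource gets created under each style.

```python
# Plain-Python model, not the Spark API: an "RDD" here is just a list of
# partitions, and each partition is a list of records.

class FakeConnection:
    """Hypothetical stand-in for an expensive resource (e.g. a DB client)."""
    opened = 0  # class-wide counter of how many connections were created

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []

    def send(self, record):
        self.sent.append(record)


def foreach(partitions, f):
    """Model of RDD.foreach: f is called once per element."""
    for partition in partitions:
        for record in partition:
            f(record)


def foreach_partition(partitions, f):
    """Model of RDD.foreachPartition: f is called once per partition and
    receives an iterator over that partition's elements."""
    for partition in partitions:
        f(iter(partition))


def write_one(record):
    conn = FakeConnection()      # a new connection for every record
    conn.send(record)


def write_partition(records):
    conn = FakeConnection()      # a new connection for every partition
    for record in records:
        conn.send(record)


rdd = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # 3 partitions, 9 records

FakeConnection.opened = 0
foreach(rdd, write_one)
per_record = FakeConnection.opened

FakeConnection.opened = 0
foreach_partition(rdd, write_partition)
per_partition = FakeConnection.opened

print(per_record, per_partition)   # prints: 9 3
```

In this model both styles visit the same partitions; what changes is how many times the setup cost is paid, which is the batching benefit described above rather than any extra parallelism.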

On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
<dg...@gmail.com> wrote:
> "These are quite different operations. One operates on RDDs in  DStream and
> one operates on partitions of an RDD. They are not alternatives."
>
> Sean, different operations as they are, they can certainly be used on the
> same data set.  In that sense, they are alternatives. Code can be written
> using one or the other which reaches the same effect - likely at a different
> efficiency cost.
>
> The question is, what are the effects of applying one vs. the other?
>
> My specific scenario is, I'm streaming data out of Kafka.  I want to perform
> a few transformations then apply an action which results in e.g. writing
> this data to Solr.  According to Evo, my best bet is foreachPartition
> because of increased parallelism (which I'd need to grok to understand the
> details of what that means).
>
> Another scenario is, I've done a few transformations and send a result
> somewhere, e.g. I write a message into a socket.  Let's say I have one
> socket per a client of my streaming app and I get a host:port of that socket
> as part of the message and want to send the response via that socket.  Is
> foreachPartition still a better choice?

Re: foreachRDD vs. foreachPartition?

Posted by Dmitry Goldenberg <dg...@gmail.com>.

"These are quite different operations. One operates on RDDs in  DStream and
one operates on partitions of an RDD. They are not alternatives."

Sean, different operations as they are, they can certainly be used on the
same data set.  In that sense, they are alternatives. Code can be written
using one or the other which reaches the same effect - likely at a
different efficiency cost.

The question is, what are the effects of applying one vs. the other?

My specific scenario is, I'm streaming data out of Kafka.  I want to
perform a few transformations then apply an action which results in e.g.
writing this data to Solr.  According to Evo, my best bet is
foreachPartition because of increased parallelism (which I'd need to grok
to understand the details of what that means).

Another scenario is, I've done a few transformations and send a result
somewhere, e.g. I write a message into a socket.  Let's say I have one
socket per a client of my streaming app and I get a host:port of that
socket as part of the message and want to send the response via that
socket.  Is foreachPartition still a better choice?
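
One possible way to think about the socket scenario is sketched below in plain Python. It is not an answer given in this thread and it uses neither Spark nor real sockets: `FakeSocket`, `send_partition` and the `(address, payload)` record shape are assumptions made up for illustration. The idea is that a function passed to foreachPartition could group the records of one partition by their host:port and open each destination once per partition.

```python
# Plain-Python model, not the Spark API and not real sockets.
from collections import defaultdict

opened = []  # every (host, port) a fake socket was opened for


class FakeSocket:
    """Hypothetical stand-in for a client socket."""

    def __init__(self, address):
        opened.append(address)
        self.address = address
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)


def send_partition(records):
    """Function one might pass to foreachPartition: records is an iterator
    of (address, payload) pairs that belong to a single partition."""
    by_address = defaultdict(list)
    for address, payload in records:
        by_address[address].append(payload)
    # Open each destination once for this partition, then send its payloads.
    for address, payloads in by_address.items():
        sock = FakeSocket(address)
        for payload in payloads:
            sock.send(payload)


partition = [
    (("host-a", 9000), "r1"),
    (("host-b", 9001), "r2"),
    (("host-a", 9000), "r3"),
]

send_partition(iter(partition))
print(opened)   # prints: [('host-a', 9000), ('host-b', 9001)]
```

The same client address may still show up in several partitions, so in this model a socket is opened once per partition that contains it, not once overall.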

On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:

> These are quite different operations. One operates on RDDs in  DStream and
> one operates on partitions of an RDD. They are not alternatives.

Re: foreachRDD vs. foreachPartition?

Posted by Sean Owen <so...@cloudera.com>.

These are quite different operations. One operates on RDDs in  DStream and
one operates on partitions of an RDD. They are not alternatives.
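
The two levels can be pictured with a small plain-Python model. It is only a sketch and does not use Spark: a DStream is modeled as a list of batches, each batch (an RDD) as a list of partitions, and `foreach_rdd`, `foreach_partition`, `handle_rdd` and `handle_partition` are hypothetical names chosen to mirror the real methods. It shows that the partition-level call sits inside the batch-level call rather than replacing it.

```python
# Plain-Python model, not the Spark API. A "DStream" is modeled as a list of
# batches; each batch is an "RDD", modeled as a list of partitions.

def foreach_rdd(stream, f):
    """Model of DStream.foreachRDD: f is called once per batch (RDD)."""
    for rdd in stream:
        f(rdd)


def foreach_partition(rdd, f):
    """Model of RDD.foreachPartition: f is called once per partition."""
    for partition in rdd:
        f(iter(partition))


calls = []  # records the order in which the two levels are visited


def handle_partition(records):
    calls.append(("partition", list(records)))


def handle_rdd(rdd):
    calls.append(("rdd", len(rdd)))
    # The partition-level call happens inside the batch-level call.
    foreach_partition(rdd, handle_partition)


stream = [
    [["a", "b"], ["c"]],      # batch 1: two partitions
    [["d"], ["e", "f"]],      # batch 2: two partitions
]

foreach_rdd(stream, handle_rdd)

for call in calls:
    print(call)
```

Each "rdd" entry in `calls` is followed by the "partition" entries of that batch, so the two operations are nested levels of the same traversal.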

On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dg...@gmail.com> wrote:

> Is there a set of best practices for when to use foreachPartition vs.
> foreachRDD?
>
> Is it generally true that using foreachPartition avoids some of the
> over-network data shuffling overhead?
>
> When would I definitely want to use one method vs. the other?
>
> Thanks.