Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2014/03/24 13:57:03 UTC

mapPartitions use case

Dear all,

Sorry for asking such a basic question, but can someone explain when one
should use mapPartitions instead of map?

Thanks

Jaonary

Re: mapPartitions use case

Posted by Nathan Kronenfeld <nk...@oculusinfo.com>.
I've seen two cases most commonly:

The first is when I need to create some processing object to process each
record.  If that object creation is expensive, creating one per record
becomes prohibitive.  So instead, we use mapPartitions, create one object
per partition, and use it on each record in the partition.
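
A minimal sketch of this first case, written in plain Python so it runs
without a cluster. ExpensiveParser and parse_partition are hypothetical
names made up for illustration, and two ordinary lists stand in for the
partitions of an RDD. The point is only the shape of the function: it takes
an iterator over one partition's records and yields results, building the
costly object once at the top.

```python
import re


class ExpensiveParser:
    """Hypothetical stand-in for an object that is costly to construct."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self):
        ExpensiveParser.instances_created += 1
        self.pattern = re.compile(r"\d+")

    def parse(self, record):
        # Extract every integer that appears in the record.
        return [int(tok) for tok in self.pattern.findall(record)]


def parse_partition(records):
    """Iterator in, iterator out: the shape mapPartitions expects."""
    parser = ExpensiveParser()  # built once per partition, not once per record
    for record in records:
        yield parser.parse(record)


# Assumption: two plain lists stand in for the partitions of an RDD.
partitions = [["a1 b2", "c3"], ["d4 e5 f6"]]
results = [list(parse_partition(iter(p))) for p in partitions]
```

With PySpark, the same function could be passed as
rdd.mapPartitions(parse_partition), whereas
rdd.map(lambda r: ExpensiveParser().parse(r)) would construct a parser for
every record.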

The other is when summarizing data.  I've often found it much more efficient
to use a mutable form of the summary object, updating it with each record
in a partition, and then reduce those per-partition results, than to create
a summary object per record and reduce that much larger set of summary
objects.  Again, it saves a lot of object creation.
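
The second case can be sketched the same way, again in plain Python with
lists standing in for partitions. Stats and summarize_partition are
hypothetical names, and the count/total/min/max fields are just an example
of a summary. Each partition updates one mutable Stats object in place and
emits it, so the final reduce merges one object per partition rather than
one per record.

```python
from functools import reduce


class Stats:
    """Hypothetical mutable summary: count, total, minimum, maximum."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def add(self, value):
        # Update in place; no new object is created per record.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        return self

    def merge(self, other):
        # Fold another partition's summary into this one.
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)
        return self


def summarize_partition(values):
    """Emit exactly one summary object for the whole partition."""
    stats = Stats()
    for value in values:
        stats.add(value)
    yield stats


# Assumption: two plain lists stand in for the partitions of an RDD.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0]]
partials = [s for p in partitions for s in summarize_partition(iter(p))]
overall = reduce(lambda a, b: a.merge(b), partials)
```

With PySpark, this would correspond to something like
rdd.mapPartitions(summarize_partition).reduce(lambda a, b: a.merge(b)).
An empty partition yields a Stats with count 0, which merges harmlessly.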






-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com