You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by "ashish.usoni" <as...@gmail.com> on 2015/03/18 18:19:34 UTC

mapPartitions - How Does it Works

I am trying to understand about mapPartitions but i am still not sure how it
works

in the below example it create three partition 
val parallel = sc.parallelize(1 to 10, 3)

and when we do below 
parallel.mapPartitions( x => List(x.next).iterator).collect

it prints value 
Array[Int] = Array(1, 4, 7)

Can some one please explain why it prints 1,4,7 only

Thanks,




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mapPartitions-How-Does-it-Works-tp22123.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: mapPartitions - How Does it Works

Posted by "Alex Turner (TMS)" <al...@toyota.com>.

List(x.next).iterator is giving you the first element from each partition,
which would be 1, 4 and 7 respectively.

On 3/18/15, 10:19 AM, "ashish.usoni" <as...@gmail.com> wrote:

>I am trying to understand about mapPartitions but i am still not sure how
>it
>works
>
>in the below example it create three partition
>val parallel = sc.parallelize(1 to 10, 3)
>
>and when we do below
>parallel.mapPartitions( x => List(x.next).iterator).collect
>
>it prints value 
>Array[Int] = Array(1, 4, 7)
>
>Can some one please explain why it prints 1,4,7 only
>
>Thanks,
>
>
>
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/mapPartitions-How-Does
>-it-Works-tp22123.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>For additional commands, e-mail: user-help@spark.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: mapPartitions - How Does it Works

Posted by "Ganelin, Ilya" <Il...@capitalone.com>.

Map partitions works as follows :
1) For each partition of your RDD, it provides an iterator over the values
within that partition
2) You then define a function that operates on that iterator

Thus if you do the following:

val parallel = sc.parallelize(1 to 10, 3)

parallel.mapPartitions( x => x.map(s => s + 1)).collect

You would get:
res3: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

In your example, x is not a pointer that traverses the iterator (e.g. With
.next()) , it¹s literally the Iterable object itself.
On 3/18/15, 10:19 AM, "ashish.usoni" <as...@gmail.com> wrote:

>I am trying to understand about mapPartitions but i am still not sure how
>it
>works
>
>in the below example it create three partition
>val parallel = sc.parallelize(1 to 10, 3)
>
>and when we do below
>parallel.mapPartitions( x => List(x.next).iterator).collect
>
>it prints value 
>Array[Int] = Array(1, 4, 7)
>
>Can some one please explain why it prints 1,4,7 only
>
>Thanks,
>
>
>
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/mapPartitions-How-Does
>-it-Works-tp22123.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>For additional commands, e-mail: user-help@spark.apache.org
>

________________________________________________________

The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed.  If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

RE: mapPartitions - How Does it Works

Posted by java8964 <ja...@hotmail.com>.

Here is what I think:
mapPartitions is for a specialized map that is called only once for each partition. The entire content of the respective partitions is available as a sequential stream of values via the input argument (Iterarator[T]). The combined result iterators are automatically converted into a new RDD.
So in this case, the RDD (1,2,...., 10) is split as 3 partitions, (1,2,3), (4,5,6), (7,8,9,10).
For every partition, your function is the get the first element as x.next, using it to build a list, return the iterator from the List.
So each partition will return (1), (4) and (7) as 3 iterator, then combine to one final RDD (1, 4, 7).
Yong

> Date: Wed, 18 Mar 2015 10:19:34 -0700
> From: ashish.usoni@gmail.com
> To: user@spark.apache.org
> Subject: mapPartitions - How Does it Works
> 
> I am trying to understand about mapPartitions but i am still not sure how it
> works
> 
> in the below example it create three partition 
> val parallel = sc.parallelize(1 to 10, 3)
> 
> and when we do below 
> parallel.mapPartitions( x => List(x.next).iterator).collect
> 
> it prints value 
> Array[Int] = Array(1, 4, 7)
> 
> Can some one please explain why it prints 1,4,7 only
> 
> Thanks,
> 
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mapPartitions-How-Does-it-Works-tp22123.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

Re: mapPartitions - How Does it Works

Posted by Sabarish Sasidharan <sa...@manthan.com>.

Unlike a map() wherein your task is acting on a row at a time, with
mapPartitions(), the task is passed the  entire content of the partition in
an iterator. You can then return back another iterator as the output. I
don't do scala, but from what I understand from your code snippet... The
iterator x can return all the rows in the partition. But you are returning
back after consuming the first row. Hence you see only 1,4,7 in your
output. These are the first rows of each of your 3 partitions.

Regards
Sab
On 18-Mar-2015 10:50 pm, "ashish.usoni" <as...@gmail.com> wrote:

> I am trying to understand about mapPartitions but i am still not sure how
> it
> works
>
> in the below example it create three partition
> val parallel = sc.parallelize(1 to 10, 3)
>
> and when we do below
> parallel.mapPartitions( x => List(x.next).iterator).collect
>
> it prints value
> Array[Int] = Array(1, 4, 7)
>
> Can some one please explain why it prints 1,4,7 only
>
> Thanks,
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mapPartitions-How-Does-it-Works-tp22123.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>