Posted to user@spark.apache.org by madeleine <ma...@gmail.com> on 2014/06/21 18:37:50 UTC

zip in pyspark truncates RDD to number of processors

Consider the following simple zip:

n = 6
a = sc.parallelize(range(n))
b = sc.parallelize(range(n)).map(lambda j: j) 
c = a.zip(b)
print a.count(), b.count(), c.count()

>> 6 6 4

By varying n, I find that c.count() is always min(n, 4), where 4 happens to
be the number of threads on my machine. Calling c.collect() shows that the
RDD has simply been truncated to its first 4 entries. Weirdly, this doesn't
happen if I leave out the map on b.
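
For what it's worth, the partitioning can be inspected directly; here's a
quick sketch using the glom() and getNumPartitions() RDD methods, run in the
same pyspark shell as above:

# number of partitions in each RDD
print a.getNumPartitions(), b.getNumPartitions(), c.getNumPartitions()
# elements grouped by partition, to see where the zip loses entries
print a.glom().collect()
print b.glom().collect()
print c.glom().collect()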

Any ideas?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-in-pyspark-truncates-RDD-to-number-of-processors-tp8069.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: zip in pyspark truncates RDD to number of processors

Posted by Kan Zhang <kz...@apache.org>.
I couldn't reproduce your issue locally, but I suspect it has something to
do with partitioning. zip() operates partition by partition: it assumes the
two RDDs have the same number of partitions and the same number of elements
in each partition. By default, map() doesn't preserve partitioning. Try
setting preservesPartitioning to True (sketch below) and see if the problem
persists.
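
Something like this, assuming your original example; in pyspark, map() takes
an optional preservesPartitioning flag:

n = 6
a = sc.parallelize(range(n))
# keep b partitioned exactly like its parent so zip() sees matching partitions
b = sc.parallelize(range(n)).map(lambda j: j, preservesPartitioning=True)
c = a.zip(b)
print a.count(), b.count(), c.count()  # should give 6 6 6 if partitioning is the culprit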

