You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Cameron Davidson-Pilon <ca...@shopify.com> on 2016/05/17 05:50:14 UTC

question about Union in pyspark and preserving partitioners

I'm looking into how to do more efficient jobs by using partition
strategies, but I'm hitting a blocker after I do a `union` between two
RDDs. Suppose A and B are both RDDs with the same partitioner, that is,

`A.partitioner == B.partitioner`

If I do A.union(B), the resulting RDD has None partitioner. If I am reading
the code correctly, this comes from this line returning False and a
partitioner never being set:
https://github.com/apache/spark/blob/95f4fbae52d26ede94c3ba8248394749f3d95dcc/python/pyspark/rdd.py#L532

What confuses me is that that line will _always_ return False: is it not
true that the union of RDDs results in the sum of number of partitions,
which will never be equal to self.getNumPartitions?

Am I missing something about this logical check?

-- 


Cam Davidson-Pilon
cam.dp@shopify.com