Posted to user@spark.apache.org by 夏俊鸾 <xi...@gmail.com> on 2014/11/12 08:31:19 UTC
About Join operator in PySpark
Hi all
I have noticed that the "join" operator in PySpark is implemented with the union and
groupByKey operators instead of the cogroup operator. This change
will probably generate an extra shuffle stage. For example:
rdd1 = sc.makeRDD(...).partitionBy(2)
rdd2 = sc.makeRDD(...).partitionBy(2)
rdd3 = rdd1.join(rdd2).collect()
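The union + groupByKey strategy described above can be sketched in plain Python, without Spark, to show why it costs an extra shuffle: both inputs must be re-grouped by key after the union, even if they were already partitioned. This is only an illustrative simulation with made-up names, not the actual PySpark implementation.

```python
from collections import defaultdict

def join_via_union_groupby(rdd1, rdd2):
    # Simulate a union + groupByKey based join on plain lists of
    # (key, value) pairs. In Spark, the groupByKey step below is a
    # shuffle, which is paid even if rdd1 and rdd2 were pre-partitioned.
    # Tag each value with its side, then "union" the two datasets.
    tagged = [(k, (0, v)) for k, v in rdd1] + [(k, (1, v)) for k, v in rdd2]

    # "groupByKey": gather all tagged values for each key.
    groups = defaultdict(list)
    for k, tagged_value in tagged:
        groups[k].append(tagged_value)

    # Dispatch: for each key, emit the cross product of left and right values.
    result = []
    for k, seq in groups.items():
        left = [v for tag, v in seq if tag == 0]
        right = [v for tag, v in seq if tag == 1]
        result.extend((k, (l, r)) for l in left for r in right)
    return result

# Keys present on only one side ("b") are dropped, as in an inner join.
pairs = join_via_union_groupby([("a", 1), ("b", 2)], [("a", 3)])
```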
The code above generates 2 shuffles when implemented in Scala, but 3 shuffles
in PySpark. What was the initial design motivation for the join
operator in PySpark? Any ideas for improving join performance in PySpark?
Andrew