You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by lsn24 <le...@gmail.com> on 2019/03/27 00:29:22 UTC
SortMerge Join on partitioned column causes shuffle
Hello,
We got two datasets thats been persisted as follows:
Dataset A:
datasetA.repartition(5,datasetA.col("region"))
.write().mode(saveMode)
.format("parquet")
.partitionBy("region")
.bucketBy(5,"studentId")
.sortBy("studentId")
.option("path", parquetFilesDirectory)
.saveAsTable( database.tableA));
Dataset B:
datasetB.repartition(5,datasetB.col("region"))
.write().mode(saveMode)
.format("parquet")
.partitionBy("region")
.bucketBy(5,"studentId")
.sortBy("studentId")
.option("path", parquetFilesDirectory)
.saveAsTable( database.tableB));
When I do a join with region and studentId , I see shuffle. If I do join
just with the bucketed column studentId, there is NO shuffle as expected.
Below is the join query.
spark.sql("Select * from database.tableA").join(spark.sql("Select * from
database.tableB "), Seq("studentId","region")).show(10)
What could be the reason for the shuffle when we include the partitionkey
and how can we mitigate it ?
Thanks
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org