You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by lsn24 <le...@gmail.com> on 2019/03/27 00:29:22 UTC

SortMerge Join on partitioned column causes shuffle

Hello,

 We got two datasets thats been persisted as follows:

Dataset A:
datasetA.repartition(5,datasetA.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5,"studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable( database.tableA));

Dataset B:
datasetB.repartition(5,datasetB.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5,"studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable( database.tableB));


When I do a  join with region and studentId , I see shuffle. If I do join
just with the bucketed column studentId, there is NO shuffle as expected.
Below is the join query.

spark.sql("Select *  from  database.tableA").join(spark.sql("Select *  from  
database.tableB "), Seq("studentId","region")).show(10)

What could be the reason for the shuffle when we include the partitionkey
and how can we mitigate it ?

Thanks






--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org