You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yuming Wang (Jira)" <ji...@apache.org> on 2023/03/17 12:37:00 UTC
[jira] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760 ]
Yuming Wang deleted comment on SPARK-42760:
-------------------------------------
was (Author: apachespark):
User '1511351836' has created a pull request for this issue:
https://github.com/apache/spark/pull/40380
> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.3.2
> Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, local mode
> Reporter: binyang
> Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
> spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
> spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> df1 = spark.range(1, 1000).repartition(data_partitions)
> df2 = spark.range(1, 2000).repartition(data_partitions)
> df3 = spark.range(1, 3000).repartition(data_partitions)
> print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
> print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
> df = (df1.join(df2, df1.id == df2.id)
> .join(df3, df1.id == df3.id))
> print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org