Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/03/11 19:41:00 UTC

[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1

    [ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699264#comment-17699264 ] 

Apache Spark commented on SPARK-42760:
--------------------------------------

User '1511351836' has created a pull request for this issue:
https://github.com/apache/spark/pull/40380

> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
>                 Key: SPARK-42760
>                 URL: https://issues.apache.org/jira/browse/SPARK-42760
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.3.2
>         Environment: standard Spark 3.0.3/3.3.2, used in a Jupyter notebook, local mode
>            Reporter: binyang
>            Priority: Major
>
> I am using PySpark. The number of partitions of the DataFrame that results from a join is always 1.
> Here is my code, taken from https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     # Force shuffle joins: set the shuffle partition count and disable broadcast joins.
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     # The joined DataFrame is expected to have spark.sql.shuffle.partitions partitions.
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, the latest 3.3.2 prints out the following:
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1
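One factor worth checking here (an assumption on my part, not something stated in the ticket): Spark 3.2 and later enable Adaptive Query Execution by default (spark.sql.adaptive.enabled=true), and AQE coalesces small post-shuffle partitions, which for tiny inputs like the ranges above can collapse the join output to a single partition regardless of spark.sql.shuffle.partitions. Below is a minimal sketch of how one might verify this, reusing the notebook spark session and the example_shuffle_partitions() function from the report; the two config keys are standard Spark SQL settings.

    # Sketch only, assuming the same session and repro function as above.
    print(spark.conf.get("spark.sql.adaptive.enabled"))                     # defaults to 'true' on 3.2+
    print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))  # defaults to 'true' when AQE is on

    # If AQE coalescing is the cause, disabling it should restore the 3.0.3-style
    # result of 4 post-join partitions; this is an assumption, not a confirmed fix.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
    example_shuffle_partitions()

    # Restore the default afterwards.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")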



