You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "dharani_sugumar (Jira)" <ji...@apache.org> on 2023/11/26 06:26:00 UTC
[jira] [Updated] (SPARK-46105) df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
[ https://issues.apache.org/jira/browse/SPARK-46105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
dharani_sugumar updated SPARK-46105:
------------------------------------
Summary: df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above (was: df.emptyDataFrame shows 1 if we repartition)
> df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
> -----------------------------------------------------------------------
>
> Key: SPARK-46105
> URL: https://issues.apache.org/jira/browse/SPARK-46105
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.3
> Environment: EKS
> EMR
> Reporter: dharani_sugumar
> Priority: Major
> Attachments: Screenshot 2023-11-26 at 11.54.58 AM.png
>
>
> {color:#FF0000}Version: 3.3.3{color}
>
> {color:#FF0000}scala> val df = spark.emptyDataFrame{color}
> {color:#FF0000}df: org.apache.spark.sql.DataFrame = []{color}
> {color:#FF0000}scala> df.rdd.getNumPartitions{color}
> {color:#FF0000}res0: Int = 0{color}
> {color:#FF0000}scala> df.repartition(1).rdd.getNumPartitions{color}
> {color:#FF0000}res1: Int = 1{color}
> {color:#FF0000}scala> df.repartition(1).rdd.isEmpty(){color}
> {color:#FF0000}[Stage 1:> (0 + 1) / res2: Boolean = true{color}
> Version: 3.2.4
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 0
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
>
> {color:#FF0000}Version: 3.5.0{color}
> {color:#FF0000}scala> val df = spark.emptyDataFrame{color}
> {color:#FF0000}df: org.apache.spark.sql.DataFrame = []{color}
> {color:#FF0000}scala> df.rdd.getNumPartitions{color}
> {color:#FF0000}res0: Int = 0{color}
> {color:#FF0000}scala> df.repartition(1).rdd.getNumPartitions{color}
> {color:#FF0000}res1: Int = 1{color}
> {color:#FF0000}scala> df.repartition(1).rdd.isEmpty(){color}
> {color:#FF0000}[Stage 1:> (0 + 1) / res2: Boolean = true{color}
>
> When we do repartition of 1 on an empty dataframe, the resultant partition is 1 in version 3.3.x and 3.5.x whereas when I do the same in version 3.2.x, the resultant partition is 0. May i know why this behaviour is changed from 3.2.x to higher versions.
>
> The reason for raising this as a bug is I have a scenario where my final dataframe returns 0 records in EKS(local spark) with single node(driver and executor on the sam node) but it returns 1 in EMR both uses a same spark version 3.3.3. I'm not sure why this behaves different in both the environments. As a interim solution, I had to repartition a empty dataframe if my final dataframe is empty which returns 1 for 3.3.3. Would like to know if this really a bug or this behaviour exists in the future versions and cannot be changed?
>
> Because, If we go for a spark upgrade and this behaviour is changed, we will face the issue again.
> Please confirm on this.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org