Posted to issues@spark.apache.org by "dharani_sugumar (Jira)" <ji...@apache.org> on 2023/11/26 06:26:00 UTC

[jira] [Updated] (SPARK-46105) df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above

     [ https://issues.apache.org/jira/browse/SPARK-46105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dharani_sugumar updated SPARK-46105:
------------------------------------
    Summary: df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above  (was: df.emptyDataFrame shows 1 if we repartition)

> df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
> -----------------------------------------------------------------------
>
>                 Key: SPARK-46105
>                 URL: https://issues.apache.org/jira/browse/SPARK-46105
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.3
>         Environment: EKS
> EMR
>            Reporter: dharani_sugumar
>            Priority: Major
>         Attachments: Screenshot 2023-11-26 at 11.54.58 AM.png
>
>
> Version: 3.3.3
>  
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 1
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
> Version: 3.2.4
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 0
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
>  
> Version: 3.5.0
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 1
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
>  
> When we do repartition(1) on an empty dataframe, the resulting partition count is 1 in versions 3.3.x and 3.5.x, whereas in version 3.2.x it is 0. May I know why this behaviour changed from 3.2.x to the later versions? A diagnostic sketch for comparing the two versions follows below.
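>
> One way to narrow down where the difference comes from (a diagnostic sketch only; it shows where the partition count diverges, not why the behaviour was changed) is to run the same snippet in a 3.2.x shell and in a 3.3.x / 3.5.x shell and compare the printed plans:
>
> {code:scala}
> // In spark-shell, `spark` is already defined.
> val df = spark.emptyDataFrame
> val repartitioned = df.repartition(1)
>
> // Logical and physical plans for the repartition:
> repartitioned.explain(true)
>
> // Executed plan after the job has actually run (relevant when AQE is enabled):
> repartitioned.collect()
> println(repartitioned.queryExecution.executedPlan)
> {code}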
>  
> The reason for raising this as a bug is that I have a scenario where my final dataframe returns 0 records on EKS (local Spark) with a single node (driver and executor on the same node), but it returns 1 on EMR, and both use the same Spark version, 3.3.3. I'm not sure why this behaves differently in the two environments. As an interim solution, I had to repartition the empty dataframe when my final dataframe is empty, which returns 1 on 3.3.3 (see the sketch below). I would like to know whether this is really a bug, or whether this behaviour will persist in future versions and cannot be changed.
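>
> As a sketch of a workaround that does not depend on this behaviour at all (finalDf and the output path below are placeholders for whatever the real job produces), a row-based Dataset.isEmpty check gives the same answer on 3.2.x, 3.3.x and 3.5.x:
>
> {code:scala}
> // finalDf stands in for the dataframe produced by the actual job.
> val finalDf = spark.emptyDataFrame
>
> // Dataset.isEmpty (available since Spark 2.4) looks at rows, not partitions,
> // so it is unaffected by whether repartition(1) leaves 0 or 1 partitions.
> if (finalDf.isEmpty) {
>   println("final dataframe has no rows")
> } else {
>   // hypothetical sink; replace with the job's real output
>   finalDf.repartition(1).write.mode("overwrite").parquet("/tmp/output")
> }
> {code}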
>  
> Because if we go for a Spark upgrade and this behaviour changes, we will face the issue again.
> Please confirm this.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org