Posted to issues@spark.apache.org by "Laurens Versluis (Jira)" <ji...@apache.org> on 2022/02/03 15:35:00 UTC

[jira] [Updated] (SPARK-38099) Query using an aggregation on a literal value with an empty underlying dataframe returns null

     [ https://issues.apache.org/jira/browse/SPARK-38099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laurens Versluis updated SPARK-38099:
-------------------------------------
    Description: 
Running a query that applies an aggregate function such as avg to a literal value, with an empty DataFrame in the FROM clause, causes Spark to return null.

Minimal reproducible example using Spark 3.2.0 with Java 11:

 
{code:java}
sparkSession.emptyDataFrame().createOrReplaceTempView("empty");
StructType someSchema = new StructType(new StructField[]{DataTypes.createStructField("a", DataTypes.IntegerType, false)});
final Row aRow = RowFactory.create(1); // an Integer value, matching the IntegerType column "a"
sparkSession.createDataFrame(List.of(aRow), someSchema).createOrReplaceTempView("non_empty");
sparkSession.sql("SELECT avg(1)").show(); // standalone query works
sparkSession.sql("SELECT avg(1) FROM empty").show(); // empty DF gives null
sparkSession.sql("SELECT avg(1) FROM non_empty").show(); // works with any non-empty DF
{code}
Output is as follows:
{noformat}
+------+
|avg(1)|
+------+
|   1.0|
+------+
+------+
|avg(1)|
+------+
|  null|
+------+
+------+
|avg(1)|
+------+
|   1.0|
+------+
{noformat}
I would expect the second query to return 1.0 as well; any non-empty DataFrame seems to return 1.0.

Out of curiosity: is the Catalyst optimizer applying an empty-DataFrame optimization that affects the result?
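
One possible reading of the output above is the usual SQL rule that an average over zero rows has no value, while a query without a FROM clause is evaluated over a single implicit row. Whether that rule is what produces the null here is an assumption, not something confirmed in this report. The sketch below is a plain-Java analogy only: it does not use Spark, and the names EmptyAverageSketch and avgOfLiteral are hypothetical.

```java
import java.util.List;
import java.util.OptionalDouble;

// Plain-Java analogy (no Spark involved): average a constant once per input row.
class EmptyAverageSketch {

    // Hypothetical helper: mimics avg(<literal>) evaluated over a list of rows.
    // The literal is emitted once per row, so zero rows means nothing to average.
    public static OptionalDouble avgOfLiteral(double literal, List<?> rows) {
        return rows.stream().mapToDouble(row -> literal).average();
    }

    public static void main(String[] args) {
        // Zero rows: the average is undefined, comparable to a SQL NULL.
        System.out.println(avgOfLiteral(1, List.of()));      // OptionalDouble.empty
        // One row: the average of the literal is the literal itself.
        System.out.println(avgOfLiteral(1, List.of("row"))); // OptionalDouble[1.0]
    }
}
```

Under this reading, the first and third queries see at least one row and the second sees none, which would match the three results shown.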

  was:
Running a query that applies an aggregate function such as avg to a literal value, with an empty DataFrame in the FROM clause, causes Spark to return null.

Minimal reproducible example using Spark 3.2.0 with Java 11:

 
{code:java}
sparkSession.emptyDataFrame().createOrReplaceTempView("empty");
StructType someSchema = new StructType();
someSchema = someSchema.add(DataTypes.createStructField("a", DataTypes.IntegerType, false));
final Row aRow = RowFactory.create(1); // an Integer value, matching the IntegerType column "a"
sparkSession.createDataFrame(List.of(aRow), someSchema).createOrReplaceTempView("non_empty");
sparkSession.sql("SELECT avg(1)").show(); // standalone query works
sparkSession.sql("SELECT avg(1) FROM empty").show(); // empty DF gives null
sparkSession.sql("SELECT avg(1) FROM non_empty").show(); // works with any non-empty DF
{code}
Output is as follows:
{noformat}
+------+
|avg(1)|
+------+
|   1.0|
+------+
+------+
|avg(1)|
+------+
|  null|
+------+
+------+
|avg(1)|
+------+
|   1.0|
+------+
{noformat}
I would expect the second query to return 1.0 as well; any non-empty DataFrame seems to return 1.0.

Out of curiosity: is the Catalyst optimizer applying an empty-DataFrame optimization that affects the result?


> Query using an aggregation on a literal value with an empty underlying dataframe returns null
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38099
>                 URL: https://issues.apache.org/jira/browse/SPARK-38099
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>         Environment: Windows 10, Spark 3.2.0, Java 11.
>            Reporter: Laurens Versluis
>            Priority: Major


