Posted to issues@spark.apache.org by "L. C. Hsieh (Jira)" <ji...@apache.org> on 2022/02/05 08:54:00 UTC

[jira] [Resolved] (SPARK-38099) Query using an aggregation on a literal value with an empty underlying dataframe returns null

     [ https://issues.apache.org/jira/browse/SPARK-38099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

L. C. Hsieh resolved SPARK-38099.
---------------------------------
    Resolution: Invalid

> Query using an aggregation on a literal value with an empty underlying dataframe returns null
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38099
>                 URL: https://issues.apache.org/jira/browse/SPARK-38099
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>         Environment: Windows 10, Spark 3.2.0, Java 11.
>            Reporter: Laurens Versluis
>            Priority: Major
>
> Running a query that applies an aggregation function such as avg to a literal value, with an empty DataFrame in the FROM clause, causes Spark to return null.
> Minimal reproducible example using Spark 3.2.0 with Java 11:
>  
> {code:java}
> sparkSession.emptyDataFrame().createOrReplaceTempView("empty");
> StructType someSchema = new StructType(new StructField[]{DataTypes.createStructField("a", DataTypes.StringType, false)});
> final Row aRow = Row.fromSeq(asScalaBuffer(List.of("a")));
> sparkSession.createDataFrame(List.of(aRow), someSchema).createOrReplaceTempView("non_empty");
> sparkSession.sql("SELECT avg(1)").show(); // standalone query works
> sparkSession.sql("SELECT avg(1) FROM empty").show(); // empty DF gives null
> sparkSession.sql("SELECT avg(1) FROM non_empty").show(); // works with any non-empty DF
> {code}
> Output is as follows:
> {noformat}
> +------+
> |avg(1)|
> +------+
> |   1.0|
> +------+
> +------+
> |avg(1)|
> +------+
> |  null|
> +------+
> +------+
> |avg(1)|
> +------+
> |   1.0|
> +------+
> {noformat}
> I would expect the second query to return 1.0 as well, since any non-empty DataFrame yields 1.0.
>  
> Out of curiosity: is Spark Catalyst applying some empty-DataFrame optimization that affects the result?
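
The notification above gives no reason for the Invalid resolution. One plausible reading, offered here as an assumption rather than the resolver's stated reasoning, is that this is standard SQL aggregate behaviour: a literal inside an aggregate is evaluated once per input row, so a relation with zero rows contributes no values and avg has nothing to average, whereas a query without a FROM clause is typically evaluated over a single implicit row. The sketch below imitates that in plain Java without any Spark dependency; the class AvgLiteralSketch and the method avgOfLiteral are hypothetical names made up for illustration and are not part of any Spark API.

```java
import java.util.Collections;
import java.util.List;
import java.util.OptionalDouble;

public class AvgLiteralSketch {

    // Hypothetical helper: evaluates the literal once per input row and
    // averages the results. Returns null when there are no rows, imitating
    // SQL's NULL for avg over an empty input.
    public static Double avgOfLiteral(List<?> rows, double literal) {
        OptionalDouble avg = rows.stream()
                .mapToDouble(row -> literal) // one value per row, whatever the row holds
                .average();                  // empty when the stream has no elements
        return avg.isPresent() ? avg.getAsDouble() : null;
    }

    public static void main(String[] args) {
        // One input row, comparable to the "non_empty" view.
        System.out.println(avgOfLiteral(List.of("a"), 1));
        // Zero input rows, comparable to the "empty" view.
        System.out.println(avgOfLiteral(Collections.emptyList(), 1));
    }
}
```

Under this reading the null in the second query is not an optimizer artifact: with zero rows there is simply no value to average, regardless of what the literal is.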



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
