Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/05/18 21:40:00 UTC

[jira] [Commented] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable

    [ https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548617#comment-14548617 ] 

Apache Spark commented on SPARK-7696:
-------------------------------------

User 'ogirardot' has created a pull request for this issue:
https://github.com/apache/spark/pull/6237

> Aggregate function's result should be nullable only if the input expression is nullable
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-7696
>                 URL: https://issues.apache.org/jira/browse/SPARK-7696
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Haopu Wang
>            Priority: Minor
>
> In Spark SQL, an aggregate function's result is currently always nullable.
> It would make sense to change the behavior so that if the input expression is nullable, the result is nullable; otherwise, the result is non-nullable.
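> A minimal, self-contained sketch of the proposed rule (the types below are illustrative stand-ins, not Spark's actual Catalyst classes):
>   trait Expr { def nullable: Boolean }
>   case class Col(name: String, nullable: Boolean) extends Expr
>   // The aggregate's nullability would follow its input instead of
>   // being hard-coded to true.
>   case class Sum(child: Expr) extends Expr {
>     def nullable: Boolean = child.nullable
>   }
>   Sum(Col("value", nullable = false)).nullable  // false
>   Sum(Col("key", nullable = true)).nullable     // true
>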
> Please see the following discussion:
> >>>>>>>>>>>>>>>
> From: Olivier Girardot [mailto:ssaboum@gmail.com] 
> Sent: Tuesday, May 12, 2015 5:12 AM
> To: Reynold Xin
> Cc: Haopu Wang; user
> Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
>  
> I'll look into it - not sure yet what I can get out of exprs :p 
>  
> On Mon, May 11, 2015 at 10:35 PM, Reynold Xin <rx...@databricks.com> wrote:
> Thanks for catching this. I didn't read carefully enough.
>  
> It'd make sense to have the UDAF result be non-nullable if the exprs are indeed non-nullable.
>  
> On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ss...@gmail.com> wrote:
> Hi Haopu, 
> actually, `key` is nullable here because that is your input's schema:
> scala> result.printSchema
> root 
> |-- key: string (nullable = true) 
> |-- SUM(value): long (nullable = true) 
> scala> df.printSchema 
> root 
> |-- key: string (nullable = true) 
> |-- value: long (nullable = false)
>  
> I tried it with a schema where the key is not flagged as nullable, and the schema is actually respected. What you can argue, however, is that SUM(value) should also be non-nullable, since value is not nullable.
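> For reference, a minimal sketch of that test (assuming a spark-shell session so that sc and sqlContext are in scope; Spark 1.3-era API):
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> // declare both fields as non-nullable up front
> val schema = StructType(Seq(
>   StructField("key", StringType, nullable = false),
>   StructField("value", LongType, nullable = false)))
> val df = sqlContext.createDataFrame(
>   sc.parallelize(Seq(Row("k1", 2L), Row("k1", 1L))), schema)
> // `key` keeps nullable = false after the groupBy, but SUM(value)
> // is still reported as nullable = true
> df.groupBy("key").agg(sum("value")).printSchema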
>  
> @rxin do you think it would be reasonable to flag the Sum aggregation function as nullable (or not) depending on the input expression's schema?
>  
> Regards, 
>  
> Olivier.
> On Mon, May 11, 2015 at 10:07 PM, Reynold Xin <rx...@databricks.com> wrote:
> Not by design. Would you be interested in submitting a pull request?
>  
> On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <HW...@qilinsoft.com> wrote:
> I am trying to get the result schema of aggregate functions using the
> DataFrame API.
> However, I find that the result fields for groupBy columns are always
> nullable even when the source field is not nullable.
> I want to know whether this is by design, thank you! Below is a simple
> code snippet that shows the issue.
> ======
>   import sqlContext.implicits._
>   import org.apache.spark.sql.functions._
>   case class Test(key: String, value: Long)
>   val df = sc.makeRDD(Seq(Test("k1",2),Test("k1",1))).toDF
>   val result = df.groupBy("key").agg($"key", sum("value"))
>   // From the output, you can see the "key" column is nullable, why??
>   result.printSchema
> //    root
> //     |-- key: string (nullable = true)
> //     |-- SUM(value): long (nullable = true)
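> The same flags can also be read programmatically rather than from the
> printSchema output (a small addition to the snippet above):
>   // both currently return Some(true); per this issue, the second one
>   // arguably should be false, given that `value` is non-nullable
>   result.schema.fields.find(_.name == "key").map(_.nullable)
>   result.schema.fields.find(_.name == "SUM(value)").map(_.nullable)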
>  



