Posted to issues@spark.apache.org by "Huaxin Gao (JIRA)" <ji...@apache.org> on 2017/10/13 22:35:01 UTC

[jira] [Comment Edited] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable

    [ https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204281#comment-16204281 ] 

Huaxin Gao edited comment on SPARK-22271 at 10/13/17 10:34 PM:
---------------------------------------------------------------

I looked at the code. In Average.scala, it has:
{code}
  override lazy val evaluateExpression = child.dataType match {
    case DecimalType.Fixed(p, s) =>
      // increase the precision and scale to prevent precision loss
      val dt = DecimalType.bounded(p + 14, s + 4)
      Cast(Cast(sum, dt) / Cast(count, dt), resultType)
    // ... (other data types elided)
  }
{code}
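To make the arithmetic concrete, here is a sketch of what DecimalType.bounded produces for the reported decimal(38,32) column (bounded is private[sql], so its capping at MAX_PRECISION = 38 is reproduced inline):
{code}
// Sketch only: reproduces DecimalType.bounded's capping for decimal(38, 32).
val (p, s) = (38, 32)                    // input column: decimal(38, 32)
val dtPrecision = math.min(p + 14, 38)   // 52 is capped to MAX_PRECISION = 38
val dtScale     = math.min(s + 4, 38)    // 36
// dt = decimal(38, 36) leaves 38 - 36 = 2 integral digits, but the
// count 299 needs 3, so Cast(count, dt) overflows.
{code}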
With Shafique's test data, dt has precision 38 and scale 36, and count is 299. Representing 299 at scale 36 needs 3 integral digits plus 36 fractional digits, i.e. precision 39, which exceeds dt's precision of 38, so Cast(count, dt) overflows to null.
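The overflow is easy to reproduce directly in spark-shell (a minimal sketch, not taken from the ticket):
{code}
// 299 needs 3 integral digits, but decimal(38, 36) only has 38 - 36 = 2,
// so the cast overflows and yields null, which in turn nulls the average.
spark.sql("SELECT CAST(299 AS DECIMAL(38, 36)) AS c").show()
{code}
Assuming round/bround keeps the precision at 38 and lowers the scale to the requested value, this also explains the reporter's observation below: with scale 31 the average uses decimal(38, 35), which leaves 3 integral digits, just enough for a count of 299, while scale 32 leads to decimal(38, 36) and the overflow above.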
I have a fix and will submit a PR soon. 
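One possible shape for such a fix (a sketch only, not necessarily the submitted patch) is to cast count to an integral decimal instead of dt, since a count never needs fractional digits:
{code}
// Hypothetical sketch, not the actual PR: cast count to decimal(38, 0)
// so the integral count can never overflow; the division's result-type
// rules then determine the quotient's precision and scale.
case DecimalType.Fixed(p, s) =>
  val dt = DecimalType.bounded(p + 14, s + 4)
  Cast(Cast(sum, dt) / Cast(count, DecimalType.bounded(DecimalType.MAX_PRECISION, 0)),
    resultType)
{code}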


> Describe results in "null" for the value of "mean" of a numeric variable
> ------------------------------------------------------------------------
>
>                 Key: SPARK-22271
>                 URL: https://issues.apache.org/jira/browse/SPARK-22271
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: 
>            Reporter: Shafique Jamal
>            Priority: Minor
>         Attachments: decimalNumbers.zip
>
>
> Please excuse me if this issue was addressed already - I was unable to find it.
> Calling .describe().show() on my dataframe results in a value of null for the row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")        
> foo.select(col("numericvariable")).describe().show()
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> +-------+--------------------+
> |summary|     numericvariable|
> +-------+--------------------+
> |  count|                 299|
> |   mean|                null|
> | stddev|  0.2376438793946738|
> |    min|0.037815489727642...|
> |    max|2.138189366554511...|
> +-------+--------------------+
> {noformat}
> But all of the values in the column seem OK (I can attach a parquet file). When I round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +-------+---------------------------+
> |summary|bround(numericvariable, 31)|
> +-------+---------------------------+
> |  count|                        299|
> |   mean|       0.139522503183236...|
> | stddev|         0.2376438793946738|
> |    min|       0.037815489727642...|
> |    max|       2.138189366554511...|
> +-------+---------------------------+
> {noformat}
> Rounding to 32 decimal places also gives null, though.


