Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2018/07/09 10:02:00 UTC

[jira] [Commented] (SPARK-24605) size(null) should return null

    [ https://issues.apache.org/jira/browse/SPARK-24605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536743#comment-16536743 ] 

Apache Spark commented on SPARK-24605:
--------------------------------------

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21736

> size(null) should return null
> -----------------------------
>
>                 Key: SPARK-24605
>                 URL: https://issues.apache.org/jira/browse/SPARK-24605
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> The default behavior size(null) == -1 is a big problem for several reasons:
> # It is inconsistent with how SQL functions handle nulls.
> # It is an extreme violation of [the Principle of Least Astonishment|https://en.wikipedia.org/wiki/Principle_of_least_astonishment] (POLA)
> # It is not called out anywhere in the Spark docs or even [the Hive docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> # It can lead to subtle bugs in analytics.
> For example, our client discovered this behavior while investigating post-click user engagement in their AdTech system. The schema was per ad placement and post-click user engagements were in an array of structs. The culprit was df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), ...), which subtracted 1 for every click without post-click engagement. Luckily, the behavior led to negative engagement counts in some periods, which alerted them to the problem and this bizarre behavior.
> Spark inherited the current behavior from Hive. The most consistent behavior, ignoring the insanity that Hive created in the first place, is for size(null) to behave like length(null), which returns null. This also handles aggregation with sum, avg, etc., because those aggregates skip null inputs.
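
The arithmetic behind the aggregation bug described above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the object name, the helper functions, and the sample data are hypothetical, a null `engagements` array is modeled as `None`, and `sumSkippingNulls` imitates the way a SQL `sum` skips null inputs.

```scala
// Hypothetical sketch: models the two size(null) semantics with plain Scala
// collections. None stands in for a null `engagements` array.
object SizeOfNullSketch {
  // Legacy, Hive-inherited semantics described in the issue: size(null) == -1.
  def legacySize(xs: Option[Seq[String]]): Int = xs.map(_.size).getOrElse(-1)

  // Proposed semantics: size(null) is null, modeled here as None.
  def nullSafeSize(xs: Option[Seq[String]]): Option[Int] = xs.map(_.size)

  // SQL-style sum: null inputs are skipped; if every input is null the result is null.
  def sumSkippingNulls(values: Seq[Option[Int]]): Option[Int] = {
    val nonNull = values.flatten
    if (nonNull.isEmpty) None else Some(nonNull.sum)
  }

  // Made-up data for one placement: four clicks, only the first has engagements.
  val clicks: Seq[Option[Seq[String]]] =
    Seq(Some(Seq("view", "share")), None, None, None)

  // Legacy: 2 + (-1) + (-1) + (-1), so each click without engagement subtracts 1.
  val legacyTotal: Int = clicks.map(legacySize).sum

  // Proposed: the three nulls are skipped and only the real size is summed.
  val proposedTotal: Option[Int] = sumSkippingNulls(clicks.map(nullSafeSize))

  def main(args: Array[String]): Unit = {
    println(s"legacy engagement_count   = $legacyTotal")
    println(s"proposed engagement_count = $proposedTotal")
  }
}
```

With the legacy semantics the total goes negative as soon as clicks without engagement outnumber the real engagements, which matches the symptom the reporter describes.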



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
