Posted to issues@spark.apache.org by "Pablo Langa Blanco (Jira)" <ji...@apache.org> on 2021/04/05 23:21:00 UTC
[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"
[ https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315152#comment-17315152 ]
Pablo Langa Blanco commented on SPARK-34464:
--------------------------------------------
I've opened SPARK-34961 because I think it's an improvement, not a bug, so I will raise a PR there.
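The core observation behind the issue, that picking an arbitrary value per key only needs hashing, never sorting, can be sketched in plain Scala. This is an illustrative sketch, not Spark code; `AnyValueSketch` and `anyValuePerKey` are hypothetical names introduced here.

```scala
// Illustrative sketch (not Spark's implementation): collecting "any value"
// per key needs only a single hash-map pass over the rows; no sort of the
// dataset is required.
object AnyValueSketch {
  def anyValuePerKey[K, V](rows: Seq[(K, V)]): Map[K, V] = {
    val buf = scala.collection.mutable.LinkedHashMap.empty[K, V]
    for ((k, v) <- rows)
      if (!buf.contains(k)) buf(k) = v // keep the first value seen; any value would do
    buf.toMap
  }
}
```

For example, `anyValuePerKey(Seq((0, "a"), (0, "b"), (1, "c")))` yields one value per key without ever ordering the input, which is the behavior the reporter expected from `first` when it is used as "any value".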
> `first` function is sorting the dataset while sometimes it is used to get "any value"
> -------------------------------------------------------------------------------------
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Louis Fruleux
> Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one usually uses [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485] aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which is not always necessary and might impact performance. Is this the desired behavior?
>
>
> Current behavior:
> {code:java}
> val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
> /*
> == Physical Plan ==
> SortAggregate(key=key#342, functions=first(value#343, false))
> +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
> +- Exchange hashpartitioning(key#342, 200)
> +- SortAggregate(key=key#342, functions=partial_first(value#343, false))
> +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
> +- LocalTableScan key#342, value#343
> */
> {code}
>
> My understanding of the source code does not allow me to fully understand why this is the current behavior.
> The solution might be to implement a new aggregate function, but its code would be highly similar to `first`'s. Also, I don't fully understand why this [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45] method falls back to SortAggregate.
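The fallback the reporter asks about can be sketched roughly as follows. This is a simplified, hedged sketch of the selection order in `AggUtils.createAggregate`, assuming the usual preference of hash-based, then object-hash-based, then sort-based aggregation; `AggChoiceSketch` and its parameter names are illustrative, not the actual Spark API.

```scala
// Simplified sketch of the planner's aggregate-exec selection (illustrative,
// not the actual AggUtils.createAggregate source). Hash aggregation requires
// every aggregation buffer field to be a mutable, fixed-width type; when that
// does not hold, the planner falls back to object-hash or sort-based
// aggregation, which is how first("value") on a string buffer can end up
// planned as a SortAggregate.
object AggChoiceSketch {
  sealed trait AggExec
  case object HashAggregate extends AggExec
  case object ObjectHashAggregate extends AggExec
  case object SortAggregate extends AggExec

  def choose(buffersAllMutableFixedWidth: Boolean,
             objectHashSupported: Boolean): AggExec =
    if (buffersAllMutableFixedWidth) HashAggregate
    else if (objectHashSupported) ObjectHashAggregate
    else SortAggregate
}
```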
--
This message was sent by Atlassian Jira
(v8.3.4#803005)