Posted to jira@arrow.apache.org by "Jorge (Jira)" <ji...@apache.org> on 2020/09/08 07:10:00 UTC
[jira] [Updated] (ARROW-9937) [Rust] [DataFusion] Average is not correct
[ https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jorge updated ARROW-9937:
-------------------------
Description:
The current design of aggregates makes the calculation of the average incorrect.
It also makes it impossible to compute the [geometric mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other operations.
The central issue is that {{Accumulator}} returns a {{ScalarValue}} during partial aggregations via {{get_value}}, but very often a single {{ScalarValue}} does not carry enough information to perform the full aggregation.
A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are distributed in batches of at most 2, {[x1, x2], [x3, x4], [x5]}. Our current calculation computes the partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then reduces them using another average, i.e.
{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}
which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
I believe that our Accumulators need to pass more information from the partial aggregations to the final aggregation; for the average, that means the partial sum and the count rather than the partial mean.
We could consider adopting an API equivalent to [Spark's UDAF API|https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html], i.e. one with an {{update}}, a {{merge}} and an {{evaluate}}.
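To make the idea concrete, here is a minimal sketch of what an update/merge/evaluate accumulator could look like for the average. The {{AvgAccumulator}} struct, its fields, and the two helper functions are hypothetical names chosen for illustration; this is not DataFusion's actual {{Accumulator}} trait, only an assumption about the shape such an API might take. It keeps the partial state as a (sum, count) pair and compares the result with the mean of partial means described above.

```rust
/// Hypothetical accumulator for the mean (illustration only, not DataFusion's trait).
/// The partial state is a (sum, count) pair instead of a single scalar.
#[derive(Debug, Default, Clone, PartialEq)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    /// Fold one batch of raw values into the partial state.
    fn update(&mut self, batch: &[f64]) {
        self.sum += batch.iter().sum::<f64>();
        self.count += batch.len() as u64;
    }

    /// Combine the partial state of another accumulator into this one.
    fn merge(&mut self, other: &AvgAccumulator) {
        self.sum += other.sum;
        self.count += other.count;
    }

    /// Produce the final value; `None` when no rows were seen.
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// Mean of the per-batch means, i.e. the calculation reported as incorrect.
/// Assumes at least one batch and no empty batches (otherwise the result is NaN).
fn mean_of_partial_means(batches: &[Vec<f64>]) -> f64 {
    let means: Vec<f64> = batches
        .iter()
        .map(|b| b.iter().sum::<f64>() / b.len() as f64)
        .collect();
    means.iter().sum::<f64>() / means.len() as f64
}

/// Mean obtained by merging the (sum, count) partial states of each batch.
fn mean_via_merge(batches: &[Vec<f64>]) -> Option<f64> {
    let mut total = AvgAccumulator::default();
    for batch in batches {
        let mut partial = AvgAccumulator::default();
        partial.update(batch);
        total.merge(&partial);
    }
    total.evaluate()
}

fn main() {
    // Same layout as the example above: batches [x1, x2], [x3, x4], [x5].
    let batches = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0]];
    println!("mean of partial means: {}", mean_of_partial_means(&batches));
    println!("merged (sum, count):   {:?}", mean_via_merge(&batches));
}
```

With the values 1 to 5, the mean of partial means is 10/3 while the merged state gives 15/5 = 3. The same pattern would extend to other aggregates by changing what the partial state holds, e.g. a set of seen values for a distinct sum.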
was:
The current design of aggregates makes the calculation of the average incorrect.
It also makes it impossible to compute the [geometric mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other operations.
The central issue is that Accumulator returns a `ScalarValue` during partial aggregations via {{get_value}}, but very often a `ScalarValue` is not sufficient information to perform the full aggregation.
A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are distributed in in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then reduces them using another average, i.e.
{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}
which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
I believe that our Accumulators need to pass more information from the partial aggregations to the final aggregation.
We could consider taking an API equivalent to [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), i.e. have an `update`, a `merge` and an `evaluate`.
> [Rust] [DataFusion] Average is not correct
> ------------------------------------------
>
> Key: ARROW-9937
> URL: https://issues.apache.org/jira/browse/ARROW-9937
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Jorge
> Priority: Major