Posted to issues@drill.apache.org by "Boaz Ben-Zvi (JIRA)" <ji...@apache.org> on 2017/07/08 02:17:00 UTC

[jira] [Commented] (DRILL-5665) planner.force_2phase.aggr Set to TRUE for HashAgg may cause wrong results for VARIANCE and STD_DEV

    [ https://issues.apache.org/jira/browse/DRILL-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078907#comment-16078907 ] 

Boaz Ben-Zvi commented on DRILL-5665:
-------------------------------------

Two-phase aggregation works only when the aggregation function is "additive"; that is, when an arbitrary part of the work can be done in the first phase and the rest finished in the second phase (e.g., SUM, COUNT, MIN, MAX, ...). Some aggregation functions can be converted into "additive" ones, like AVG --> SUM + COUNT. Some aggregation functions cannot, like variance() and std_dev(). The new "forcing" option above was crudely added into create2PhasePlan() with no regard to the type of aggregation function, hence the wrong results for variance() and std_dev().
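
To illustrate the point about "additive" functions, here is a minimal Python sketch (not Drill code). The helper two_phase() and the sample data are made up for illustration, and the "naive" variance case simply re-applies the same function to the partial results, which is only one assumed way a forced two-phase plan could go wrong.

```python
from statistics import pvariance


def two_phase(values, partitions, phase1, phase2):
    """Apply phase1 to each partition, then phase2 to the partial results."""
    chunks = [values[i::partitions] for i in range(partitions)]
    return phase2([phase1(chunk) for chunk in chunks])


data = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]  # illustrative sample values

# SUM is additive: summing the partial sums equals the single-phase sum.
sum_single = sum(data)
sum_two_phase = two_phase(data, 2, sum, sum)

# AVG is not additive by itself, but it can be rewritten as SUM + COUNT:
# phase 1 emits (sum, count) pairs, phase 2 combines them.
avg_single = sum(data) / len(data)
avg_two_phase = two_phase(
    data,
    2,
    lambda chunk: (sum(chunk), len(chunk)),
    lambda parts: sum(s for s, _ in parts) / sum(n for _, n in parts),
)

# Naively re-applying variance to the partial variances does not give
# the variance of the full input.
var_single = pvariance(data)
var_two_phase = two_phase(data, 2, pvariance, pvariance)

print(sum_single == sum_two_phase)   # partial sums combine correctly
print(avg_single == avg_two_phase)   # SUM + COUNT rewrite combines correctly
print(var_single, var_two_phase)     # the two variance results differ
```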

Possible simple fix: Use the same condition (i.e., is it SUM, COUNT, etc.) in conjunction with the new option, so that the new option does not apply to these "non-additive" functions.
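
The proposed condition could look roughly like the following Python sketch. It is not the actual create2PhasePlan() logic; the names ADDITIVE_FUNCTIONS, should_use_two_phase, force_two_phase and cost_prefers_two_phase are hypothetical, and the sketch assumes AVG has already been rewritten into SUM + COUNT before the check.

```python
# Hypothetical set, taken from the functions named above as "additive".
ADDITIVE_FUNCTIONS = {"SUM", "COUNT", "MIN", "MAX"}


def should_use_two_phase(agg_functions, force_two_phase, cost_prefers_two_phase):
    """Allow a two-phase plan only if every aggregate function is additive.

    force_two_phase stands in for the new forcing option; it is honored
    only together with the additivity check, never on its own.
    """
    all_additive = all(name.upper() in ADDITIVE_FUNCTIONS for name in agg_functions)
    return all_additive and (force_two_phase or cost_prefers_two_phase)


# The forcing option takes effect for additive functions only.
print(should_use_two_phase(["sum", "count"], True, False))
print(should_use_two_phase(["sum", "variance"], True, False))
```

With this shape, a query containing variance() or std_dev() falls back to a single-phase plan even when the option is set.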

> planner.force_2phase.aggr Set to TRUE for HashAgg may cause wrong results for VARIANCE and STD_DEV
> --------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5665
>                 URL: https://issues.apache.org/jira/browse/DRILL-5665
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Boaz Ben-Zvi
>             Fix For: 1.11.0
>
>
> *planner.force_2phase.aggr* was added for testing the hash two-phase spill-to-disk aggregation implementation. However, if it is set to true, stream aggregate will run in two phases too and return wrong results for some functions such as variance() and std_dev().


