You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Rui Wang (Jira)" <ji...@apache.org> on 2021/02/01 19:43:00 UTC

[jira] [Commented] (CALCITE-4483) Add WITHIN DISTINCT clause for aggregate functions

    [ https://issues.apache.org/jira/browse/CALCITE-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276604#comment-17276604 ] 

Rui Wang commented on CALCITE-4483:
-----------------------------------

Use the example above, I guess there is an ambiguity to me: 

SUM(age) WITHIN DISTINCT (name)

Which age to use for the same name, especially when for the same name, multiple ages are different? (the age is happen to be the same in this example, but that's no guarantee).

> Add WITHIN DISTINCT clause for aggregate functions
> --------------------------------------------------
>
>                 Key: CALCITE-4483
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4483
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Priority: Major
>
> Add a {{WITHIN DISTINCT}} clause to aggregate functions, allowing duplicate rows to be eliminated before entering the function.
> This feature is non-standard, and in fact does not exist in any database I am aware of.
> It is related to {{DISTINCT}}, and is in fact a generalization of it. {{DISTINCT}} can always be rewritten in terms of {{WITHIN DISTINCT}}. For example, {{SUM(DISTINCT sal)}} is equivalent to {{SUM(sal) WITHIN DISTINCT (sal)}}.
> Consider the query
> {noformat}
> SELECT SUM(age),
>    SUM(DISTINCT age),
>   SUM(age) WITHIN DISTINCT (name)
> FROM Friends{noformat}
> where {{Friends}} has the rows
> {noformat}
> name   age job
> ====== === ==========
> Julian  16 Programmer
> Dick    15
> Anne    13 Car wash
> George  15 Lifeguard
> George  15 Dog walker
> Timmy    4
> {noformat}
> Note that there are two rows for George, because she has two jobs.
> The values of the columns are as follows:
>  * {{SUM(age)}} has the value (16 + 15 + 13 + 15 + 15 + 4) = 78;
>  * {{SUM(DISTINCT age)}} has the value (16 + 15 + 13 + 4) = 48;
>  * {{SUM(age) WITHIN DISTINCT (name)}} has the value (16 + 15 + 13 + 15 + 4) = 63.
> {{WITHIN DISTINCT}} has treated the two 15 values for George as one value, but has still counted the 15 for Dick separately.
> The {{WITHIN DISTINCT}} clause can be useful to prevent double-counting when duplicates have been added via a many-to-one join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)