You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (Jira)" <ji...@apache.org> on 2021/02/11 23:55:00 UTC
[jira] [Resolved] (CALCITE-4483) WITHIN DISTINCT clause for
aggregate functions (experimental)
[ https://issues.apache.org/jira/browse/CALCITE-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julian Hyde resolved CALCITE-4483.
----------------------------------
Fix Version/s: 1.27.0
Resolution: Fixed
Fixed in [f1da6550|https://github.com/apache/calcite/commit/f1da65504e598928cf77aa6a7244552692ae2529].
As noted above, this feature is experimental and subject to change or removal without notice.
Further work is required to complete the feature:
* support FILTER clause in combination with WITHIN DISTINCT;
* support WITHIN DISTINCT clause in GROUPING SETS queries;
* duplicate detection fails for aggregate functions that respect nulls.
Regarding the last point, aggregate functions that nulls. Luckily most aggregate functions ignore nulls by default, but exceptions include {{ARRAY_AGG}}. If the input to {{ARRAY_AGG}} for a given value of the {{WITHIN DISTINCT}} key is [null, 1, null], {{MIN}} and {{MAX}} will both be 1, so we cannot determine that all of the values were not the same.
> WITHIN DISTINCT clause for aggregate functions (experimental)
> -------------------------------------------------------------
>
> Key: CALCITE-4483
> URL: https://issues.apache.org/jira/browse/CALCITE-4483
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: Julian Hyde
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.27.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Add a {{WITHIN DISTINCT}} clause to aggregate functions, allowing duplicate rows to be eliminated before entering the function.
> This feature is non-standard, and in fact does not exist in any database I am aware of. Therefore this feature is experimental, and may evolve or be removed without notice.
> It is related to {{DISTINCT}}, and is in fact a generalization of it. {{DISTINCT}} can always be rewritten in terms of {{WITHIN DISTINCT}}. For example, {{SUM(DISTINCT sal)}} is equivalent to {{SUM(sal) WITHIN DISTINCT (sal)}}.
> Consider the query
> {noformat}
> SELECT SUM(age),
> SUM(DISTINCT age),
> SUM(age) WITHIN DISTINCT (name)
> FROM Friends{noformat}
> where {{Friends}} has the rows
> {noformat}
> name age job
> ====== === ==========
> Julian 16 Programmer
> Dick 15
> Anne 13 Car wash
> George 15 Lifeguard
> George 15 Dog walker
> Timmy 4
> {noformat}
> Note that there are two rows for George, because she has two jobs.
> The values of the columns are as follows:
> * {{SUM(age)}} has the value (16 + 15 + 13 + 15 + 15 + 4) = 78;
> * {{SUM(DISTINCT age)}} has the value (16 + 15 + 13 + 4) = 48;
> * {{SUM(age) WITHIN DISTINCT (name)}} has the value (16 + 15 + 13 + 15 + 4) = 63.
> {{WITHIN DISTINCT}} has treated the two 15 values for George as one value, but has still counted the 15 for Dick separately.
> The {{WITHIN DISTINCT}} clause can be useful to prevent double-counting when duplicates have been added via a many-to-one join.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)