Posted to dev@phoenix.apache.org by "James Taylor (JIRA)" <ji...@apache.org> on 2016/02/20 02:57:18 UTC
[jira] [Updated] (PHOENIX-2700) Push down count(group by key) queries
[ https://issues.apache.org/jira/browse/PHOENIX-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Taylor updated PHOENIX-2700:
----------------------------------
Description:
Queries that attempt to detect duplicates can return a lot of data to the client when the column being deduplicated is near-unique. For example:
{code}
SELECT SUM(DUP_COUNT)
FROM (
SELECT DEDUP_KEY, COUNT(1) AS DUP_COUNT
FROM TABLE_TO_DEDUP
GROUP BY DEDUP_KEY
)
WHERE DUP_COUNT > 1
{code}
If all of the following are true, then we can detect duplicates on the region server in our coprocessors instead of returning every unique DEDUP_KEY to the client for a final merge:
- each scan won't be split on the same DEDUP_KEY
- the DEDUP_KEY is the leading primary key column
- we can push the DUP_COUNT > 1 evaluation through our coprocessor
The first requirement is the hardest to satisfy, but a custom split policy could potentially be added to guarantee it.
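A minimal sketch (plain Python, not Phoenix or HBase code; all names here are hypothetical) of why the first condition matters: if no DEDUP_KEY ever spans a region boundary, each region scan sees every row for each of its keys, so it can evaluate DUP_COUNT > 1 locally and return a single partial sum. The client-side merge then degenerates into adding one number per region instead of merging every distinct DEDUP_KEY.

```python
from itertools import groupby

def region_dup_count(rows):
    # rows: DEDUP_KEY values within one region scan, already sorted
    # because DEDUP_KEY is the leading primary key column.
    # The DUP_COUNT > 1 filter is applied here, server-side, so only
    # one aggregate number is returned per region.
    return sum(n for n in
               (sum(1 for _ in group) for _, group in groupby(rows))
               if n > 1)

def client_sum(regions):
    # The client's "final merge" is a trivial sum of one partial
    # result per region -- valid only because no key spans regions.
    return sum(region_dup_count(r) for r in regions)

# Example: region splits never separate rows sharing a DEDUP_KEY.
regions = [["a", "a", "b"], ["c", "c", "c", "d"], ["e"]]
print(client_sum(regions))  # "a" contributes 2, "c" contributes 3 -> 5
```

If a split placed one "c" row in a different region, each side would see a local count of less than 3 and could misreport, which is exactly why the custom split policy is needed.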
> Push down count(group by key) queries
> -------------------------------------
>
> Key: PHOENIX-2700
> URL: https://issues.apache.org/jira/browse/PHOENIX-2700
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)