You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@phoenix.apache.org by "James Taylor (JIRA)" <ji...@apache.org> on 2018/05/23 19:29:00 UTC

[jira] [Commented] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN

    [ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487897#comment-16487897 ] 

James Taylor commented on PHOENIX-4751:
---------------------------------------

Couple thoughts on this:
 * see SpillableGroupByCache for an implementation of a spillable hash map that could be used for hash-based aggregation. The comment at the top references an algorithm described here: [http://db.inf.uni-tuebingen.de/files/teaching/ws1011/db2/db2-hash-indexes.pdf.] It's difficult to get good performance once you have to start spilling to disk.
 * another alternative would be to sort on the region server as this would distribute the sort across the cluster. The reason the sort is done at all is to make it scalable to do the final aggregation through a merge sort.
 * introduce a shuffle step in the query plan to prevent aggregating on the client. This could use an UPSERT SELECT command to write intermediate aggregate results to a temp table followed by running an aggregate query on the results. In this case, the results would be naturally sorted by HBase.

> Support client-side hash aggregation with SORT_MERGE_JOIN
> ---------------------------------------------------------
>
>                 Key: PHOENIX-4751
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4751
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.14.0, 4.13.1
>            Reporter: Gerald Sangudi
>            Priority: Major
>
> A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash aggregation in some cases, for improved performance.
> When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash aggregation. It instead performs a CLIENT SORT followed by a CLIENT AGGREGATE. The performance can be improved if (a) the GROUP BY output does not need to be sorted, and (b) the GROUP BY input is large enough and has low cardinality.
> The hash aggregation can initially be a hint. Here is an example from Phoenix 4.13.1 that would benefit from hash aggregation if the GROUP BY input is large with low cardinality.
> CREATE TABLE unsalted (
>        keyA BIGINT NOT NULL,
>        keyB BIGINT NOT NULL,
>        val SMALLINT,
>        CONSTRAINT pk PRIMARY KEY (keyA, keyB)
> );
> EXPLAIN
> SELECT /*+ USE_SORT_MERGE_JOIN */ 
> t1.val v1, t2.val v2, COUNT(\*) c 
> FROM unsalted t1 JOIN unsalted t2 
> ON (t1.keyA = t2.keyA) 
> GROUP BY t1.val, t2.val;
> +------------------------------------------------------------+-----------------+----------------+--+
> |                            PLAN                            | EST_BYTES_READ  | EST_ROWS_READ  |  |
> +------------------------------------------------------------+-----------------+----------------+--+
> | SORT-MERGE-JOIN (INNER) TABLES                             | null            | null           |  |
> |     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED  | null            | null           |  |
> | AND                                                        | null            | null           |  |
> |     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED  | null            | null           |  |
> | CLIENT SORTED BY [TO_DECIMAL(T1.VAL), T2.VAL]              | null            | null           |  |
> | CLIENT AGGREGATE INTO DISTINCT ROWS BY [T1.VAL, T2.VAL]    | null            | null           |  |
> +------------------------------------------------------------+-----------------+----------------+--+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)