You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Danny Chen (Jira)" <ji...@apache.org> on 2019/12/12 01:29:00 UTC

[jira] [Commented] (CALCITE-3594) Support hot Groupby keys hint

    [ https://issues.apache.org/jira/browse/CALCITE-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994046#comment-16994046 ] 

Danny Chen commented on CALCITE-3594:
-------------------------------------

Thanks for firing this issue [~amaliujia] ~, currently, there is no any hints item implementations in Calcite,
i'm glad that we now began to add some builtin items

To add one, we may:

- define a hint keyword (We may some reference with other engines to give it a readable name)
- define the hint options(either a simple identifier list or k-v list), need to make the options as much common to use
- define the hint strategy for the hint item (see if you need some special conditions for the relational expressions)

> Support hot Groupby keys hint
> -----------------------------
>
>                 Key: CALCITE-3594
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3594
>             Project: Calcite
>          Issue Type: Sub-task
>            Reporter: Rui Wang
>            Assignee: Rui Wang
>            Priority: Major
>
> It will be useful for Apache Beam if we support the following SqlHint:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key(key1=fanout_factor, ...) */)
> The hot key strategy works on aggregation and it provides a list of hot keys with fanout factor for a column. The fanout factor says how many partition should be created for that specific key, such that we can have a per partition aggregate and then have a final aggregate. One example to explain it:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key("value1"=2) */)
> // for the key_column, there is a "value1" which appear so many times (so it's hot), please consider split it into two partition and process separately.
> Such problem is common for big data processing, where hot key creates slowest machine which either slow down the whole pipeline or make retries. In such case, one common resolution is to split data to multiple partition and aggregate per partition, and then have a final combine. 
> Usually execution engine won't know what is the hot key(s). SqlHint provides a good way to tell the engine which key is useful to deal with it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)