Posted to issues@calcite.apache.org by "Rui Wang (Jira)" <ji...@apache.org> on 2019/12/14 04:39:00 UTC

[jira] [Comment Edited] (CALCITE-3594) Support hot Groupby keys hint

    [ https://issues.apache.org/jira/browse/CALCITE-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995369#comment-16995369 ] 

Rui Wang edited comment on CALCITE-3594 at 12/14/19 4:38 AM:
-------------------------------------------------------------

[~danny0405] first of all, thanks for your detailed [Hint doc|https://docs.google.com/document/d/1mykz-w2t1Yw7CH6NjUWpWqCAf_6YNKxSc59gXafrNCs/edit#]. I learned a lot from it.

{quote}define a hint keyword (We may some reference with other engines to give it a readable name){quote}
I think you are asking whether there is a better name, or a common name used by other engines, for "hot_key"?

{quote}define the hint options(either a simple identifier list or k-v list), need to make the options as much common to use{quote}
This is a tricky one. I see that there is support for a kv list where the key is a SqlIdentifier and the value is just a string literal. In the hot key case, we will need a list of kv pairs where the key is a literal, because group-by keys are columns that can have different types. So my current thought is to add support for a list of kv pairs, Literal:Literal, as a hint's body. What do you think?
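
To make the Literal:Literal idea concrete, here is a minimal sketch in plain Java. It is not Calcite API: the names HotKeyOptionsSketch, HotKeyOption, parse and parseLiteral are hypothetical, and the body syntax simply follows the "value1"=2 form used in the issue description. It only handles string and integer key literals, and the parser is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Calcite API: a hint body made of Literal:Literal pairs. */
final class HotKeyOptionsSketch {

  /** One pair: a typed group-by key value and its fanout factor. */
  static final class HotKeyOption {
    final Object keyLiteral; // String or Long here; a real design would cover more types
    final int fanout;

    HotKeyOption(Object keyLiteral, int fanout) {
      this.keyLiteral = keyLiteral;
      this.fanout = fanout;
    }
  }

  /**
   * Naive parser for a body such as {@code "value1"=2, 42=4}. It splits on ','
   * and on the last '=' of each pair, so string literals containing ',' are
   * not handled.
   */
  static List<HotKeyOption> parse(String body) {
    List<HotKeyOption> options = new ArrayList<>();
    for (String pair : body.split(",")) {
      int eq = pair.lastIndexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Expected key=fanout but got: " + pair);
      }
      String rawKey = pair.substring(0, eq).trim();
      int fanout = Integer.parseInt(pair.substring(eq + 1).trim());
      options.add(new HotKeyOption(parseLiteral(rawKey), fanout));
    }
    return options;
  }

  /** A double-quoted key becomes a String; anything else must be an integer. */
  static Object parseLiteral(String raw) {
    if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
      return raw.substring(1, raw.length() - 1);
    }
    return Long.valueOf(raw);
  }
}
```

The point of the sketch is that the key keeps its own type (String vs. Long), which an identifier-keyed list cannot express.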

{quote}define the hint strategy for the hint item (see if you need some special conditions for the relational expressions){quote}
Agreed. In [HintStrategies|https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/hint/HintStrategies.java#L22], there is no strategy for aggregation, so I should probably add one. Aggregate should probably implement the Hintable interface as well.



Lastly, I believe the parser has no support for parsing a hint in GROUP BY. I will probably need to add that as well.


was (Author: amaliujia):
[~danny0405] first of all, thanks for your detailed [Hint doc|https://docs.google.com/document/d/1mykz-w2t1Yw7CH6NjUWpWqCAf_6YNKxSc59gXafrNCs/edit#]. I learned a lot from it.

{quote}define a hint keyword (We may some reference with other engines to give it a readable name){quote}
I think you are asking whether there is a better name, or a common name used by other engines, for "hot_key"?

{quote}define the hint options(either a simple identifier list or k-v list), need to make the options as much common to use{quote}
This is a tricky one. I see that there is support for a kv list where the key is a SqlIdentifier and the value is just a string literal. In the hot key case, we will need a list of kv pairs where the key is a literal, because group-by keys can have different types, so the values should have different types too. So my current thought is to add support for a list of kv pairs, Literal:Literal, as a hint's body. What do you think?

{quote}define the hint strategy for the hint item (see if you need some special conditions for the relational expressions){quote}
Agreed. In [HintStrategies|https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/hint/HintStrategies.java#L22], there is no strategy for aggregation, so I should probably add one.



Lastly, I believe the parser has no support for parsing a hint in GROUP BY. I will probably need to add that as well.

> Support hot Groupby keys hint
> -----------------------------
>
>                 Key: CALCITE-3594
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3594
>             Project: Calcite
>          Issue Type: Sub-task
>            Reporter: Rui Wang
>            Assignee: Rui Wang
>            Priority: Major
>
> It will be useful for Apache Beam if we support the following SqlHint:
> SELECT * FROM t
> GROUP BY t.key_column /*+ hot_key(key1=fanout_factor, ...) */
> The hot key strategy applies to aggregation: it provides a list of hot keys, each with a fanout factor, for a column. The fanout factor says how many partitions should be created for that specific key, so that we can compute a per-partition aggregate and then a final aggregate. An example:
> SELECT * FROM t
> GROUP BY t.key_column /*+ hot_key("value1"=2) */
> // In key_column, the value "value1" appears so many times that it is hot; please consider splitting it into two partitions and processing them separately.
> This problem is common in big data processing, where a hot key makes one machine the slowest, which either slows down the whole pipeline or causes retries. In such cases, a common resolution is to split the data into multiple partitions, aggregate per partition, and then do a final combine.
> Usually the execution engine does not know which keys are hot. SqlHint provides a good way to tell the engine which keys need this treatment.
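
The split-then-combine resolution described in the issue can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Calcite or Beam code: the class HotKeyFanoutSketch and its methods are hypothetical, the aggregate is a simple COUNT, and rows of a hot key are spread round-robin across its shards (a real engine would more likely pick a shard randomly or by hash).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Calcite or Beam API: two-phase COUNT with hot-key fanout. */
final class HotKeyFanoutSketch {

  /**
   * Phase 1: count rows per (key, shard). A key listed in hotKeys is spread
   * round-robin over its fanout factor; every other key only uses shard 0.
   */
  static Map<Map.Entry<String, Integer>, Long> partialCounts(
      List<String> keys, Map<String, Integer> hotKeys) {
    Map<Map.Entry<String, Integer>, Long> partial = new HashMap<>();
    Map<String, Integer> seen = new HashMap<>();
    for (String key : keys) {
      // Treat a missing or non-positive fanout as "no fanout".
      int fanout = Math.max(1, hotKeys.getOrDefault(key, 1));
      int occurrence = seen.merge(key, 1, Integer::sum);
      int shard = (occurrence - 1) % fanout;
      partial.merge(Map.entry(key, shard), 1L, Long::sum);
    }
    return partial;
  }

  /** Phase 2: combine the per-shard counts back into one count per key. */
  static Map<String, Long> finalCounts(Map<Map.Entry<String, Integer>, Long> partial) {
    Map<String, Long> result = new HashMap<>();
    partial.forEach((shardedKey, count) -> result.merge(shardedKey.getKey(), count, Long::sum));
    return result;
  }

  public static void main(String[] args) {
    // Mirrors hot_key("value1"=2): "value1" is split across two shards.
    List<String> keys = List.of("value1", "value1", "value1", "value1", "value2");
    Map<Map.Entry<String, Integer>, Long> partial = partialCounts(keys, Map.of("value1", 2));
    System.out.println("partial: " + partial);
    System.out.println("final:   " + finalCounts(partial));
  }
}
```

In a distributed setting each (key, shard) pair could be processed by a different worker, so the hot key no longer lands on a single machine; only the much smaller partial results meet in the final combine.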


