Posted to issues@kylin.apache.org by "hongbin ma (JIRA)" <ji...@apache.org> on 2016/12/19 01:48:58 UTC
[jira] [Comment Edited] (KYLIN-2269) Reduce MR memory usage for global dict

    [ https://issues.apache.org/jira/browse/KYLIN-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759862#comment-15759862 ] 

hongbin ma edited comment on KYLIN-2269 at 12/19/16 1:48 AM:
-------------------------------------------------------------

Hi [~kangkaisen]

Thanks for the patch! I took a quick look. It seems to me the patch aims to add sorting to "DISTRIBUTE BY" by using "CLUSTER BY". Since "DISTRIBUTE BY" already leverages the "shardby" column and the toggle "kylin.source.hive.redistribute-flat-table", I'm afraid the current implementation (adding a separate kylin.source.hive.flat-table-cluster-by-column) might bring too much complexity. What if a user sets column A as the shardBy column and column B as the cluster-by column? The configuration entry does not seem to prohibit this.

Can we keep leveraging the shardBy column? There could be two toggles, "kylin.source.hive.redistribute-flat-table" and "kylin.source.hive.redistribute-and-sort-flat-table", with the latter taking precedence over the former. As in the logic in org.apache.kylin.job.JoinedFlatTable#appendDistributeStatement, if a shardBy column is specified, it will be used for DISTRIBUTE BY or CLUSTER BY; otherwise only DISTRIBUTE BY RAND() is used. Your patch can use the latter toggle. Both configurations will be overridable at the cube level.
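
To make the proposal concrete, here is a rough sketch of how the two toggles could interact. It is not the actual JoinedFlatTable code: the class name, the method signature, and the two boolean parameters (standing in for the two config entries) are hypothetical and only illustrate the precedence rule described above.

```java
public class FlatTableDistributeSketch {

    /**
     * Appends a redistribution clause to a flat-table query.
     *
     * Hypothetical parameters (assumptions, not Kylin's real signature):
     *   redistribute        stands in for kylin.source.hive.redistribute-flat-table
     *   redistributeAndSort stands in for kylin.source.hive.redistribute-and-sort-flat-table
     */
    static void appendDistributeStatement(StringBuilder sql, String shardByColumn,
            boolean redistribute, boolean redistributeAndSort) {
        // Neither toggle is on: leave the query untouched.
        if (!redistribute && !redistributeAndSort) {
            return;
        }
        // No shardBy column: fall back to a random redistribution only.
        if (shardByColumn == null || shardByColumn.isEmpty()) {
            sql.append(" DISTRIBUTE BY RAND()");
            return;
        }
        // The sort toggle takes precedence over the plain redistribute toggle.
        if (redistributeAndSort) {
            sql.append(" CLUSTER BY ").append(shardByColumn);
        } else {
            sql.append(" DISTRIBUTE BY ").append(shardByColumn);
        }
    }

    public static void main(String[] args) {
        // Both toggles on with a shardBy column: CLUSTER BY is chosen.
        StringBuilder sql = new StringBuilder("SELECT * FROM flat_table");
        appendDistributeStatement(sql, "SELLER_ID", true, true);
        System.out.println(sql);
    }
}
```

With this shape there is only one column to configure (the shardBy column), so the "column A for shardBy, column B for cluster by" conflict cannot arise.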

Please leave your comments.


> Reduce MR memory usage for global dict
> --------------------------------------
>
>                 Key: KYLIN-2269
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2269
>             Project: Kylin
>          Issue Type: Improvement
>    Affects Versions: v1.6.0
>            Reporter: kangkaisen
>            Assignee: kangkaisen
>         Attachments: KYLIN-2269.patch
>
>
> Currently, in {{Build Base Cuboid Data}}, if the user uses a global dict and the global dict is significantly larger than the mapper memory, the {{CachedTreeMap}} will load as many values as possible, and the soft-referenced objects will stick around for a while during GC. This makes the {{Build Base Cuboid Data}} mapper pause for a long time, or even fail to finish.


