Posted to issues@kylin.apache.org by "Xiaoxiang Yu (Jira)" <ji...@apache.org> on 2021/04/21 02:33:00 UTC

[jira] [Resolved] (KYLIN-4945) Repartition encoded dataset to avoid data skew caused by a single column

     [ https://issues.apache.org/jira/browse/KYLIN-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoxiang Yu resolved KYLIN-4945.
---------------------------------
    Resolution: Fixed

> Repartition encoded dataset to avoid data skew caused by a single column
> ------------------------------------------------------------------------
>
>                 Key: KYLIN-4945
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4945
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: v4.0.0-beta
>            Reporter: ShengJun Zheng
>            Assignee: ShengJun Zheng
>            Priority: Minor
>             Fix For: v4.0.0-GA
>
>         Attachments: image-2021-03-24-17-37-57-505.png
>
>
> In Kylin 4, the global dictionary is split into several buckets. To encode the flat source table more efficiently, the source dataset is repartitioned into as many partitions as the dictionary has buckets. This sometimes has a side effect, because repartitioning by a single column is more likely to cause data skew.
> We have a case where a topN/count_distinct measure has serious data skew. The dataset becomes skewed after the repartition, causing one task to take the majority of the time in the first layer's cuboid build job.
> !image-2021-03-24-17-37-57-505.png!
> To improve this case, we added a step that repartitions the encoded dataset by all RowKey columns, which reduced the first layer's build time from 20 min to 4 min.
>  
>  
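The effect described in the issue can be illustrated with a small standalone simulation. This is only a sketch, not Kylin's or Spark's actual code: the column names (seller_id, day, order_id), the 90% skew, the partition count of 8, and the SHA-256 based hash are all assumptions chosen for illustration (Spark uses its own hash function for hash partitioning). It compares partition sizes when rows are hash-partitioned on one skewed column versus on all key columns.

```python
import hashlib
from collections import Counter


def partition_sizes(rows, key_columns, num_partitions):
    """Return the row count of each partition after hash-partitioning on key_columns."""
    sizes = Counter()
    for row in rows:
        # Build one composite key from the chosen columns.
        key = "|".join(str(row[c]) for c in key_columns)
        # Stand-in hash (assumption): any well-mixed hash shows the same effect.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        sizes[int.from_bytes(digest[:8], "big") % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


# Hypothetical encoded flat table: 90% of rows share seller_id == 1.
rows = [
    {"seller_id": 1 if i % 10 else 1000 + i, "day": i % 30, "order_id": i}
    for i in range(10_000)
]

NUM_PARTITIONS = 8

# Partitioning on the skewed column sends every seller_id == 1 row to one partition.
by_single_column = partition_sizes(rows, ["seller_id"], NUM_PARTITIONS)

# Partitioning on all key columns spreads those rows across partitions.
by_all_columns = partition_sizes(
    rows, ["seller_id", "day", "order_id"], NUM_PARTITIONS
)

print("single column, largest partition:", max(by_single_column))
print("all columns,   largest partition:", max(by_all_columns))
```

With a single skewed column, every row carrying the dominant value hashes to the same partition, so one task ends up with most of the data. Adding the remaining key columns to the hash key breaks that tie, which is the same idea as repartitioning the encoded dataset by all RowKey columns.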



--
This message was sent by Atlassian Jira
(v8.3.4#803005)