Posted to issues@kylin.apache.org by "nichunen (JIRA)" <ji...@apache.org> on 2019/07/17 03:56:00 UTC

[jira] [Resolved] (KYLIN-3925) Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files

     [ https://issues.apache.org/jira/browse/KYLIN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nichunen resolved KYLIN-3925.
-----------------------------
       Resolution: Fixed
    Fix Version/s:     (was: v3.0.0)
                   v3.0.0-alpha2

> Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-3925
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3925
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>             Fix For: v3.0.0-alpha2
>
>
> Previously, cube optimization ran two map-only MR jobs: *FilterRecommendCuboidDataJob* and *UpdateOldCuboidShardJob*. A map-only job avoids shuffling, but that benefit comes with a more severe problem: too many small HDFS files.
> Suppose the current cuboid data consists of 10 HDFS files of 500 MB each. With a block size of 100 MB, the map-only job *FilterRecommendCuboidDataJob* runs 10*(500/100) = 50 mappers, and each mapper writes one HDFS file, so the job ends up with 50 HDFS files. Because *FilterRecommendCuboidDataJob* keeps only the cuboid data that will be used in the future, each output file is smaller than 100 MB, and in some cases smaller than 50 MB.
> To avoid this small-file problem, it is better to add a reduce step that controls the number of final output HDFS files.
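
The arithmetic in the description can be checked with a small standalone sketch. It is not Kylin code: the class name, the method names, the 2000 MB of retained data, and the 500 MB target file size are all hypothetical, and it assumes exactly one input split per HDFS block, which simplifies how Hadoop actually computes splits.

```java
import java.util.Arrays;

public class SmallFileEstimate {

    /** Splits (and hence mappers) for one file, assuming one split per HDFS block. */
    public static long splitsForFile(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division: a trailing partial block still needs its own mapper.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    /** A map-only job writes one output file per mapper. */
    public static long mapOnlyOutputFiles(long[] fileSizesBytes, long blockSizeBytes) {
        long total = 0;
        for (long size : fileSizesBytes) {
            total += splitsForFile(size, blockSizeBytes);
        }
        return total;
    }

    /** Reducers needed so that each output file is roughly targetFileSizeBytes; never fewer than one. */
    public static int reducersFor(long estimatedOutputBytes, long targetFileSizeBytes) {
        long n = (estimatedOutputBytes + targetFileSizeBytes - 1) / targetFileSizeBytes;
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        final long mb = 1024L * 1024L;

        // The example from the description: 10 files of 500 MB, 100 MB blocks.
        long[] files = new long[10];
        Arrays.fill(files, 500 * mb);
        long mapOnly = mapOnlyOutputFiles(files, 100 * mb);

        // Hypothetical assumption: filtering keeps 2000 MB of the 5000 MB input,
        // and the desired output file size is 500 MB.
        int reducers = reducersFor(2000 * mb, 500 * mb);

        System.out.println("map-only output files: " + mapOnly);   // prints 50
        System.out.println("output files with reduce step: " + reducers); // prints 4
    }
}
```

With a reduce step the number of output files equals the number of reducers, so a value computed along these lines could be passed to Hadoop's `Job.setNumReduceTasks(int)`. How the resolved patch actually chooses the reducer count is not stated in this message.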



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)