Posted to issues@kylin.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/04/03 10:27:00 UTC

[jira] [Commented] (KYLIN-3925) Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files

    [ https://issues.apache.org/jira/browse/KYLIN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808587#comment-16808587 ] 

ASF GitHub Bot commented on KYLIN-3925:
---------------------------------------

kyotoYaho commented on pull request #580: KYLIN-3925 Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files
URL: https://github.com/apache/kylin/pull/580
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-3925
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3925
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>
> Previously, when doing cube optimization, there were two map-only MR jobs: *FilterRecommendCuboidDataJob* & *UpdateOldCuboidShardJob*. The benefit of a map-only job is that it avoids shuffling. However, this benefit brings a more severe issue: too many small HDFS files.
> Suppose there are 10 HDFS files for the current cuboid data, each of 500M. If the block size is 100M, there will be 10 * (500 / 100) = 50 mappers for the map-only job *FilterRecommendCuboidDataJob*, and each mapper generates one HDFS file, so there will be 50 HDFS files in the end. Since *FilterRecommendCuboidDataJob* filters out part of the cuboid data, each output file will be smaller than 100M, and in some cases even smaller than 50M.
> To avoid this small-file issue, it's better to add a reduce step to control the number of output HDFS files.
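To illustrate the general pattern (not the actual change in PR #580), the sketch below shows how a formerly map-only Hadoop MR job can be given an identity-style reduce phase so the output file count is fixed by the reducer count. All class names, key/value types, and the reducer-sizing heuristic here are illustrative assumptions, not Kylin's real code:

    // Minimal sketch, assuming Text keys/values in sequence files; the real
    // Kylin jobs use their own types and filtering logic.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class FilterCuboidJobSketch {

        // Hypothetical mapper: passes through only the rows that survive
        // the filtering step (the real filter condition is elided).
        public static class FilterMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                context.write(key, value); // ... filtering logic elided
            }
        }

        // Identity-style reducer: adds no logic; it exists only so the
        // shuffled output is merged into one file per reduce task.
        public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws java.io.IOException, InterruptedException {
                for (Text value : values) {
                    context.write(key, value);
                }
            }
        }

        public static Job createJob(Configuration conf, Path input, Path output,
                int numReducers) throws java.io.IOException {
            Job job = Job.getInstance(conf, "FilterRecommendCuboidData-sketch");
            job.setJarByClass(FilterCuboidJobSketch.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setMapperClass(FilterMapper.class);
            job.setReducerClass(MergeReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // The key change: instead of job.setNumReduceTasks(0) (map-only),
            // pick a reducer count sized to the expected output, e.g.
            // expectedOutputBytes / blockSize, so each output file is close
            // to the HDFS block size rather than a small fragment of it.
            job.setNumReduceTasks(numReducers);
            SequenceFileInputFormat.addInputPath(job, input);
            SequenceFileOutputFormat.setOutputPath(job, output);
            return job;
        }
    }

With this shape, the 50-mapper example above would produce, say, 5 files of roughly 500M each (with numReducers = 5) instead of 50 sub-100M files, at the cost of one shuffle.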



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)