Posted to issues@kylin.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/05/06 07:00:00 UTC
[jira] [Commented] (KYLIN-5530) Build Performance Optimization

    [ https://issues.apache.org/jira/browse/KYLIN-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720122#comment-17720122 ] 

ASF subversion and git services commented on KYLIN-5530:
--------------------------------------------------------

Commit 597a9367ec12cfe18c8878a548f07db514ce7e7c in kylin's branch refs/heads/kylin5 from Mingming Ge
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=597a9367ec ]

KYLIN-5530 remove repartition write
KYLIN-5530 Optimized snapshot builds
KYLIN-5530 Flat Table Repartition before writing data source tables/directories


> Build Performance Optimization
> ------------------------------
>
>                 Key: KYLIN-5530
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5530
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: 5.0-alpha
>            Reporter: Yaguang Jia
>            Assignee: Yaguang Jia
>            Priority: Major
>             Fix For: 5.0-beta
>
>         Attachments: (Chinese) KYLIN-5530 Build Performance Optimization.pdf, (English) KYLIN-5530 Build Performance Optimization.pdf
>
>
> 1. Remove the repartitionWriter method used when building indexes
> Background: on cloud deployments, repartitioning is too expensive because of the read and write IO cost of object storage, which makes the problem much more noticeable.
> Currently, building an index first writes the index data to a temp directory, then reads it back and repartitions it into new data files for storage. This approach wastes a lot of IO and should be removed: instead, repartition and write directly into the final index files by adapting Spark's repartition. The change needs to achieve the following goals:
> - Handle data skew
> - Avoid producing a large number of small files
> 2. When building a Flat Table, read dimension tables directly from the Snapshot files
> The reasons are as follows:
> - If the dimension table is a view, the view is computed once when building the snapshot and again when building the flat table, so a single build computes the dimension table view twice.
> - The data format and other properties of the source data are uncertain.
> Optimization direction: when building a flat table, do not read the dimension table from the source data; read the Snapshot file data directly.
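
The two goals listed under item 1 (handling skew and avoiding many small files) can be illustrated with a small planning sketch. This is not Kylin's implementation; the function name plan_output_files, the per-key size estimates, and the target file size are all hypothetical, and the real change works inside Spark's repartition rather than on a plain dictionary. The idea shown is only this: keys larger than the target file size are split across several files, and small keys are packed together until a file is full.

```python
import math


def plan_output_files(key_bytes, target_bytes):
    """Group keys into output files of at most target_bytes each.

    key_bytes: dict mapping a (hypothetical) partition key to its
               estimated size in bytes.
    Returns a list of files; each file is a list of (key, bytes) slices.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    files = []
    current, current_size = [], 0

    # Visit the largest keys first so skewed keys are split up front.
    for key, size in sorted(key_bytes.items(), key=lambda kv: -kv[1]):
        if size > target_bytes:
            # Skewed key: spread it evenly over several files.
            pieces = math.ceil(size / target_bytes)
            base, extra = divmod(size, pieces)
            for i in range(pieces):
                files.append([(key, base + (1 if i < extra else 0))])
            continue

        # Small key: pack it into the current file, starting a new
        # file when adding it would exceed the target size.
        if current and current_size + size > target_bytes:
            files.append(current)
            current, current_size = [], 0
        current.append((key, size))
        current_size += size

    if current:
        files.append(current)
    return files


if __name__ == "__main__":
    plan = plan_output_files({"a": 250, "b": 40, "c": 30, "d": 20}, 100)
    for i, f in enumerate(plan):
        print(i, f)
```

In this example the oversized key "a" is spread over three files and the three small keys share one file, so no file exceeds the target and no key gets a tiny file of its own. A plan like this would be computed once and applied in a single shuffle before the final write, instead of writing to a temp directory and repartitioning afterwards.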


