Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/07/23 14:53:18 UTC

[GitHub] [incubator-doris] ZhangYu0123 opened a new issue #4164: [Proposal] Compaction rules optimization

ZhangYu0123 opened a new issue #4164:
URL: https://github.com/apache/incubator-doris/issues/4164


   **Describe problems**
   At present, compaction in BE is divided into two parts: Base Compaction and Cumulative Compaction. Incremental overlapping rowsets are first merged by the Cumulative Compaction process, and the resulting non-overlapping rowsets are then merged by the Base Compaction process. When there are frequent small continuous write batches, a large number of small non-overlapping rowsets are handed to the Base Compaction process. This causes the following problems:
   (1) Each Base Compaction merges a batch of small files, so its efficiency is very low: merging many small files takes a long time, Base Compaction runs more often, and write amplification is severe.
   (2) Too many small files exist at the same time, so queries have to read too many files, which hurts query efficiency.
    
   **Solution**
   **Base compaction process**
   The base compaction process is unchanged: it keeps the existing flow and merges the versions from the base up to the Cumulative Point.
   
   **Cumulative compaction process**
   _**1  Cumulative Point calculation (changed part)**_
   The Cumulative Point is calculated once when the tablet is created or initialized.
   Traverse the rowsets in version order; the Cumulative Point advances as long as the following conditions hold, otherwise the traversal breaks (see the sketch after this list):
   - The current rowset is a non-overlapping type.
   - The rowset's disk size is greater than base_compaction_delta_ratio (2%) of the base version, or greater than or equal to min_base_compaction_promotion_size (2G).
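
   A minimal sketch of this initialization step, assuming illustrative types and thresholds (Rowset, disk_size, the 2% ratio and 2G promotion size) rather than the actual Doris BE interfaces:

   ```cpp
   // Sketch only: advance the cumulative point while traversing rowsets in
   // version order during tablet initialization. Names are placeholders,
   // not the real BE types or config keys.
   #include <cstdint>
   #include <vector>

   struct Rowset {
       int64_t end_version = 0;
       int64_t disk_size = 0;        // bytes
       bool nonoverlapping = false;  // true once the rowset has been compacted
   };

   int64_t init_cumulative_point(const std::vector<Rowset>& rowsets_by_version,
                                 int64_t base_size,
                                 double delta_ratio = 0.02,            // base_compaction_delta_ratio
                                 int64_t promotion_size = 2LL << 30) { // min_base_compaction_promotion_size (2G)
       int64_t cumulative_point = 0;
       for (const Rowset& rs : rowsets_by_version) {
           bool big_enough = rs.disk_size > static_cast<int64_t>(base_size * delta_ratio) ||
                             rs.disk_size >= promotion_size;
           if (!rs.nonoverlapping || !big_enough) {
               break;                              // stop at the first rowset that fails
           }
           cumulative_point = rs.end_version + 1;  // move the point past this rowset
       }
       return cumulative_point;
   }
   ```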
   
   2  Select the rowsets to be merged by Cumulative Compaction
   2.1 Collect the candidate set candidate_rowsets (consistent with the original process)
   - Select all versions greater than or equal to the Cumulative Point and sort them.
   - Collect rowset versions from the candidate set, stopping when the compaction_score reaches 1000 or the traversal reaches the end.
   - If candidate_rowsets contains a rowset with a delete condition, only the versions before it are collected, and the Cumulative Point is moved past the delete rowset.
   
   _**2.2 Create input_rowsets (changed part)**_
   (1) Definitions (see the level_size sketch after this list):
   - s_total: the total disk size of the candidate set candidate_rowsets.
   - s1: the disk size of the first version in candidate_rowsets.
   - level_size(n): the disk-size level that n maps to, one of [1G~2G), [512M~1G), [256M~512M), [128M~256M), [64M~128M); these levels are derived from min_base_compaction_promotion_size.
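
   A minimal sketch of the level_size mapping, assuming the levels are produced by repeatedly halving min_base_compaction_promotion_size down to the 64M lower limit (illustrative code, not the BE implementation):

   ```cpp
   // Sketch only: map a rowset size to the lower bound of its size level.
   // Levels: [1G,2G), [512M,1G), [256M,512M), [128M,256M), [64M,128M).
   #include <cstdint>

   int64_t level_size(int64_t size_bytes,
                      int64_t promotion_size = 2LL << 30,    // min_base_compaction_promotion_size (2G)
                      int64_t smallest_limit = 64LL << 20) { // smallest_compaction_limit (64M)
       int64_t level = promotion_size / 2;                   // top level lower bound: 1G
       while (level >= smallest_limit) {
           if (size_bytes >= level) {
               return level;                                 // size falls into [level, 2 * level)
           }
           level /= 2;
       }
       return 0;                                             // below the smallest level
   }
   ```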
   
   (2) Implementation process (see the sketch after this list):
   - If s_total is greater than or equal to base_compaction_delta_ratio (2%) of the base version, or greater than or equal to min_base_compaction_promotion_size (2G), or s_total <= smallest_compaction_limit (64M), assign candidate_rowsets directly to input_rowsets and exit.
   - If level_size(s_total - s1) >= level_size(s1), assign candidate_rowsets directly to input_rowsets and exit.
   - If level_size(s_total - s1) < level_size(s1), remove the first version from candidate_rowsets, go back to the first step, and repeat.
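
   A minimal sketch of this selection loop, working on plain byte sizes and reusing the level_size sketch above (the helper names and defaults are assumptions, not the real BE code):

   ```cpp
   // Sketch only: trim candidate rowsets from the front until what remains is
   // a sensible cumulative merge, then return it as the input set.
   #include <cstdint>
   #include <numeric>
   #include <vector>

   int64_t level_size(int64_t size_bytes,
                      int64_t promotion_size = 2LL << 30,
                      int64_t smallest_limit = 64LL << 20);  // as sketched above

   std::vector<int64_t> pick_input_rowsets(std::vector<int64_t> candidate_sizes,  // one size per candidate rowset
                                           int64_t base_size,
                                           double delta_ratio = 0.02,
                                           int64_t promotion_size = 2LL << 30,
                                           int64_t smallest_limit = 64LL << 20) {
       while (!candidate_sizes.empty()) {
           int64_t s_total = std::accumulate(candidate_sizes.begin(),
                                             candidate_sizes.end(), int64_t{0});
           int64_t s1 = candidate_sizes.front();

           // Rule 1: big enough to promote, or too small for trimming to matter.
           if (s_total >= static_cast<int64_t>(base_size * delta_ratio) ||
               s_total >= promotion_size || s_total <= smallest_limit) {
               return candidate_sizes;
           }
           // Rule 2: the rest of the set is at least as "large" (by level) as the first rowset.
           if (level_size(s_total - s1) >= level_size(s1)) {
               return candidate_sizes;
           }
           // Rule 3: the first rowset dominates; drop it and re-evaluate.
           candidate_sizes.erase(candidate_sizes.begin());
       }
       return candidate_sizes;  // empty: nothing worth merging this round
   }
   ```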
   
   (3) Check input_rowsets (consistent with the original process)
   - If the compaction_score of input_rowsets is less than the minimum score of 5, input_rowsets is cleared and no merge is performed.
   - If input_rowsets is empty, apply a time-based fallback check: if no merge has been performed for more than one day and candidate_rowsets contains an overlapping rowset, assign candidate_rowsets directly to input_rowsets and exit.
   
   3 Execute compaction (consistent with the original process)
   
   _**4 Cumulative Point update (changed part)**_
   If the disk size of the newly generated rowset is greater than base_compaction_delta_ratio (2%) of the base rowset or greater than or equal to min_base_compaction_promotion_size (2G), the Cumulative Point moves forward; otherwise it does not move. A sketch of this check follows.
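
   A minimal sketch of this promotion decision, using the same illustrative thresholds as above:

   ```cpp
   // Sketch only: decide whether the cumulative point moves past the rowset
   // produced by this cumulative compaction.
   #include <cstdint>

   bool promote_cumulative_point(int64_t output_rowset_size, int64_t base_size,
                                 double delta_ratio = 0.02,            // base_compaction_delta_ratio
                                 int64_t promotion_size = 2LL << 30) { // min_base_compaction_promotion_size (2G)
       return output_rowset_size > static_cast<int64_t>(base_size * delta_ratio) ||
              output_rowset_size >= promotion_size;
   }
   ```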
   
   
   **Evaluation of the merge effect**
   1. Configuration values used
   smallest_compaction_limit: 64M
   min_base_compaction_promotion_size: 2G
   base_compaction_delta_ratio: 2%
   
   2. File number growth trend
   ![image](https://user-images.githubusercontent.com/67053339/88299940-836ad980-cd35-11ea-9a54-96d9998abcad.png)
   
   _**3. Comparison of the two strategies**_
   This covers both base compaction and cumulative compaction. Assume each import produces a file of size d, a cumulative compaction runs after every q imports, a base compaction runs after every k cumulative compactions, and a total of n imports of size d are executed. Assume k is greater than 4.
   
   3.1 Old strategy
   Maximum number of existing files = k + q
   Number of base compactions p = n / (k * q)
   Base IO write volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
   Cumulative IO write volume = d * n
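
   For reference, the Base IO figure above follows from summing what each of the p base compactions rewrites, assuming the i-th one rewrites roughly i * d * q * k bytes (the base accumulated so far); this is a rough worst-case model, not an exact accounting:

   ```latex
   % Old strategy: total data written by base compaction over n imports.
   \begin{aligned}
   \text{Base IO} &= \sum_{i=1}^{p} i \cdot d\,q\,k
                   = d\,q\,k \cdot \frac{p\,(p+1)}{2},
                   \qquad p = \frac{n}{k\,q} \\
                  &= \frac{d\,n\,(n + k\,q)}{2\,k\,q}
                   = O(n^{2}) \quad \text{for fixed } d,\ q,\ k.
   \end{aligned}
   ```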
   
   3.2 New strategy
   3.2.1 Number of files:
   a) When q * d > 2G:
   Maximum number of existing files = k + 6
   b) When q * d < 2G:
   Maximum number of existing files = k * q * d / 2G + 6
   This avoids having too many files when there are many small batches.
   
   3.2.2 Base IO write amplification:
   a) When d * q * k >= 8G:
   Number of base compactions p = n / (k * q)
   Base IO write volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
   
   **_In this case, the Base Compaction IO is essentially the same as under the previous strategy._**
   
   b) When d * q * k < 8G and the data volume is less than 100G:
   Number of base compactions p = log_{1.08}(n) (logarithm of n to base 1.08), which does not depend on k
   Base IO write volume = d * (1 - n / 1.08) / (1 - 1.08) = 18.51 * n * d - 20 * d
   
   **_In this case, Base Compaction IO is O(n), which is significantly better than the previous O(n^2) behavior. The 8% comes from the base compaction rule that triggers a merge once more than 5 versions (including the base) have accumulated; the worst case is evaluated here._**
   
   c) When d * q * k < 8G and the data volume is greater than 100G:
   Number of base compactions p = n * d / 8G
   Base IO write volume = 8G * p * (p + 1) / 2 = n * d * (n * d / 8G + 1) / 2
   
   _**In this case, the Base Compaction IO has the same form as before, but write amplification improves by roughly a factor of 8G / (d * q * k). A worked substitution follows._**
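
   Substituting p = n * d / 8G into the same summation gives the case (c) volume, again as a rough worst-case model:

   ```latex
   % New strategy, case (c): base compaction runs once per roughly 8G of promoted data.
   \begin{aligned}
   \text{Base IO} &= \sum_{i=1}^{p} i \cdot 8G
                   = 8G \cdot \frac{p\,(p+1)}{2},
                   \qquad p = \frac{n\,d}{8G} \\
                  &= \frac{n\,d}{2}\left(\frac{n\,d}{8G} + 1\right).
   \end{aligned}
   ```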
   
   3.2.3 Cumulative IO write amplification:
   a) When q * d > 2G:
   Cumulative IO write volume = n * d
   b) When q * d < 2G:
   Cumulative IO write volume = (log(2G / (q * d)) + 1) * n * d
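
   One way to read the (log(2G / (q * d)) + 1) factor, assuming each cumulative output of size q * d is re-merged once per doubling of its size level until it reaches the 2G promotion size (an interpretation consistent with the level scheme above, not a statement from the original analysis):

   ```latex
   % Each imported byte is written once by its first cumulative merge and once
   % more per level doubling on the way from q*d up to 2G.
   \text{Cumulative IO} \approx n\,d \cdot \Bigl(1 + \log_{2}\tfrac{2G}{q\,d}\Bigr)
   ```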
   
   3.3 Conclusion:
   **_When q * d < 2G, Cumulative IO pays an extra log(2G / (q * d)) rounds of write amplification but reduces the file count by k - k * q * d / 2G.
   When k * q * d <= 8G, Base IO write amplification is greatly reduced.
   When q * d > 2G, the behavior is the same as before._**
   




[GitHub] [incubator-doris] ZhangYu0123 commented on issue #4164: [Proposal] Compaction rules optimization

ZhangYu0123 commented on issue #4164:
URL: https://github.com/apache/incubator-doris/issues/4164#issuecomment-669154779


   ## Self-test performance data
   base rowset size: 10G

   ### Optimized version
   **_Import a 2M file every 5s, 1000 times in total:_**
   
   base compaction times = 2,  cumulative times = 115
   max number of files after cumulative point = 4
   max number of files before cumulative point = 16
   
   **_Import a 42M file every 5s, 100 times in total:_**
   
   base compaction times = 2,  cumulative times = 80
   max number of files after cumulative point = 6
   max number of files before cumulative point = 9
   
   ### Original version

   **_Import a 2M file every 5s, 1000 times in total:_**
   
   base compaction times = 5,  cumulative times = 125
   max number of files after cumulative point = 74
   max number of files before cumulative point = 12
   
   **_Import a 42M file every 5s, 100 times in total:_**
   
   base compaction times = 3,  cumulative times = 100
   max number of files after cumulative point = 95
   max number of files before cumulative point = 2
   
   
   






[GitHub] [incubator-doris] ZhangYu0123 commented on issue #4164: [Proposal] Compaction rules optimization

ZhangYu0123 commented on issue #4164:
URL: https://github.com/apache/incubator-doris/issues/4164#issuecomment-663071982


   ### Problem description
   At present, compaction in BE is divided into two parts: Base and Cumulative. Incremental overlapping rowsets are first merged by the Cumulative Compaction process and then handed to the Base Compaction process for merging. When there are frequent small continuous write batches, a large number of rowsets are handed directly to Base Compaction. This causes the following problems:
   (1) Each Base Compaction merges a batch of small files, so its efficiency is low: merging many small files takes a long time, Base Compaction runs more often, and write amplification is severe.
   (2) Too many small files exist at the same time, so queries have to read and merge too many files, which hurts query efficiency.
   
   ### Solution
   To balance merge efficiency (reducing IO during merges) against keeping the file count low for query efficiency, the following improvement is proposed:
   ### 1. Base compaction process
   The base compaction process is unchanged: it keeps the existing flow and merges the versions from the base up to the Cumulative Point.
   ### 2. Cumulative compaction process
   ### 2.1 Cumulative Point calculation
   This is executed once when the tablet is newly created or initialized. Traverse the rowsets in version order; the Cumulative Point advances as long as the following conditions hold, otherwise break:
   - The current rowset is a non-overlapping type.
   - The rowset's disk size is greater than 2% of the base version or greater than or equal to 2G.
   ### 2.2 Select the rowsets to be merged by Cumulative Compaction
   ### 2.2.1 Collect the candidate set candidate_rowsets (consistent with the original process)
   - Select all versions greater than or equal to the Cumulative Point and sort them.
   - Collect rowset versions from the candidate set, stopping when the compaction_score reaches 1000 or the traversal reaches the end.
   - If candidate_rowsets contains a rowset with a delete condition, only the versions before the delete are collected, and the Cumulative Point is moved past the delete rowset.
   ### 2.2.2 Create input_rowsets (changed part)
   **Definitions:**
   - s_total: the total disk size of the candidate set candidate_rowsets.
   - s1: the disk size of the first version in candidate_rowsets.
   - level_size(n): the disk-size level that n maps to, one of [1G~2G), [512M~1G), [256M~512M), [128M~256M), [64M~128M).
   
   **Execution process:**
   - If s_total is greater than or equal to 2% of the base version, or greater than or equal to 2G, or s_total <= 64M, assign candidate_rowsets directly to input_rowsets and exit.
   - If level_size(s_total - s1) >= level_size(s1), assign candidate_rowsets directly to input_rowsets and exit.
   - If level_size(s_total - s1) < level_size(s1), remove the first version from candidate_rowsets, go back to the first step, and repeat.
   
   ### 2.2.3 Check input_rowsets (consistent with the original process)
   - If the compaction_score of input_rowsets is less than the minimum score of 5, input_rowsets is cleared and no merge is performed.
   - If input_rowsets is empty, apply a time-based fallback check: if no merge has been performed for more than one day and candidate_rowsets contains an overlapping rowset, assign candidate_rowsets directly to input_rowsets and exit.
   
   ### 2.3 Execute the merge
   ### 2.4 Cumulative Point update (changed part)
   If the disk size of the newly generated rowset is greater than 2% of the base rowset or greater than or equal to 2G, the Cumulative Point moves forward; otherwise it does not move.
   
   ### Evaluation of the merge effect
   ### 1 Configuration reference
   smallest_compaction_limit: 64M
   min_base_compaction_promotion_size: 2G
   base_compaction_delta_ratio: 2%
   
   ### 2 File growth
   Based on the algorithm described above, in the worst case, where after Cumulative Compaction every file has reached its level threshold without being merged further, the number of remaining files is logarithmic (log2). See the table below.
   ![image](https://user-images.githubusercontent.com/67053339/88304370-bfed0400-cd3a-11ea-840d-1b9d66042a75.png)
   
   ### 3 Before/after comparison
   This covers both base compaction and cumulative compaction. Assume each import produces a file of size d, a cumulative compaction runs after every q imports, a base compaction runs after every k cumulative compactions, and a total of n imports of size d are executed. Assume k is greater than 4.
   
   Base IO write amplification can be understood as the number of times Base Compaction runs during the n imports, because each Base Compaction rewrites the entire base; comparing the number of Base Compactions therefore shows the write amplification clearly.
   
   ### 3.1 Previous process
   Maximum number of existing files = k + q    // k is not deterministic
   Number of base compactions p = n / (k * q)
   Base IO write volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
   Cumulative IO write volume = d * n
   
   ### 3.2 Modified process
   ### 3.2.1 Number of files (the same regardless of the cases discussed below):
   When q * d > 2G:
   Maximum number of existing files = k + 6
   When q * d < 2G:
   Maximum number of existing files = k * q * d / 2G + 6
   This avoids having too many files when there are many small batches.
   
   ### 3.2.2 Base IO write amplification:
   a) When d * q * k >= 8G:
   
          Number of base compactions p = n / (k * q)
          Base IO write volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
   
   In this case, the Base Compaction IO is essentially the same as under the previous process.
   
   b) When d * q * k < 8G and the data volume is less than 100G:
   
          Number of base compactions p = log_{1.08}(n), which does not depend on k
          Base IO write volume = d * (1 - n / 1.08) / (1 - 1.08) = 18.51 * n * d - 20 * d
   
   In this case, Base Compaction IO is O(n), which is significantly better than the previous O(n^2) behavior. The 8% comes from the base compaction rule that triggers a merge once more than 5 versions (including the base) have accumulated; the worst case is evaluated here.
   
   c) When d * q * k < 8G and the data volume is greater than 100G:
   
          Number of base compactions p = n * d / 8G
          Base IO write volume = 8G * p * (p + 1) / 2 = n * d * (n * d / 8G + 1) / 2
   
   In this case, the Base Compaction IO has the same form as before, but write amplification improves by roughly a factor of 8G / (d * q * k).
   
   ### 3.2.3 Cumulative IO write amplification:
   When q * d > 2G:
   
              Cumulative IO write volume = n * d
   
   When q * d < 2G:
   
              Cumulative IO write volume = (log(2G / (q * d)) + 1) * n * d
   
   
   ### 3.3 Overall conclusion:
   When q * d < 2G, Cumulative IO pays an extra log(2G / (q * d)) rounds of write amplification but reduces the file count by k - k * q * d / 2G.
   When k * q * d <= 8G, Base IO write amplification is greatly reduced.
   When q * d > 2G, the behavior is the same as before.
   
   
   
   
   
   
   
   
   
   
   






[GitHub] [incubator-doris] morningman commented on issue #4164: [Proposal] Compaction rules optimization

morningman commented on issue #4164:
URL: https://github.com/apache/incubator-doris/issues/4164#issuecomment-669208938


   > Self-test performance data
   
   Nice job. I think this test result can be put into the code comment directly.




[GitHub] [incubator-doris] morningman closed issue #4164: [Proposal] Compaction rules optimization

morningman closed issue #4164:
URL: https://github.com/apache/incubator-doris/issues/4164


   










[GitHub] [incubator-doris] chaoyli commented on issue #4164: [Proposal] Compaction rules optimization

chaoyli commented on issue #4164:
URL: https://github.com/apache/incubator-doris/issues/4164#issuecomment-666969062


   Sorry, I do not quite understand the point of the rule and the algorithm.
   Can you add me on WeChat to discuss it?
   WeChat: 15652918147

