Posted to dev@doris.apache.org by 陈明雨 <mo...@163.com> on 2019/10/19 12:40:39 UTC

[Proposal] Limit the memory usage of Compaction

With the widespread use of the new load framework, the existing compaction strategies no longer work in some scenarios. This document describes the problems that the new load framework causes for the compaction logic and proposes how to address them.


## Problem


In the new load framework, loaded data forms a series of `Memtables` in memory. When the size of a memtable reaches the threshold (100MB by default), it is written to disk as a `Segment`. One batch of load corresponds to one `Version`. When a batch of loaded data is relatively large, or the rows of a table are large, a single batch may generate thousands of segments.


In the compaction logic, each compaction selects at least one whole version. Compaction is an external sort that opens one `RowBlock` for each segment, with 1024 rows per RowBlock. So a RowBlock occupies (1024 * row size) of memory.


Assuming that a Compaction covers 1000 Segments and each row is 4KB in size, the RowBlocks alone take up about 4GB of memory (1000 * 1024 * 4KB). When multiple Compactions run at the same time, the process may run out of memory (OOM).
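
The arithmetic behind that figure can be checked with a small sketch. The function name and the constant are illustrative only and are not taken from the Doris code base; the 1024 rows per RowBlock comes from the description above.

```python
# Hypothetical helper, for illustration only (not a Doris API).
ROWS_PER_ROWBLOCK = 1024  # rows per RowBlock, as stated in the text


def rowblock_memory_bytes(num_segments, row_size_bytes,
                          rows_per_block=ROWS_PER_ROWBLOCK):
    """Memory held by RowBlocks when one RowBlock is open per segment."""
    return num_segments * rows_per_block * row_size_bytes


total = rowblock_memory_bytes(num_segments=1000, row_size_bytes=4 * 1024)
print(f"{total} bytes, about {total / 1024**3:.2f} GiB")
```

The estimate grows linearly with both the number of segments and the row size, which is why a single large version is enough to cause trouble.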


## Solution


This proposal is to ensure that Compaction can run stably with less memory by estimating and limiting the amount of memory used by Compaction. This work is divided into the following three steps.


### Compaction ratio statistic


Estimating the memory used by a Compaction mainly comes down to estimating the size of a row in memory. We can simply take the ratio of a memtable's size in memory to the size of the file it produces on disk as the compaction ratio. With this ratio, the size of a data file on disk, and the number of rows in that file, we can calculate the approximate memory footprint of a single row.
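
A minimal sketch of this estimate follows. The function and parameter names are assumptions made for illustration; the proposal does not specify where the ratio is recorded or how it is averaged across memtables.

```python
# Hypothetical helpers, for illustration only (not Doris APIs).

def compaction_ratio(memtable_bytes_in_memory, flushed_file_bytes):
    """Ratio of a memtable's in-memory size to the size of its file on disk."""
    if flushed_file_bytes <= 0:
        raise ValueError("flushed file size must be positive")
    return memtable_bytes_in_memory / flushed_file_bytes


def estimated_row_bytes(ratio, segment_file_bytes, segment_num_rows):
    """Approximate in-memory size of one row of a segment."""
    if segment_num_rows <= 0:
        raise ValueError("segment must contain at least one row")
    return ratio * segment_file_bytes / segment_num_rows


MB = 1024 * 1024
# Made-up numbers: a 100MB memtable that was flushed into a 25MB file.
ratio = compaction_ratio(100 * MB, 25 * MB)
row_bytes = estimated_row_bytes(ratio, 25 * MB, 25600)
print(ratio, row_bytes)
```

With these made-up numbers the ratio is 4.0 and the estimated row size is 4096 bytes.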


### Support compaction within a version


Currently, a Compaction must cover at least one whole version. If a single version contains too many Segments, the Compaction still consumes a lot of memory. So we need to support compacting a subset of the segments within a single version.
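
One possible way to pick such subsets is sketched below, assuming a per-compaction memory budget and an estimated row size are available. The proposal does not say how the subsets are chosen; splitting the segments into contiguous batches, and all names here, are assumptions for illustration.

```python
# Hypothetical planning helpers, for illustration only (not Doris APIs).
ROWS_PER_ROWBLOCK = 1024  # rows per RowBlock, as stated in the text


def max_segments_per_compaction(memory_budget_bytes, row_bytes,
                                rows_per_block=ROWS_PER_ROWBLOCK):
    """How many segments can be open at once within the memory budget."""
    per_segment_bytes = rows_per_block * row_bytes
    return int(memory_budget_bytes // per_segment_bytes)


def plan_batches(segment_ids, max_segments):
    """Split one version's segments into contiguous batches."""
    if max_segments < 2:
        raise ValueError("budget too small to merge even two segments")
    return [segment_ids[i:i + max_segments]
            for i in range(0, len(segment_ids), max_segments)]


limit = max_segments_per_compaction(512 * 1024 * 1024, row_bytes=4096)
batches = plan_batches(list(range(1000)), limit)
print(limit, len(batches), len(batches[-1]))
```

With a 512MB budget and 4KB rows, each compaction can open 128 segments, so a version with 1000 segments is handled as 8 batches instead of a single 1000-way merge.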


### Limiting Compaction memory usage


With the previous two steps, it becomes possible to estimate and limit the memory usage of a single Compaction. Finally, we need an overall limit so that the total memory overhead stays within a reasonable range when multiple Compactions run at the same time.
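
A minimal sketch of such an overall limit is a shared budget that each Compaction reserves from before it starts and releases when it finishes. The class and method names are hypothetical, and the proposal does not prescribe this mechanism.

```python
import threading


class CompactionMemoryBudget:
    """Hypothetical shared budget for all running compactions (not a Doris API)."""

    def __init__(self, limit_bytes):
        self._limit = limit_bytes
        self._used = 0
        self._lock = threading.Lock()

    def try_reserve(self, nbytes):
        """Reserve nbytes if it fits under the limit; return True on success."""
        with self._lock:
            if self._used + nbytes > self._limit:
                return False
            self._used += nbytes
            return True

    def release(self, nbytes):
        """Return nbytes to the budget once a compaction has finished."""
        with self._lock:
            self._used = max(0, self._used - nbytes)


GB = 1024 ** 3
budget = CompactionMemoryBudget(limit_bytes=2 * GB)
first = budget.try_reserve(1536 * 1024 ** 2)   # 1.5GB fits under the limit
second = budget.try_reserve(1024 ** 3)         # another 1GB would exceed 2GB
budget.release(1536 * 1024 ** 2)
third = budget.try_reserve(1024 ** 3)          # fits after the release
```

A Compaction whose reservation fails could be postponed or retried with a smaller subset of segments; which of these to do is left open here.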


--
Best Regards
陈明雨 Mingyu Chen

Email:
chenmingyu@apache.org