Posted to commits@druid.apache.org by jo...@apache.org on 2019/01/24 00:21:51 UTC

[incubator-druid] branch master updated: Improve doc for auto compaction (#6782)

This is an automated email from the ASF dual-hosted git repository.

jonwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-druid.git


The following commit(s) were added to refs/heads/master by this push:
     new 3b020fd  Improve doc for auto compaction (#6782)
3b020fd is described below

commit 3b020fd81bebecf52dc7edb48047008052603a71
Author: Jihoon Son <ji...@apache.org>
AuthorDate: Wed Jan 23 16:21:45 2019 -0800

    Improve doc for auto compaction (#6782)
    
    * Improve doc for auto compaction
    
    * address comments
    
    * address comments
    
    * address comments
---
 docs/content/configuration/index.md | 19 +++++++++++-------
 docs/content/design/coordinator.md  | 40 ++++++++++++++++++++++++-------------
 2 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/docs/content/configuration/index.md b/docs/content/configuration/index.md
index fa66424..6d80de3 100644
--- a/docs/content/configuration/index.md
+++ b/docs/content/configuration/index.md
@@ -828,14 +828,14 @@ A description of the compaction config is:
 |--------|-----------|--------|
 |`dataSource`|dataSource name to be compacted.|yes|
 |`keepSegmentGranularity`|Set [keepSegmentGranularity](../ingestion/compaction.html) to true for compactionTask.|no (default = true)|
-|`taskPriority`|[Priority](../ingestion/tasks.html#task-priorities) of compact task.|no (default = 25)|
-|`inputSegmentSizeBytes`|Total input segment size of a compactionTask.|no (default = 419430400)|
-|`targetCompactionSizeBytes`|The target segment size of compaction. The actual size of a compact segment might be slightly larger or smaller than this value. This configuration cannot be used together with `maxRowsPerSegment`.|no (default = 419430400 if `maxRowsPerSegment` is not specified)|
+|`taskPriority`|[Priority](../ingestion/tasks.html#task-priorities) of compaction tasks.|no (default = 25)|
+|`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 1–2GB will result in compaction tasks taking an excessive amount of time.|no (default = 419430400)|
+|`targetCompactionSizeBytes`|The target segment size, for each segment, after compaction. The actual sizes of compacted segments might be slightly larger or smaller than this value. Each compaction task may generate more than one output segment, and it will try to keep each output segment close to this configured size. This configuration cannot be used together with `maxRowsPerSegment`.|no (default = 419430400)|
 |`maxRowsPerSegment`|Max number of rows per segment after compaction. This configuration cannot be used together with `targetCompactionSizeBytes`.|no|
-|`maxNumSegmentsToCompact`|Max number of segments to compact together.|no (default = 150)|
+|`maxNumSegmentsToCompact`|Maximum number of segments to compact together per compaction task. Since a time chunk must be processed in its entirety, if a time chunk has a total number of segments greater than this parameter, compaction will not run for that time chunk.|no (default = 150)|
 |`skipOffsetFromLatest`|The offset for searching segments to be compacted. Strongly recommended to set for realtime dataSources. |no (default = "P1D")|
-|`tuningConfig`|Tuning config for compact tasks. See [Compaction TuningConfig](#compaction-tuningconfig).|no|
-|`taskContext`|[Task context](../ingestion/tasks.html#task-context) for compact tasks.|no|
+|`tuningConfig`|Tuning config for compaction tasks. See [Compaction TuningConfig](#compaction-tuningconfig) below.|no|
+|`taskContext`|[Task context](../ingestion/tasks.html#task-context) for compaction tasks.|no|
 
 An example of compaction config is:
 
@@ -845,7 +845,12 @@ An example of compaction config is:
 }
 ```
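+
+As a concrete illustration, a hypothetical compaction config for a dataSource named `wikipedia`
+(the name and all values here are examples; every parameter except `dataSource` is optional) might look like:
+
+```json
+{
+  "dataSource": "wikipedia",
+  "keepSegmentGranularity": true,
+  "taskPriority": 25,
+  "inputSegmentSizeBytes": 419430400,
+  "targetCompactionSizeBytes": 419430400,
+  "maxNumSegmentsToCompact": 150,
+  "skipOffsetFromLatest": "P1D"
+}
+```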
 
-For realtime dataSources, it's recommended to set `skipOffsetFromLatest` to some sufficiently large value to avoid frequent compact task failures.
+Note that compaction tasks can fail if their locks are revoked by other tasks with higher priorities.
+Since realtime tasks have a higher priority than compaction tasks by default,
+frequent conflicts between compaction tasks and realtime tasks are problematic:
+the coordinator's automatic compaction can get stuck because the compaction tasks keep failing.
+This is especially likely with Kafka/Kinesis indexing, which allows late data to arrive.
+If you see this problem, it's recommended to set `skipOffsetFromLatest` to a value large enough to avoid such conflicts.
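+For example, for a Kafka-ingested dataSource whose events may arrive up to a day late,
+a `skipOffsetFromLatest` of `"P3D"` (a hypothetical value; tune it to your ingestion lag) keeps
+automatic compaction three days behind the latest data, outside the window realtime tasks are still writing to.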
 
 ##### Compaction TuningConfig
 
diff --git a/docs/content/design/coordinator.md b/docs/content/design/coordinator.md
index 90f5e28..e8dea29 100644
--- a/docs/content/design/coordinator.md
+++ b/docs/content/design/coordinator.md
@@ -69,34 +69,46 @@ Each run, the Druid coordinator compacts small segments abutting each other. Thi
 segments which may degrade the query performance as well as increase the disk space usage.
 
 The coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy).
-Once some segments are found, it launches a [compact task](../ingestion/tasks.html#compaction-task) to compact those segments.
-The maximum number of running compact tasks is `min(sum of worker capacity * slotRatio, maxSlots)`.
-Note that even though `min(sum of worker capacity * slotRatio, maxSlots)` = 0, at least one compact task is always submitted
+Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments.
+The maximum number of running compaction tasks is `min(sum of worker capacity * slotRatio, maxSlots)`.
+Note that even if `min(sum of worker capacity * slotRatio, maxSlots)` is 0, at least one compaction task is always submitted
 if compaction is enabled for a dataSource.
 See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration) and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) to enable the compaction.
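+For example (hypothetical numbers), with 10 workers of capacity 3 each, a `slotRatio` of 0.1, and a `maxSlots` of 10,
+at most `min(30 * 0.1, 10) = 3` compaction tasks can run at the same time.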
 
-Compact tasks might fail due to some reasons.
+Compaction tasks might fail for the following reasons:
 
-- If the input segments of a compact task are removed or overshadowed before it starts, that compact task fails immediately.
-- If a task of a higher priority acquires a lock for an interval overlapping with the interval of a compact task, the compact task fails.
+- If the input segments of a compaction task are removed or overshadowed before it starts, that compaction task fails immediately.
+- If a task of a higher priority acquires a lock for an interval overlapping with the interval of a compaction task, the compaction task fails.
 
-Once a compact task fails, the coordinator simply finds the segments for the interval of the failed task again, and launches a new compact task in the next run.
+Once a compaction task fails, the coordinator simply finds the segments for the interval of the failed task again, and launches a new compaction task in the next run.
 
 ### Segment Search Policy
 
 #### Newest Segment First Policy
 
-This policy searches the segments of _all dataSources_ in inverse order of their intervals.
-For example, let me assume there are 3 dataSources (`ds1`, `ds2`, `ds3`) and 5 segments (`seg_ds1_2017-10-01_2017-10-02`, `seg_ds1_2017-11-01_2017-11-02`, `seg_ds2_2017-08-01_2017-08-02`, `seg_ds3_2017-07-01_2017-07-02`, `seg_ds3_2017-12-01_2017-12-02`) for those dataSources.
-The segment name indicates its dataSource and interval. The search result of newestSegmentFirstPolicy is [`seg_ds3_2017-12-01_2017-12-02`, `seg_ds1_2017-11-01_2017-11-02`, `seg_ds1_2017-10-01_2017-10-02`, `seg_ds2_2017-08-01_2017-08-02`, `seg_ds3_2017-07-01_2017-07-02`].
+At every coordinator run, this policy searches for segments to compact by iterating over segments from the latest to the oldest.
+Once it finds the latest segment among all dataSources, it checks whether that segment is _compactible_ with other segments of the same dataSource that have the same or abutting intervals.
+Note that segments are compactible only if their total size is smaller than or equal to the configured `inputSegmentSizeBytes`.
 
-Every run, this policy starts searching from the (very latest interval - [skipOffsetFromLatest](../configuration/index.html#compaction-dynamic-configuration)).
-This is to handle the late segments ingested to realtime dataSources.
+Here are some details with an example. Assume there are two dataSources (`foo`, `bar`)
+and 5 segments (`foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`, `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION`, `bar_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_VERSION`, `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`, `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`).
+If each segment is 10 MB and `inputSegmentSizeBytes` is 20 MB, this policy first returns two segments (`foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION`) to compact together, because
+`foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION` is the latest segment and `foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` abuts it.
+
+If the coordinator has enough task slots for compaction, this policy continues searching and next returns
+`bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`.
+Note that `bar_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_VERSION` is not included even though it abuts `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`,
+because including it would push the total segment size to compact above `inputSegmentSizeBytes`.
+
+The search start point can be changed by setting [skipOffsetFromLatest](../configuration/index.html#compaction-dynamic-configuration).
+If this is set, the policy ignores any segments falling within the interval that extends `skipOffsetFromLatest` back from the end time of the very latest segment.
+This is to avoid conflicts between compaction tasks and realtime tasks.
+Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks whose intervals overlap theirs, terminating those compaction tasks.
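+For example (dates are illustrative), if the latest segment ends at `2017-12-02` and `skipOffsetFromLatest` is `P1D`,
+any segment overlapping the interval `2017-12-01/2017-12-02` is ignored by the search.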
 
 <div class="note caution">
 This policy currently cannot handle the situation when there are a lot of small segments which have the same interval,
-and their total size exceeds <a href="../configuration/index.html#compaction-dynamic-configuration">targetCompactionSizebytes</a>.
-If it finds such segments, it simply skips compacting them.
+and their total size exceeds <a href="../configuration/index.html#compaction-dynamic-configuration">inputSegmentSizeBytes</a>.
+If it finds such segments, it simply skips them.
 </div>
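+
+For example (illustrative numbers), 30 segments of 100 MB each in a single time chunk total 3 GB,
+well above the default `inputSegmentSizeBytes` of 419430400 bytes (400 MB), so this policy would skip that time chunk.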
 
 ### The Coordinator Console

