Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/03/31 21:59:31 UTC

[GitHub] [druid] techdocsmith commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

techdocsmith commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605235412



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 

Review comment:
       ```suggestion
     start and end within your `granularitySpec`'s intervals.  This applies whether or not the new data covers all existing segments. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 

Review comment:
       ```suggestion
   `dropExisting` only applies when `appendToExisting` is false and the  `granularitySpec` contains an `interval`. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true

Review comment:
       ```suggestion
     The following examples demonstrate when to set the `dropExisting` property to true in the `ioConfig`:
   ```

##########
File path: docs/ingestion/compaction.md
##########
@@ -52,7 +52,7 @@ In cases where you require more control over compaction, you can manually submit
 See [Setting up a manual compaction task](#setting-up-manual-compaction) for more about manual compaction tasks.
 
 ## Data handling with compaction
-During compaction, Druid overwrites the original set of segments with the compacted set. During compaction Druid locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
+During compaction, Druid overwrites the original set of segments with the compacted set. During compaction Druid locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes. Note that compaction task automatically set `dropExisting` flag of the underlying ingestion task to true. This means that compaction task would drop (mark unused) all existing segments that are fully contain by the `interval` in the compaction task. This is to handle when compaction task changes segmentGranularity of the existing data to a finer segmentGranularity and the set of new segments (with the new segmentGranularity) does not fully cover the original croaser granularity time interval (as there may not be data in every time chunk of the new finer segmentGranularity). 

Review comment:
       ```suggestion
   During compaction, Druid overwrites the original set of segments with the compacted set. Druid also locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
   
   For compaction tasks, `dropExisting` for underlying ingestion tasks is "true". This means that Druid can drop or mark unused all the un-compacted segments fully within the interval for the compaction task. For an example of why this is important, see the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations). 
   ```
   I think it is better not to clutter this section with an example, especially if you can't change the value. The customer doesn't need to figure out how to set it another way. If they want to understand, they can read the example in `native-batch.md`. I had to add the header in because the recommendations don't relate to the compression header.

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed

Review comment:
       ```suggestion
     - Example 2: Consider the case where you want to re-ingest or overwrite a datasource and the new data does not contain some time intervals that exist
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:

Review comment:
       ```suggestion
      Unless you set `dropExisting` to true, the result after ingestion with overwrite using the same MONTH segmentGranularity would be:
      January: 1 record
      February: 10 records
      March: 9 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting `dropExisting` flag to true, we can drop the original

Review comment:
       ```suggestion
    
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 

Review comment:
       ```suggestion
      This is incorrect since the new data has 0 records for January. Set `dropExisting` to true to drop the original
      segment for January, which is not needed since the newly ingested data has no records for January.
   ```
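
To make the Example 2 numbers easy to check, here is a minimal Python sketch of the behavior described in the suggestions above. It is a toy model, not Druid code: each month is treated as a single segment, and the `overwrite` function and its parameters are hypothetical names made up for illustration.

```python
# Toy model of Example 2 (not Druid code): month -> record count.
existing = {"January": 1, "February": 10, "March": 10}
new_data = {"February": 10, "March": 9}  # the new data has 0 records for January


def overwrite(existing, new_data, interval_months, drop_existing):
    """Return the visible month -> record count after an overwrite (hypothetical helper)."""
    result = dict(existing)
    if drop_existing:
        # Drop (mark unused) every existing segment that lies within the interval.
        for month in interval_months:
            result.pop(month, None)
    # New segments replace existing ones only where the task actually writes data.
    result.update(new_data)
    return result


interval = ["January", "February", "March"]
without_drop = overwrite(existing, new_data, interval, drop_existing=False)
with_drop = overwrite(existing, new_data, interval, drop_existing=True)
print(without_drop)
print(with_drop)
```

Under these assumptions, January keeps its stale record unless `drop_existing` is true, which matches the result the example calls incorrect.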

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 

Review comment:
       ```suggestion
     overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data using the finer segmentGranularity of MONTH. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 

Review comment:
       ```suggestion
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.

Review comment:
       ```suggestion
     You want to re-ingest and overwrite with new data as follows:
     January: 0 records
     February: 10 records
     March: 9 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 

Review comment:
       ```suggestion
    Druid cannot drop the original YEAR segment even if it does include all the replacement data. Set `dropExisting` to true in this case to drop the original segment at YEAR `segmentGranularity` since you no longer need it.
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01

Review comment:
       ```suggestion
     If the replacement data does not have a record within every month from 2020-01-01 to 2021-01-01
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 

Review comment:
       ```suggestion
      in the datasource. For example, a datasource contains the following data at MONTH segmentGranularity:
      January: 1 record
      February: 10 records
      March: 10 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.

Review comment:
       ```suggestion
   
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +213,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|

Review comment:
       ```suggestion
   |dropExisting|If `true` and `appendToExisting` is `false` and the `granularitySpec` contains an `interval`, then the ingestion task drops (marks unused) all existing segments fully contained by the specified `interval` when the task publishes new segments. If ingestion fails, Druid does not drop or mark unused any segments. In the case of misconfiguration where either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec`, Druid does not drop any segments even if `dropExisting` is `true`.|false|no|
   ```
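
The three conditions in this table row can be read as a single boolean check. The sketch below is only an illustration of that reading, not how Druid implements it: `drop_existing_applies` is a hypothetical helper, the specs are plain dicts, and it assumes the `granularitySpec` key is spelled `intervals` (the row above calls it `interval`).

```python
def drop_existing_applies(io_config: dict, granularity_spec: dict) -> bool:
    """Hypothetical check: would the task drop existing segments, per the row above?"""
    return (
        bool(io_config.get("dropExisting", False))  # default is false
        and not io_config.get("appendToExisting", False)  # default is false
        and bool(granularity_spec.get("intervals"))  # assumed key name
    )


io_config = {"type": "index_parallel", "appendToExisting": False, "dropExisting": True}
granularity_spec = {"segmentGranularity": "MONTH", "intervals": ["2020-01-01/2021-01-01"]}

print(drop_existing_applies(io_config, granularity_spec))
print(drop_existing_applies({**io_config, "appendToExisting": True}, granularity_spec))
print(drop_existing_applies(io_config, {"segmentGranularity": "MONTH"}))
```

The second and third calls correspond to the two misconfiguration cases the row mentions, where no segments are dropped even though `dropExisting` is `true`.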

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 

Review comment:
       ```suggestion
     - Example 1: Consider an existing segment with an interval of 2020-01-01 to 2021-01-01 and YEAR segmentGranularity. You want to 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 

Review comment:
       ```suggestion
    
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -719,6 +741,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be "index".|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|

Review comment:
       same as line 216

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting `dropExisting` flag to true, we can drop the original
+   segment of Janurary which is no longer needed (as new ingested data does not have any data in Janurary).

Review comment:
       ```suggestion
     
   ```
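
For reference, the behavior described in Example 2 can be sketched in a few lines of Python. This is a simplified model, not Druid code: the `overwrite` and `contains` helpers are hypothetical, segments are plain `(start, end)` tuples mapped to record counts, the year 2021 is assumed because the example gives no year, and overshadowing is modeled only for a new segment with the same interval as an existing one.

```python
from datetime import date

# Hypothetical MONTH segments as half-open (start, end) intervals; the year is an assumption.
JAN = (date(2021, 1, 1), date(2021, 2, 1))
FEB = (date(2021, 2, 1), date(2021, 3, 1))
MAR = (date(2021, 3, 1), date(2021, 4, 1))


def contains(outer, inner):
    """True if `inner` starts and ends within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overwrite(existing, new, interval, drop_existing=False):
    """Return the segments visible after an overwrite (simplified model).

    existing, new: dicts mapping a segment interval to its record count.
    interval: the interval given in the granularitySpec.
    """
    visible = {}
    for segment, count in existing.items():
        if drop_existing and contains(interval, segment):
            continue  # dropped (marked unused) even though no new data covers it
        if segment in new:
            continue  # replaced by the new segment for the same interval
        visible[segment] = count
    visible.update(new)
    return visible


existing = {JAN: 1, FEB: 10, MAR: 10}
new = {FEB: 10, MAR: 9}  # the new data has no records for January
interval = (date(2021, 1, 1), date(2021, 4, 1))

without = overwrite(existing, new, interval)
with_drop = overwrite(existing, new, interval, drop_existing=True)
```

In this model `without` still contains the old January segment with 1 record, while `with_drop` contains only February and March, which matches the outcome the example describes.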



