You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/03/24 05:54:30 UTC

[GitHub] [druid] maytasm opened a new pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are old

maytasm opened a new pull request #11025:
URL: https://github.com/apache/druid/pull/11025


   Add an option for ingestion task to drop (mark unused) segments that are old
   
   ### Description
   
   Will write more details and use cases here tomorrow morning.
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
clintropolis commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812217423


   Perhaps the mechanism in #10676 could be adapted to be used to just block until segments are loaded before marking them unused? (I haven't looked closely at either of these PRs myself)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812199543


   Hmm, it's important that the old segments not be marked unused until after the new segments are loaded. The reason is that when the old segments are marked unused, they'll be dropped from historicals on the next coordinator run. The new segments might not be loaded by then. It'll lead to a period of data unavailability. Does this PR handle that case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812223044


   Probably #10676 doesn't help much for this problem because this PR will mark old segments as unused as a part of publishing segments.
   
   > Hmm..if as part of the ingestion task we insert metadata-only (fake) empty segment and trick the coordinator in thinking that it is already loaded (so that we don't try to load it). Then the coordinator will only drop the old segments when the other new (real) segments are loaded. I guess we also have to make the query path skip these fake segments. Do you think this is about the right track? or is it over-complicating the solution?
   
   Publishing empty segments sounds like an easy fix. I think it will be less complicated than you described. We don't have to trick the coordinator, but should make it to skip loading empty segments. The coordinator will mark the segment overshadowed by empty ones as unused no matter whether they are loaded in historicals. The broker will not be aware of empty segments since historicals won't load nor announce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605235412



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 

Review comment:
       ```suggestion
     start and end within your `granularitySpec`'s intervals.  This applies whether or not the new data covers all existing segments. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 

Review comment:
       ```suggestion
   `dropExisting` only applies when `appendToExisting` is false and the  `granularitySpec` contains an `interval`. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true

Review comment:
       ```suggestion
     The following examples demonstrate when to set the `dropExisting` property to true in the `ioConfig`:
   ```

##########
File path: docs/ingestion/compaction.md
##########
@@ -52,7 +52,7 @@ In cases where you require more control over compaction, you can manually submit
 See [Setting up a manual compaction task](#setting-up-manual-compaction) for more about manual compaction tasks.
 
 ## Data handling with compaction
-During compaction, Druid overwrites the original set of segments with the compacted set. During compaction Druid locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
+During compaction, Druid overwrites the original set of segments with the compacted set. During compaction Druid locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes. Note that compaction task automatically set `dropExisting` flag of the underlying ingestion task to true. This means that compaction task would drop (mark unused) all existing segments that are fully contain by the `interval` in the compaction task. This is to handle when compaction task changes segmentGranularity of the existing data to a finer segmentGranularity and the set of new segments (with the new segmentGranularity) does not fully cover the original croaser granularity time interval (as t
 here may not be data in every time chunk of the new finer segmentGranularity). 

Review comment:
       ```suggestion
   During compaction, Druid overwrites the original set of segments with the compacted set. Druid also locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
   
   For compaction tasks, `dropExisting` for underlying ingestion tasks is "true". This means that Druid can drop or mark unused all the un-compacted segments fully within interval for the compaction task. For an example of why this is important, see the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations). 
   ```
   I think it is better not to clutter this section with an example, especially if you can't change the value. The customer doesn't need to figure out how to set it another way. If they want to understand, they can read the example in `native-batch.md`. I had to add the header in because the recommendations don't relate to the compression header.

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed

Review comment:
       ```suggestion
     - Example 2: Consider the case where you want to re-ingest or overwrite a datasource and the new data does not contains some time intervals that exist
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:

Review comment:
       ```suggestion
      Unless you set `dropExisting` to true, the result after ingestion with overwrite using the same MONTH segmentGranularity would be:
      January: 1 record
      February: 10 records
      March: 9 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting `dropExisting` flag to true, we can drop the original

Review comment:
       ```suggestion
    
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 

Review comment:
       ```suggestion
      This is incorrect since the new data has 0 records for January. Setting `dropExisting` to true to drop the original
      segment for Janurary that is not needed since the newly ingested data has no records for January.
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 

Review comment:
       ```suggestion
     overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data using the finer segmentGranularity of MONTH. 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 

Review comment:
       ```suggestion
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.

Review comment:
       ```suggestion
     You want to re-ingest and overwrite with new data as follows:
     January: 0 records
     February: 10 records
     March: 9 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 

Review comment:
       ```suggestion
    Druid cannot drop the original YEAR segment even if it does include all the replacement. Set `dropExisting` to true in this case to drop the original segment at year `segmentgGranularity` since you no longer need it.
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01

Review comment:
       ```suggestion
     If the replacement data does not have a record within every months from 2020-01-01 to 2021-01-01
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 

Review comment:
       ```suggestion
      in the datasource. For example, a datasource contains the following data at MONTH segmentGranularity:
      January: 1 record
      February: 10 records
      March: 10 records
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.

Review comment:
       ```suggestion
   
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +213,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|

Review comment:
       ```suggestion
   |dropExisting|If `true` and `appendToExisting` is `false` and the `granularitySpec` contains an`interval`, then the ingestion task drops (mark unused) all existing segments fully contained by the specified `interval` when the task publishes new segments. If ingestion fails, Druid does not drop or mark unused any segments. In the case of misconfiguration where either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec`, Druid does not drop any segments even if `dropExisting` is `true`.|false|no|
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 

Review comment:
       ```suggestion
     - Example 1: Consider an existing segment with an interval of 2020-01-01 to 2021-01-01 and YEAR segmentGranularity. You want to 
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 

Review comment:
       ```suggestion
    
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -719,6 +741,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be "index".|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|

Review comment:
       same as line 216

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting `dropExisting` flag to true, we can drop the original
+   segment of Janurary which is no longer needed (as new ingested data does not have any data in Janurary).

Review comment:
       ```suggestion
     
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605356505



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605156580



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/actions/SegmentTransactionalInsertAction.java
##########
@@ -301,11 +317,12 @@ public boolean isAudited()
   public String toString()
   {
     return "SegmentTransactionalInsertAction{" +
-           "segmentsToBeOverwritten=" + SegmentUtils.commaSeparatedIdentifiers(segmentsToBeOverwritten) +
-           ", segments=" + SegmentUtils.commaSeparatedIdentifiers(segments) +
+           "segmentsToBeOverwritten=" + segmentsToBeOverwritten +

Review comment:
       Yep. Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357590



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605184280



##########
File path: server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java
##########
@@ -108,7 +110,7 @@ public IndexerSQLMetadataStorageCoordinator(
     this.connector = connector;
   }
 
-  enum DataSourceMetadataUpdateResult
+  enum DataStoreMetadataUpdateResult

Review comment:
       Previously, in IndexerSQLMetadataStorageCoordinator we only update the datasource table, hence the result enum is named DataSourceMetadataUpdateResult. However, now the IndexerSQLMetadataStorageCoordinator have methods to update both the datasource table and the segment table in metadata store. Instead of having two enum which are pretty much the same, I just rename this enum so that it can be use for handling the result of updating any/both tables in metadata store...no matter which tables. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357227



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812223044


   Probably #10676 doesn't help much for this problem because this PR will mark old segments as unused as a part of publishing segments.
   
   > Hmm..if as part of the ingestion task we insert metadata-only (fake) empty segment and trick the coordinator in thinking that it is already loaded (so that we don't try to load it). Then the coordinator will only drop the old segments when the other new (real) segments are loaded. I guess we also have to make the query path skip these fake segments. Do you think this is about the right track? or is it over-complicating the solution?
   
   Publishing empty segments sounds like an easy fix. I think it will be less complicated than you described. We don't have to trick the coordinator, but should make it to skip loading empty segments. The coordinator will mark the segments as unused if they are overshadowed by empty ones no matter whether empty ones are loaded in historicals. The broker will not be aware of empty segments since historicals won't load nor announce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357747



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605356602



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] zachjsh commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
zachjsh commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r603783890



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,8 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+  You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing data 

Review comment:
       Nit: You mentioned that is only occurs if a few other configs are set as well, appendToExisting = false, and granularitySpec set, I believe. Should we mention that here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605184280



##########
File path: server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java
##########
@@ -108,7 +110,7 @@ public IndexerSQLMetadataStorageCoordinator(
     this.connector = connector;
   }
 
-  enum DataSourceMetadataUpdateResult
+  enum DataStoreMetadataUpdateResult

Review comment:
       Previously, in IndexerSQLMetadataStorageCoordinator we only update the datasource table, hence the result enum is named DataSourceMetadataUpdateResult. However, now the IndexerSQLMetadataStorageCoordinator have methods to update both the datasource table and the segment table in metadata store. Instead of having two enum which are pretty much the same, I just rename this enum so that it can be use for handling the result of updating the metadata store...no matter which tables. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm edited a comment on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm edited a comment on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812201239


   > Hmm, it's important that the old segments not be marked unused until after the new segments are loaded. The reason is that when the old segments are marked unused, they'll be dropped from historicals on the next coordinator run. The new segments might not be loaded by then. It'll lead to a period of data unavailability. Does this PR handle that case?
   
   In a Coordinator run, the RunRules happen before the UnloadUnusedSegments. Hence, we should be loading the new segments before dropping the existing segments. Hmm...or is the worry if the loading fails?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] zachjsh commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
zachjsh commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605260806



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
##########
@@ -490,6 +491,27 @@ public static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConf
     }
   }
 
+  public static Set<DataSegment> getUsedSegmentsWithinInterval(
+      TaskToolbox toolbox,
+      String dataSource,
+      List<Interval> intervals
+  ) throws IOException
+  {
+    Set<DataSegment> segmentsFoundForDrop = new HashSet<>();
+    List<Interval> condensedIntervals = JodaUtils.condenseIntervals(intervals);
+    if (!intervals.isEmpty()) {
+      Collection<DataSegment> usedSegment = toolbox.getTaskActionClient().submit(new RetrieveUsedSegmentsAction(dataSource, null, condensedIntervals, Segments.ONLY_VISIBLE));
+      for (DataSegment segment : usedSegment) {
+        for (Interval interval : condensedIntervals) {
+          if (interval.contains(segment.getInterval())) {

Review comment:
       oh I see ok, makes sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605182230



##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +195,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` set to false and `interval` specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contain by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be drop (mark unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be drop even if `dropExisting` is set to `true`.|false|no|

Review comment:
       Done

##########
File path: docs/ingestion/native-batch.md
##########
@@ -719,6 +723,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be "index".|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` set to false and `interval` specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contain by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be drop (mark unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be drop even if `dropExisting` is set to `true`.|false|no|

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812203994


   > In a Coordinator run, the RunRules happen before the UnloadUnusedSegments. Hence, we should be loading the new segments before dropping the existing segments.
   
   I'm not sure that it's guaranteed that this will happen. Some scenarios that come to mind:
   
   - It's possible the historical load queues are maxed out (they are at maxSegmentsInNodeLoadingQueue) and so the coordinator may not issue load commands for all of the new segments.
   - I think segment loading is asynchronous, so even if all load commands are issued, they might not have finished executing by the time UnloadUnusedSegments runs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei merged pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jon-wei merged pull request #11025:
URL: https://github.com/apache/druid/pull/11025


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605355879



##########
File path: docs/ingestion/compaction.md
##########
@@ -52,7 +52,7 @@ In cases where you require more control over compaction, you can manually submit
 See [Setting up a manual compaction task](#setting-up-manual-compaction) for more about manual compaction tasks.
 
 ## Data handling with compaction
-During compaction, Druid overwrites the original set of segments with the compacted set. During compaction Druid locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
+During compaction, Druid overwrites the original set of segments with the compacted set. During compaction Druid locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes. Note that compaction task automatically set `dropExisting` flag of the underlying ingestion task to true. This means that compaction task would drop (mark unused) all existing segments that are fully contain by the `interval` in the compaction task. This is to handle when compaction task changes segmentGranularity of the existing data to a finer segmentGranularity and the set of new segments (with the new segmentGranularity) does not fully cover the original croaser granularity time interval (as t
 here may not be data in every time chunk of the new finer segmentGranularity). 

Review comment:
       LGTM




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] zachjsh commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
zachjsh commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r603786164



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
##########
@@ -490,6 +491,27 @@ public static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConf
     }
   }
 
+  public static Set<DataSegment> getUsedSegmentsWithinInterval(
+      TaskToolbox toolbox,
+      String dataSource,
+      List<Interval> intervals
+  ) throws IOException
+  {
+    Set<DataSegment> segmentsFoundForDrop = new HashSet<>();
+    List<Interval> condensedIntervals = JodaUtils.condenseIntervals(intervals);
+    if (!intervals.isEmpty()) {
+      Collection<DataSegment> usedSegment = toolbox.getTaskActionClient().submit(new RetrieveUsedSegmentsAction(dataSource, null, condensedIntervals, Segments.ONLY_VISIBLE));
+      for (DataSegment segment : usedSegment) {
+        for (Interval interval : condensedIntervals) {
+          if (interval.contains(segment.getInterval())) {

Review comment:
       I would think that the segments returned from the RetrieveUsedSegmentsAction are gaurenteed to be within the interval specified, is that not the case? If so, do we need to check again that the segment is in the interval?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357293



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605154810



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
##########
@@ -490,6 +491,27 @@ public static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConf
     }
   }
 
+  public static Set<DataSegment> getUsedSegmentsWithinInterval(
+      TaskToolbox toolbox,
+      String dataSource,
+      List<Interval> intervals
+  ) throws IOException
+  {
+    Set<DataSegment> segmentsFoundForDrop = new HashSet<>();
+    List<Interval> condensedIntervals = JodaUtils.condenseIntervals(intervals);
+    if (!intervals.isEmpty()) {
+      Collection<DataSegment> usedSegment = toolbox.getTaskActionClient().submit(new RetrieveUsedSegmentsAction(dataSource, null, condensedIntervals, Segments.ONLY_VISIBLE));
+      for (DataSegment segment : usedSegment) {
+        for (Interval interval : condensedIntervals) {
+          if (interval.contains(segment.getInterval())) {

Review comment:
       That is not the case. For example, if you have a segment with interval 2000-01-01/2001-01-01 and your interval specified is 2000-04-28/2000-04-29. Then if the above segment has data for the day 2000-04-28/2000-04-29, it would be returned by RetrieveUsedSegmentsAction. However, we cannot drop this segment since the interval specified is 2000-04-28/2000-04-29. We can only drop segments that starts and ends within the interval specified. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812201239


   > Hmm, it's important that the old segments not be marked unused until after the new segments are loaded. The reason is that when the old segments are marked unused, they'll be dropped from historicals on the next coordinator run. The new segments might not be loaded by then. It'll lead to a period of data unavailability. Does this PR handle that case?
   
   In a Coordinator run, the RunRules happen before the UnloadUnusedSegments. Hence, we should be loading the new segments before dropping the existing segments. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605196344



##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +195,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` set to false and `interval` specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contain by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be drop (mark unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be drop even if `dropExisting` is set to `true`.|false|no|

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm edited a comment on pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm edited a comment on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-811629492


    Thanks for the review @jon-wei and @techdocsmith. I have incorporated @techdocsmith doc suggestions. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605356846



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605358137



##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +213,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|

Review comment:
       LGTM. Done
   
   

##########
File path: docs/ingestion/native-batch.md
##########
@@ -719,6 +741,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be "index".|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|

Review comment:
       LGTM. Done
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605190128



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/actions/SegmentTransactionalInsertAction.java
##########
@@ -63,13 +62,16 @@
   private final DataSourceMetadata endMetadata;
   @Nullable
   private final String dataSource;
+  @Nullable
+  private final Set<DataSegment> segmentsToBeDropped;

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605356704



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jon-wei commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605324756



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
##########
@@ -490,6 +491,27 @@ public static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConf
     }
   }
 
+  public static Set<DataSegment> getUsedSegmentsWithinInterval(
+      TaskToolbox toolbox,
+      String dataSource,
+      List<Interval> intervals
+  ) throws IOException
+  {
+    Set<DataSegment> segmentsFoundForDrop = new HashSet<>();
+    List<Interval> condensedIntervals = JodaUtils.condenseIntervals(intervals);
+    if (!intervals.isEmpty()) {
+      Collection<DataSegment> usedSegment = toolbox.getTaskActionClient().submit(new RetrieveUsedSegmentsAction(dataSource, null, condensedIntervals, Segments.ONLY_VISIBLE));
+      for (DataSegment segment : usedSegment) {
+        for (Interval interval : condensedIntervals) {
+          if (interval.contains(segment.getInterval())) {
+            segmentsFoundForDrop.add(segment);

Review comment:
       Ah yes, you're right, should be a `break`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357985



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting `dropExisting` flag to true, we can drop the original
+   segment of Janurary which is no longer needed (as new ingested data does not have any data in Janurary).

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605158227



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
##########
@@ -490,6 +491,27 @@ public static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConf
     }
   }
 
+  public static Set<DataSegment> getUsedSegmentsWithinInterval(
+      TaskToolbox toolbox,
+      String dataSource,
+      List<Interval> intervals
+  ) throws IOException
+  {
+    Set<DataSegment> segmentsFoundForDrop = new HashSet<>();
+    List<Interval> condensedIntervals = JodaUtils.condenseIntervals(intervals);
+    if (!intervals.isEmpty()) {
+      Collection<DataSegment> usedSegment = toolbox.getTaskActionClient().submit(new RetrieveUsedSegmentsAction(dataSource, null, condensedIntervals, Segments.ONLY_VISIBLE));
+      for (DataSegment segment : usedSegment) {
+        for (Interval interval : condensedIntervals) {
+          if (interval.contains(segment.getInterval())) {
+            segmentsFoundForDrop.add(segment);

Review comment:
       Good idea. Btw I think it should be a `break` instead of a `continue`. This is to break out of the `for (Interval interval : condensedIntervals) {` when we confirmed that the segment is within one of the `interval` specified in `granularitySpec`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812259294


   > I meant, that might not be what users expect. I think they will want to know when they can query the new data without seeing old one. Maybe this can be addressed using #10676.
   > 
   > > One problem might be auto compaction as it would try to compact the same interval again.
   > 
   > Can you elaborate more on this problem? I'm not sure why it would compact the same interval again.
   
   Ah I see. For compaction task, the lag in dropping old segments would not be a problem for querying. For new data without seeing old data, I agree with you that it can be addressed using mechanism in https://github.com/apache/druid/pull/10676.
   
   The problem with auto compaction is when we change segment granularity to a smaller granularity (for example MONTH to DAY) and we do not have data in all time chunk of the new granularity (for example, we only have a DAY of data in the MONTH interval). In this case, if the old segment is not marked unused by the time the auto compaction runs in the next cycle, the auto compaction would see the old segment as needing compaction and try to compact the same interval again (auto compaction would see both the new DAYS segments and old MONTH segments). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812212391


   > > In a Coordinator run, the RunRules happen before the UnloadUnusedSegments. Hence, we should be loading the new segments before dropping the existing segments.
   > 
   > I'm not sure that it's guaranteed that this will happen. Some scenarios that come to mind:
   > 
   > * It's possible the historical load queues are maxed out (they are at maxSegmentsInNodeLoadingQueue) and so the coordinator may not issue load commands for all of the new segments.
   > * I think segment loading is asynchronous, so even if all load commands are issued, they might not have finished executing by the time UnloadUnusedSegments runs.
   
   You're right. Seems like there is no easy way of marking old segments as used=false before the new segments are loaded. Changing logic in loading/dropping will be too complicated. Maybe we should make it generate empty segments with new granularity for all the missing gaps. And let the loading/dropping handle it as normal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605180667



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,8 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+  You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing data 

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357874



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 
+   and the user would expect to see that Jan has 0 record. By setting `dropExisting` flag to true, we can drop the original

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357828



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.
+   Without setting `dropExisting` to true, the result after ingestion with overwrite (using the same MONTH segmentGranularity) would be:
+   `Jan has 1 record, Feb has 10 records, Mar has 9 records`. However, this is incorrect as the new data has 0 record for Jan 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm edited a comment on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm edited a comment on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812259294


   > I meant, that might not be what users expect. I think they will want to know when they can query the new data without seeing old one. Maybe this can be addressed using #10676.
   > 
   > > One problem might be auto compaction as it would try to compact the same interval again.
   > 
   > Can you elaborate more on this problem? I'm not sure why it would compact the same interval again.
   
   Ah I see. For compaction task, the lag in dropping old segments would not be a problem for querying. For new data without seeing old data, I agree with you that it can be addressed using mechanism in https://github.com/apache/druid/pull/10676.
   
   I was mistaken. We will not have any problem in auto compaction with the fake empty segment. As the fake empty segment + the new real segments would fully overshadow the old segment. Marking old segments as unused is enough for auto compaction to not run on the same interval again even if old segments are drop later 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605356929



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-811629492


   @jon-wei Thanks for the review. I have incorporated @techdocsmith doc suggestions. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605356279



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 

Review comment:
       LGTM. Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812217227


   > I just wish there was a way to do it without really generating physical empty segments. Maybe empty segments that are somehow metadata-only?
   
   Hmm..if as part of the ingestion task we insert metadata-only (fake) empty segment and trick the coordinator in thinking that it is already loaded (so that we don't try to load it). Then the coordinator will only drop the old segments when the other new (real) segments are loaded. I guess we also have to make the query path skip these fake segments. Do you think this is about the right track? or is it over-complicating the solution?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812232570


   I meant, that might not be what users expect. I think they will want to know when they can query the new data without seeing old one. Maybe this can be addressed using #10676.
   
   >  One problem might be auto compaction as it would try to compact the same interval again.
   
   Can you elaborate more on this problem? I'm not sure why it would compact the same interval again. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jon-wei commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r603768694



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/actions/SegmentTransactionalInsertAction.java
##########
@@ -301,11 +317,12 @@ public boolean isAudited()
   public String toString()
   {
     return "SegmentTransactionalInsertAction{" +
-           "segmentsToBeOverwritten=" + SegmentUtils.commaSeparatedIdentifiers(segmentsToBeOverwritten) +
-           ", segments=" + SegmentUtils.commaSeparatedIdentifiers(segments) +
+           "segmentsToBeOverwritten=" + segmentsToBeOverwritten +

Review comment:
       Should this preserve the `SegmentUtils.commaSeparatedIdentifiers(` and apply it to `segmentsToBeDropped` as well?

##########
File path: docs/ingestion/native-batch.md
##########
@@ -719,6 +723,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be "index".|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` set to false and `interval` specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contain by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be drop (mark unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be drop even if `dropExisting` is set to `true`.|false|no|

Review comment:
       ```suggestion
   |dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|
   ```

##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/actions/SegmentTransactionalInsertAction.java
##########
@@ -63,13 +62,16 @@
   private final DataSourceMetadata endMetadata;
   @Nullable
   private final String dataSource;
+  @Nullable
+  private final Set<DataSegment> segmentsToBeDropped;

Review comment:
       Can you add some javadocs to this class that explain the difference between `segmentsToBeOverwritten` and `segmentsToBeDropped`?

##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,8 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+  You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing data 

Review comment:
       Suggest adding an example where the user would want to enable it, such as the YEAR/MONTH example in the PR description

##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +195,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` set to false and `interval` specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contain by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be drop (mark unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be drop even if `dropExisting` is set to `true`.|false|no|

Review comment:
       ```suggestion
   |dropExisting|If set to true (and `appendToExisting` is set to false and `interval` is specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contained by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be dropped (marked unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be dropped even if `dropExisting` is set to `true`.|false|no|
   ```

##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
##########
@@ -490,6 +491,27 @@ public static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConf
     }
   }
 
+  public static Set<DataSegment> getUsedSegmentsWithinInterval(
+      TaskToolbox toolbox,
+      String dataSource,
+      List<Interval> intervals
+  ) throws IOException
+  {
+    Set<DataSegment> segmentsFoundForDrop = new HashSet<>();
+    List<Interval> condensedIntervals = JodaUtils.condenseIntervals(intervals);
+    if (!intervals.isEmpty()) {
+      Collection<DataSegment> usedSegment = toolbox.getTaskActionClient().submit(new RetrieveUsedSegmentsAction(dataSource, null, condensedIntervals, Segments.ONLY_VISIBLE));
+      for (DataSegment segment : usedSegment) {
+        for (Interval interval : condensedIntervals) {
+          if (interval.contains(segment.getInterval())) {
+            segmentsFoundForDrop.add(segment);

Review comment:
       Could add a `continue` here once a segment is found for efficiency

##########
File path: server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java
##########
@@ -108,7 +110,7 @@ public IndexerSQLMetadataStorageCoordinator(
     this.connector = connector;
   }
 
-  enum DataSourceMetadataUpdateResult
+  enum DataStoreMetadataUpdateResult

Review comment:
       What's the reasoning behind the rename?

##########
File path: docs/ingestion/native-batch.md
##########
@@ -193,6 +195,7 @@ that range if there's some stray data with unexpected timestamps.
 |type|The task type, this should always be `index_parallel`.|none|yes|
 |inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes|
 |appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no|
+|dropExisting|If set to true (and `appendToExisting` set to false and `interval` specified in `granularitySpec`), then the ingestion task would drop (mark unused) all existing segments that are fully contain by the `interval` in the `granularitySpec` when the task publishes new segments (no segments would be drop (mark unused) if the ingestion fails). Note that if either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec` then no segments would be drop even if `dropExisting` is set to `true`.|false|no|

Review comment:
       Suggest noting somewhere that compaction tasks will have this enabled




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812213778


   I just wish there was a way to do it without really generating physical empty segments. Maybe empty segments that are somehow metadata-only?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812232570


   I meant, that might not be what users expect. I think they will want to know when they can query the new data without seeing old one. Maybe this can be addressed using #10676.
   
   >  One problem might be auto compaction as it would try to compact the same interval again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812230337


   > A problem I can think of is that deleting old segments won't happen at once. Instead, the atomicity of deletion will be guaranteed inside each time chunk.
   
   Why is this a problem? Old segments being available should not be a problem as long as it is eventually drop. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605152823



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,8 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+  You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing data 

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357370



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812224199


   A problem I can think of is that deleting old segments won't happen at once. Instead, the atomicity of deletion will be guaranteed inside each time chunk.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm edited a comment on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm edited a comment on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812230337


   > A problem I can think of is that deleting old segments won't happen at once. Instead, the atomicity of deletion will be guaranteed inside each time chunk.
   
   Why is this a problem? Old segments being available should not be a problem as long as it is eventually drop. One problem might be auto compaction as it would try to compact the same interval again. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #11025: Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #11025:
URL: https://github.com/apache/druid/pull/11025#issuecomment-812221416


   > Perhaps the mechanism in #10676 could be adapted to be used to just block until segments are loaded before marking them unused? (I haven't looked closely at either of these PRs myself)
   
   Currently, the functionality of marking existing segment as unused is done transactionally with publishing segments. This means that the job fails if either marking unused or publishing fail (we don't mark unused if publish fails and don't publish if marking unused fails). I think if we go with https://github.com/apache/druid/pull/10676 we have to make marking segment unused as a best-effort. We would have to publish the new segments first, then wait for the loading of new segment (by mechanism in https://github.com/apache/druid/pull/10676), then only if that succeed, we can mark the existing segments as unused. However, the mechanism in https://github.com/apache/druid/pull/10676 can timeout. This means that we cannot mark existing segments as unused since it is not confirmed if new segments is loaded yet or not. In those cases, task would still succeed and new segments would be punished but marking existing segments unused would fails.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357502



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605356789



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605155633



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractBatchIndexTask.java
##########
@@ -490,6 +491,27 @@ public static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConf
     }
   }
 
+  public static Set<DataSegment> getUsedSegmentsWithinInterval(
+      TaskToolbox toolbox,
+      String dataSource,
+      List<Interval> intervals
+  ) throws IOException
+  {
+    Set<DataSegment> segmentsFoundForDrop = new HashSet<>();
+    List<Interval> condensedIntervals = JodaUtils.condenseIntervals(intervals);
+    if (!intervals.isEmpty()) {
+      Collection<DataSegment> usedSegment = toolbox.getTaskActionClient().submit(new RetrieveUsedSegmentsAction(dataSource, null, condensedIntervals, Segments.ONLY_VISIBLE));
+      for (DataSegment segment : usedSegment) {
+        for (Interval interval : condensedIntervals) {
+          if (interval.contains(segment.getInterval())) {

Review comment:
       Actually let me make the doc a little clearer than we are only dropping segments that starts and ends within the `interval` specified in `granularitySpec`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #11025: Add an option for ingestion task to drop (mark unused) segments that are of the interval in the ingestionSpec

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #11025:
URL: https://github.com/apache/druid/pull/11025#discussion_r605357684



##########
File path: docs/ingestion/native-batch.md
##########
@@ -89,6 +89,26 @@ You may want to consider the below things:
   data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
   no data written by this task, they will be left alone. If any existing segments partially overlap with the
   `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
+- You can set `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that 
+  start and end within your `granularitySpec`'s intervals, regardless of if new data are in existing segments or not 
+  (this is only applicable if `appendToExisting` is set to false and `interval` specified in `granularitySpec`). 
+  
+  Here are some examples on when to set `dropExisting` flag in the `ioConfig` to true
+  
+  - Example 1: Existing segment has a interval of 2020-01-01 to 2021-01-01 (YEAR segmentGranularity) and we are trying to 
+  overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data in smaller segmentGranularity, MONTH. 
+  If new data we are ingesting does not have data in all 12 months from 2020-01-01 to 2021-01-01
+  (even if it does have data in every month of the existing data), then this would then prevent the original YEAR segment 
+  from being dropped. By setting `dropExisting` flag to true, we can drop the original 2020-01-01 to 2021-01-01 
+  (YEAR segmentGranularity) segment, which is no longer needed.
+  - Example 2: Re-ingesting/overwriting a datasource and the new data does not contains time intervals that already existed
+   in the datasource. For example, if a user has the following MONTH segmentGranularity data: `Jan has 1 record, Feb has 10 records, Mar has 10 records` 
+   in the datasource. Now the user is trying to re-ingest with new data that overwrites all the existing data. 
+   The new data has the following data for each month: `Jan has 0 record, Feb has 10 records, Mar has 9 records`.

Review comment:
       LGTM. Done
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org