Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/03/26 00:32:11 UTC

[GitHub] [druid] jihoonson commented on a change in pull request #10676: Allow client to configure batch ingestion task to wait to complete until segments are confirmed to be available by other

jihoonson commented on a change in pull request #10676:
URL: https://github.com/apache/druid/pull/10676#discussion_r601926994



##########
File path: indexing-service/src/test/java/org/apache/druid/indexing/common/task/CompactionTaskTest.java
##########
@@ -1447,6 +1452,7 @@ private void assertIngestionSchema(
             null,
             null,
             null,
+            null,
             null

Review comment:
       Interesting point. I think there are some things we should think about first.
   
   - It's true that compaction currently doesn't change the underlying data much, but it can make some changes, such as filtering out unnecessary dimensions or adding new metrics. You can also change the query granularity now. In the future, I can imagine that you could even transform your data using compaction with new support for transformSpec.
   - The compaction task is a bit special and differs from other batch tasks in how it publishes segments. All other batch tasks can push segments in the middle of indexing, but should publish all those segments at the end of indexing. However, the compaction task can process one time chunk at a time when there is no change in segment granularity. In this case, it can publish segments whenever it finishes processing an individual time chunk. It can also go through all time chunks even when it fails to compact some of them. The final task status will be `FAILED` when it succeeds in compacting only some time chunks but fails for others.
   - Compacting datasources is usually not a single-shot job. Rather, you would run multiple small compaction tasks over time, as in auto compaction. In that case, you would want to know which time chunks are compacted and which are not, so that you can determine what results you will get when you query certain time chunks. For compaction that is manually set up outside Druid, tracking individual compaction tasks could be useful for this purpose. However, for auto compaction, it won't provide much value since compaction tasks are submitted by the coordinator, not by users. So, we need another way, such as adding a new coordinator API that returns such compaction status.
   
   From these, we would probably want something for compaction that is similar to, but different from, the one proposed here. I would suggest doing it in a different PR.
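   
   To make the per-time-chunk behavior concrete, here is a rough sketch of the idea, not the actual `CompactionTask` code. Every name in it (`ChunkedCompactionSketch`, `run`, `compactAndPublish`, `compactedChunks`) is hypothetical and made up for illustration. It only shows that each time chunk is compacted and published on its own, that a failure in one chunk doesn't stop the others, and that the overall status is `FAILED` if any chunk failed.

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.function.Predicate;

   // Hypothetical illustration only; none of these types exist in Druid.
   final class ChunkedCompactionSketch {
       enum Status { SUCCESS, FAILED }

       /** Overall status of a run plus the outcome of each time chunk. */
       static final class Result {
           final Status overall;
           final Map<String, Status> perChunk;

           Result(Status overall, Map<String, Status> perChunk) {
               this.overall = overall;
               this.perChunk = perChunk;
           }

           /** Time chunks that were compacted and published successfully. */
           List<String> compactedChunks() {
               List<String> done = new ArrayList<>();
               for (Map.Entry<String, Status> e : perChunk.entrySet()) {
                   if (e.getValue() == Status.SUCCESS) {
                       done.add(e.getKey());
                   }
               }
               return done;
           }
       }

       /**
        * Processes each time chunk independently. compactAndPublish stands in
        * for "compact this chunk and publish its segments"; it returns false
        * (or throws) when the chunk fails.
        */
       static Result run(List<String> timeChunks, Predicate<String> compactAndPublish) {
           Map<String, Status> perChunk = new LinkedHashMap<>();
           boolean anyFailed = false;
           for (String chunk : timeChunks) {
               boolean ok;
               try {
                   ok = compactAndPublish.test(chunk);
               } catch (RuntimeException e) {
                   ok = false; // keep going; later chunks can still succeed
               }
               perChunk.put(chunk, ok ? Status.SUCCESS : Status.FAILED);
               anyFailed |= !ok;
           }
           // Partial success is still reported as FAILED for the whole task.
           return new Result(anyFailed ? Status.FAILED : Status.SUCCESS, perChunk);
       }

       public static void main(String[] args) {
           Result r = run(
               Arrays.asList("2021-03-01", "2021-03-02", "2021-03-03"),
               chunk -> !chunk.equals("2021-03-02")); // pretend one chunk fails
           System.out.println(r.overall + " " + r.compactedChunks());
       }
   }
   ```

   In this sketch a caller that only sees the overall status cannot tell which time chunks were actually compacted, which is why a per-chunk view (here the hypothetical `compactedChunks()`) or a separate status API would be needed.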



