Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/08/09 21:12:46 UTC

[GitHub] gianm opened a new issue #6136: Compaction and ingestion running simultaneously

gianm opened a new issue #6136: Compaction and ingestion running simultaneously
URL: https://github.com/apache/incubator-druid/issues/6136
 
 
   We'd like to be able to run compaction and append-oriented ingestion at the same time, for the same time chunks. "Compaction" here means an indexing task that reads from Druid segments and writes back equivalent, optimized segment(s).
   
   Right now (0.12.x) we can run these at the same time for the same datasource, as long as the time chunks are different (a different day, for example, if segment granularity is day). This is good, but it doesn't help in two cases:
   
   1. Backfills (historical data loads) done through Kafka. This causes problems, especially if the historical data arrives in no particular order, because the Kafka tasks end up publishing lots of waves of small segments. Imagine loading a month of data with segment granularity "hour": that's 720 time chunks, and each one may get lots of small segments as they all fill up simultaneously.
   2. Ingestion pipelines that have a long trickle of late data. Consider a situation where most data for a particular day comes in real time, but small amounts of late data come in over the next 30 days. If this late data comes in regularly enough, it becomes impossible to run a compaction task for this day until late data stops coming in. We have to wait 30 days, and during that time, queries can really slow down due to the potentially large number of tiny segments.
   
   Both of these are challenging to address via tuning, since when faced with such data delivery patterns, we can only do so much to create optimal segments upfront. But these could be addressed by an ability to compact segments even if other segments are being written with the same interval. This has another benefit: it suggests that we can compact partial time chunks, which means that compaction doesn't necessarily need to be distributed, even for large amounts of data.
   
   I am not sure what this should look like, but I think some things are true:
   
   - It will need to involve some changes to how the VersionedIntervalTimeline works, since it currently has no way to treat some segments within an interval/version pair as obsolete without considering them _all_ obsolete.
   - It would be nice to maintain the property that VersionedIntervalTimelines can be constructed from a collection of DataSegments, which suggests that we'll be modifying the DataSegment class somehow.
   
   Maybe something like this (not sure if it's the best design, but something that might work): add a "replaces" list to DataSegment that looks like `"replaces" : [0, 1, 2]`. This means that DataSegment replaces the others for the same interval/version pair with partition numbers 0, 1, and 2. Let's say it's partitionNum 3. So, the VersionedIntervalTimeline should return either 3 _or_ 0, 1, and 2; but never mix them. It's self-describing in the sense that once you see 3, you know to stop looking at 0, 1, and 2. It would be nice to be able to do an N -> M compaction (rather than N -> 1), but I don't think this particular design generalizes to that. Maybe that's ok.
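   
   To make that concrete, here's a rough sketch of the visibility rule (the class and method names below are made up for illustration; this is not the actual VersionedIntervalTimeline or DataSegment code): within one interval/version pair, a partition is hidden if any other partition's "replaces" list names it.
   
   ```java
   import java.util.*;
   
   // Illustrative only: PartitionRef stands in for whatever extra metadata
   // DataSegment would carry; it is not an existing Druid class.
   class PartitionRef
   {
     final int partitionNum;
     final List<Integer> replaces; // partition numbers this partition supersedes
   
     PartitionRef(int partitionNum, List<Integer> replaces)
     {
       this.partitionNum = partitionNum;
       this.replaces = replaces;
     }
   }
   
   class ReplacesResolver
   {
     // Returns the partitions that should be visible for one interval/version pair.
     static List<PartitionRef> visible(Collection<PartitionRef> all)
     {
       final Set<Integer> hidden = new HashSet<>();
       for (PartitionRef p : all) {
         hidden.addAll(p.replaces);
       }
       final List<PartitionRef> result = new ArrayList<>();
       for (PartitionRef p : all) {
         if (!hidden.contains(p.partitionNum)) {
           result.add(p);
         }
       }
       return result;
     }
   
     public static void main(String[] args)
     {
       // 0, 1, 2 came from append ingestion; 3 is the compacted segment with
       // "replaces": [0, 1, 2]; 4 was appended after the compaction started.
       List<PartitionRef> all = Arrays.asList(
           new PartitionRef(0, Collections.emptyList()),
           new PartitionRef(1, Collections.emptyList()),
           new PartitionRef(2, Collections.emptyList()),
           new PartitionRef(3, Arrays.asList(0, 1, 2)),
           new PartitionRef(4, Collections.emptyList())
       );
       for (PartitionRef p : ReplacesResolver.visible(all)) {
         System.out.println("visible: " + p.partitionNum); // prints 3 and 4
       }
     }
   }
   ```
   
   With something like this, append tasks could keep publishing new partitions (like 4 above) while a compaction task publishes a replacement for the partitions that already existed when it started.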

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org