Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/11/11 19:12:14 UTC

[GitHub] [incubator-druid] glasser opened a new issue #8853: Overlapping segments combined with load rules can lead to data loss

URL: https://github.com/apache/incubator-druid/issues/8853
 
 
   ### Affected Version
   
   0.15.1.  I believe it affects 0.16 and master but have not tested.
   
   ### Description
   
   Say you have load rules configured to only load data from `2019-02-01/P1M`, and you have a single segment on the interval `2019-01-20/2019-02-10`.  Note that this segment is loaded by historicals (see #5595) because it *overlaps* with the load rule. Historicals and brokers will in fact even serve queries for the end of January, because load rules are (as far as I understand) only used to decide which segments to load onto historicals, and don't affect how queries work later.
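
To make the overlap condition concrete, here is a minimal sketch of the check. It assumes intervals are half-open (`start` inclusive, `end` exclusive) and models them as pairs of Python dates. The helper name `overlaps` is hypothetical and is not Druid's actual API.

```python
from datetime import date

def overlaps(a, b):
    """Return True if half-open intervals a = [start, end) and b = [start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

# Load rule 2019-02-01/P1M, written out as [2019-02-01, 2019-03-01).
load_rule = (date(2019, 2, 1), date(2019, 3, 1))

# The single existing segment from the description.
old_segment = (date(2019, 1, 20), date(2019, 2, 10))

# The segment starts before the rule's interval but still intersects it,
# so in this model it is selected for loading as a whole.
segment_is_loaded = overlaps(old_segment, load_rule)
```

In this model the segment is either loaded in full or not at all, which is why the days in late January come along with it.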
   
   Now you use batch ingestion (say, native batch ingestion with `ingestSegment` applying a filter) with output segment granularity `DAY` over the range `2019-01-01/2019-02-05`.  This will produce segments for each of the days in January plus the first four days in February.
   
   The 4 February segments will be loaded onto historicals by the load rules, but the 31 January segments will not.  This means that queries against January 20 through January 31 are still answered from the old segment, so they return the "old" data from before the re-ingestion, not the new data!
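
The scenario can be reproduced with a small toy model. This is only a sketch under stated assumptions: segments are `(start, end, version)` tuples with half-open intervals, versions are the made-up strings `"v1"` and `"v2"` compared lexicographically, and for each day the loaded segment with the highest version wins. None of the names below come from Druid itself.

```python
from datetime import date, timedelta

def overlaps(a, b):
    """Half-open interval intersection test on (start, end, ...) tuples."""
    return a[0] < b[1] and b[0] < a[1]

RULE = (date(2019, 2, 1), date(2019, 3, 1))  # 2019-02-01/P1M

# Old segment, plus 35 new DAY segments covering 2019-01-01/2019-02-05.
old = (date(2019, 1, 20), date(2019, 2, 10), "v1")
first_day = date(2019, 1, 1)
new = [
    (first_day + timedelta(days=i), first_day + timedelta(days=i + 1), "v2")
    for i in range(35)
]

# Only segments that overlap the load rule end up on historicals.
loaded = [s for s in [old] + new if overlaps(s, RULE)]

def served_version(day):
    """Version of the loaded segment that answers queries for `day`, or None."""
    candidates = [s for s in loaded if s[0] <= day < s[1]]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s[2])[2]

late_january = served_version(date(2019, 1, 25))    # stale: only the old segment covers it
early_february = served_version(date(2019, 2, 2))   # new data overshadows the old segment
```

In this model, `late_january` is `"v1"` even though a `"v2"` segment exists for that day, while `early_february` is `"v2"`.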
   
   Moreover, if the cluster is configured to automatically kill unloaded segments, the new January segments will be permanently deleted.  If you later change the load rules to cover the January part of the old segment's interval, those intervals will keep returning "old" data even though they are now covered by load rules.  And because the new segments have been killed, they won't get combined with the older segment when automatic compaction happens.
   
   ### Potential fixes
   
   I can think of a few classes of fixes:
   
   #### Change load rule semantics
   
   Change load rules so that they also load any segment that overlaps with, and overshadows, a segment that is already loaded.  This is relatively simple and only affects one part of the code, and solves both the "immediate queries give old data" and the "increasing load periods later actively loads old data" issues.
   
   My main concern is that people might be confused by the fact that segments are loaded which don't themselves match load rules and think something is wrong with their configuration. (Maybe we could show an explanation in the new console UI when these situations happen?)
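
The changed semantics proposed in this section could be sketched roughly as follows. This is a toy model, not coordinator code: segments are `(start, end, version)` tuples, "overshadows" is approximated as "overlaps and has a higher version string", and the function name `select_loaded` is hypothetical.

```python
from datetime import date, timedelta

def overlaps(a, b):
    """Half-open interval intersection test on (start, end, ...) tuples."""
    return a[0] < b[1] and b[0] < a[1]

def select_loaded(segments, rule):
    """Load segments matching the rule, plus any segment that overlaps
    an already-loaded segment with a lower version (repeat until stable)."""
    loaded = [s for s in segments if overlaps(s, rule)]
    changed = True
    while changed:
        changed = False
        for s in segments:
            if s in loaded:
                continue
            if any(overlaps(s, l) and s[2] > l[2] for l in loaded):
                loaded.append(s)
                changed = True
    return loaded

RULE = (date(2019, 2, 1), date(2019, 3, 1))  # 2019-02-01/P1M
old = (date(2019, 1, 20), date(2019, 2, 10), "v1")
first_day = date(2019, 1, 1)
new = [
    (first_day + timedelta(days=i), first_day + timedelta(days=i + 1), "v2")
    for i in range(35)
]

loaded = select_loaded([old] + new, RULE)
```

With the example data, the new segments for January 20 through January 31 are pulled in because they overlap the loaded old segment, while the segments for January 1 through January 19 stay unloaded.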
   
   #### Apply load rules in more places
   
   When loading a segment that only partially overlaps with the load rules, the coordinator could tell historicals to pretend the segment is smaller than it actually is, covering only the part that falls inside the load rules.
   
   This would involve changing more places than "change load rule semantics", and it would not fix the "unused segments can be killed and then if load rules are changed later, old data will be served" issue.
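
As a rough illustration of the "pretend the segment is smaller" idea, the interval a historical announces could be the intersection of the segment's interval with the load rule's interval. The sketch below assumes half-open intervals and a single load rule. The `clip` helper is hypothetical and only shows the arithmetic, not how Druid would actually implement it.

```python
from datetime import date

def clip(segment, rule):
    """Return the part of `segment` inside `rule` as (start, end), or None if empty."""
    start = max(segment[0], rule[0])
    end = min(segment[1], rule[1])
    return (start, end) if start < end else None

RULE = (date(2019, 2, 1), date(2019, 3, 1))          # 2019-02-01/P1M
old_segment = (date(2019, 1, 20), date(2019, 2, 10))
january_segment = (date(2019, 1, 25), date(2019, 1, 26))

visible_old = clip(old_segment, RULE)          # only the February part remains
visible_january = clip(january_segment, RULE)  # nothing left to serve
```

Under this model the old segment would no longer answer queries for late January at all, so those queries would return no data rather than stale data.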
