Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/08/08 04:21:50 UTC

[GitHub] jihoonson commented on issue #6124: KafkaIndexTask can delete published segments on restart

URL: https://github.com/apache/incubator-druid/issues/6124#issuecomment-411279909
 
 
   > This doesn't seem right: there is code specifically to handle the case where a task tries to publish a segment that some task already published. It happens all the time with replicas, and they just ignore that segment and move on to the next one.
   
   Yes, correct. The task doesn't fail at this point; it just fails to update the metadata store.
   
   > I wonder if the real reason for publish failure is that the startMetadata doesn't match up. I bet it wouldn't match up: it sounds like the task is trying to publish from the point it originally started from rather than from the point it last published.
   
   Here is the [code](https://github.com/apache/incubator-druid/blob/master/server/src/main/java/io/druid/segment/realtime/appenderator/BaseAppenderatorDriver.java#L555-L581) where this error happens.
   
   ```java
   final boolean published = publisher.publishSegments(
       ImmutableSet.copyOf(segmentsAndMetadata.getSegments()),
       metadata == null ? null : ((AppenderatorDriverMetadata) metadata).getCallerMetadata()
   );

   if (published) {
     log.info("Published segments.");
   } else {
     log.info("Transaction failure while publishing segments, checking if someone else beat us to it.");
     final Set<SegmentIdentifier> segmentsIdentifiers = segmentsAndMetadata
         .getSegments()
         .stream()
         .map(SegmentIdentifier::fromDataSegment)
         .collect(Collectors.toSet());
     if (usedSegmentChecker.findUsedSegments(segmentsIdentifiers)
                           .equals(Sets.newHashSet(segmentsAndMetadata.getSegments()))) {
       log.info(
           "Removing our segments from deep storage because someone else already published them: %s",
           segmentsAndMetadata.getSegments()
       );
       segmentsAndMetadata.getSegments().forEach(dataSegmentKiller::killQuietly);

       log.info("Our segments really do exist, awaiting handoff.");
     } else {
       throw new ISE("Failed to publish segments[%s]", segmentsAndMetadata.getSegments());
     }
   }
   ```
   
   `published` is false, so the task checks whether the segments it failed to publish are already used, via `usedSegmentChecker` (the `ActionBasedUsedSegmentChecker` in this case). `usedSegmentChecker.findUsedSegments(segmentsIdentifiers).equals(Sets.newHashSet(segmentsAndMetadata.getSegments()))` returns false, so the task throws an exception. This happens because the restarted task generated more segments after restarting; I've verified this by comparing the segments being published before and after the restart.
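
   As a rough illustration (a minimal sketch with made-up string IDs standing in for the real `SegmentIdentifier` objects), this is why the equality check fails once the restarted task has generated an extra segment:

   ```java
   import com.google.common.collect.ImmutableSet;

   import java.util.Set;

   public class EqualityCheckSketch
   {
     public static void main(String[] args)
     {
       // Hypothetical IDs: segments the pre-restart task already published,
       // i.e. what usedSegmentChecker.findUsedSegments() would return here.
       final Set<String> usedSegments = ImmutableSet.of("seg_0", "seg_1");

       // Segments the restarted task is trying to publish: the original two
       // plus a new one generated after the restart.
       final Set<String> publishingSegments = ImmutableSet.of("seg_0", "seg_1", "seg_2");

       // The current check demands exact set equality, so it fails here and
       // the task throws ISE("Failed to publish segments[...]").
       System.out.println(usedSegments.equals(publishingSegments)); // prints "false"
     }
   }
   ```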
   
   I didn't see any logs related to metadata mismatch. 
   
   > In (7) the task should not have removed the published segments (this is the biggest bug).
   
   Not sure about this. Skipping the removal could make segment publishing non-atomic and leave garbage segments in deep storage. Probably solving (3) would be enough, since this situation should never happen in the first place?
   
   > In (3) the task should have done something smarter instead of restoring a setup that couldn't possibly work out.
   
   Yes, I think we should persist the task's state to local disk so that it reflects what the task was doing. Also, for `usedSegmentChecker.findUsedSegments(segmentsIdentifiers).equals(Sets.newHashSet(segmentsAndMetadata.getSegments()))`, we could change this to check whether `usedSegments` _includes_ `segmentsAndMetadata.getSegments()` and continue publishing the segments not in `usedSegments`, as sketched below.
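
   A rough sketch of that containment-based check, again with hypothetical string IDs rather than real `SegmentIdentifier`s, using Guava's `containsAll`/`Sets.difference` as one possible way to express it:

   ```java
   import com.google.common.collect.ImmutableSet;
   import com.google.common.collect.Sets;

   import java.util.Set;

   public class ContainmentCheckSketch
   {
     public static void main(String[] args)
     {
       // Hypothetical IDs standing in for SegmentIdentifier instances.
       final Set<String> publishingSegments = ImmutableSet.of("seg_0", "seg_1", "seg_2");
       final Set<String> usedSegments = ImmutableSet.of("seg_0", "seg_1"); // already published

       if (usedSegments.containsAll(publishingSegments)) {
         // Everything was already published by the pre-restart run; treat the
         // publish as a success and await handoff.
         System.out.println("All segments already published, awaiting handoff.");
       } else if (publishingSegments.containsAll(usedSegments)) {
         // Only some segments were published before the restart; continue
         // publishing the remainder instead of failing the task.
         final Set<String> remaining = Sets.difference(publishingSegments, usedSegments);
         System.out.println("Continuing to publish remaining segments: " + remaining);
       } else {
         // Inconsistent overlap; something else went wrong.
         throw new IllegalStateException("Failed to publish segments " + publishingSegments);
       }
     }
   }
   ```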
