You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/06/03 15:55:43 UTC

[GitHub] [accumulo] EdColeman commented on a change in pull request #2132: Make DeadCompactionDetector handle network hiccups

EdColeman commented on a change in pull request #2132:
URL: https://github.com/apache/accumulo/pull/2132#discussion_r644918105



##########
File path: server/compaction-coordinator/src/main/java/org/apache/accumulo/coordinator/DeadCompactionDetector.java
##########
@@ -84,21 +97,42 @@ private void detectDeadCompactions() {
 
     running.forEach((ecid) -> {
       if (tabletCompactions.remove(ecid) != null) {
-        log.trace("Removed {} running on a compactor", ecid);
+        log.trace("Removed compaction {} running on a compactor", ecid);
+      }
+      if (this.deadCompactions.remove(ecid) != null) {
+        log.trace("Removed {} from the dead compaction map, it's running on a compactor", ecid);
       }
     });
 
     // Determine which compactions are currently committing and remove those
     context.getAmple().getExternalCompactionFinalStates()
-        .map(ecfs -> ecfs.getExternalCompactionId()).forEach(tabletCompactions::remove);
+        .map(ecfs -> ecfs.getExternalCompactionId()).forEach(ecid -> {
+          if (tabletCompactions.remove(ecid) != null) {
+            log.trace("Removed compaction {} that is committing", ecid);
+          }
+          if (this.deadCompactions.remove(ecid) != null) {
+            log.trace("Removed {} from the dead compaction map, it's committing", ecid);
+          }
+        });
 
-    tabletCompactions
-        .forEach((ecid, extent) -> log.debug("Detected dead compaction {} {}", ecid, extent));
+    tabletCompactions.forEach((ecid, extent) -> {
+      log.debug("Possible dead compaction detected {} {}", ecid, extent);
+      this.deadCompactions.putIfAbsent(ecid, System.currentTimeMillis());
+      this.deadCompactions.merge(ecid, 1L, Long::sum);
+    });
 
     // Everything left in tabletCompactions is no longer running anywhere and should be failed.
     // Its possible that a compaction committed while going through the steps above, if so then
     // that is ok and marking it failed will end up being a no-op.
+    Set<ExternalCompactionId> toFail =
+        this.deadCompactions.entrySet().stream().filter(e -> e.getValue() > 2).map(e -> e.getKey())
+            .collect(Collectors.toCollection(TreeSet::new));
+    tabletCompactions.keySet().retainAll(toFail);
+    tabletCompactions.forEach((eci, v) -> {
+      log.warn("Compaction {} believed to be dead, failing it.", eci);

Review comment:
       Why would that be expected?  If something is repeatedly dying I'd think that would indicate that some action / investigation might be warranted.  If I had a compaction that was taking a long time - and it was because these kept dying, how would I know?
   
   Recently had a discussion  related to the trace level statements that are associated with this particular log message, wondering if they might not be better at debug (at least for some of them) - my impression is that dead / dying and removing the entries should not be common - and if they are then something should be fixed so they are not.   It might not be obvious that if you are having issues with external compaction you need to set logging to trace to obtain enough info to figure out what the issue might be.  I realize that we often log to much, but from looking at the code without understanding the frequency / how common / what a corrective approach might be - some of this code might benefit from more logging at debug or higher levels - just my initial impression, would probability need to run and then have to actually try to debug something before having something concrete to add to the discussion.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org