You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/06/29 14:19:57 UTC

[GitHub] [accumulo] dlmarion opened a new pull request, #2792: closes #1377 - ensure all tables are checked ...

dlmarion opened a new pull request, #2792:
URL: https://github.com/apache/accumulo/pull/2792

   ... when removing references to candidates that are still in use.
   
   This is done by getting tableIds from zookeeper at the start of the
   reference scan and at the end of the reference scan.  During the
   reference scans, the list of tableID is tracked.  After all candidates
   that are still in use have been removed, there is a one more consitency
   check to ensure we looked at all tables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r970057023


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   I have not looked at that PR in a long time.   The DataLevel enum did not exist in 1.10 and the GC was different.  In 1.10 the Accumulo GC process did not handle the the root tablet, it was treated specially in tserver code.  Changes were made to have less special cases for the root tablet and store its metadata in ZK and treat it like a normal tablet.  Part of these changes made the GC process handle the root tablet.  So I am thinking the changes I suggested are good for the new way the GC works in 2.1. I will look at this again in depth in an IDE tomorrow to verify I am on the right track.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972356614


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);
+      // maybe a scan failed?
+      throw new RuntimeException(
+          "Saw table IDs in ZK that were not in metadata table:  " + tableIdsMustHaveSeen);
+    }

Review Comment:
   > The canonical determination of whether a table exists or not is that it has an entry in ZK... this is created before metadata entries, and is the last thing removed when a table is deleted.
   
   Good catch, we need to consider table states to avoid this race condition.  I mentioned table states in #1377, but its been so long I had completely forgotten about that edge case and I did not reread the issue until now.
   
   When a table is created the following is done.
   
    1. table is put in ZK w/ TableState.NEW
    2. metadata table is populated
    3. tables state is set to TableState.ONLINE or TableState.OFFLINE
   
   When a table is deleted the following is done.
   
    1. Table state is set to TableState.DELETING
    2. entries are removed from metadata table
    3. entries are removed from ZK
   
   So from the perspective of GC, if we see a table with a state of TableState.ONLINE or TableState.OFFLINE before and after scanning the metadata table, then it must be seen in the metadata table unless there is a problem.
   
   We need to get a `Map<TableId,TableState>` to properly do this check.
   
   
   
     



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974132266


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +502,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>();
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      getTableIDs().forEach((k, v) -> {
+        if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+          // Don't return tables that are NEW, DELETING, or in an
+          // UNKNOWN state.
+          tableIds.add(k);
+        }
+      });

Review Comment:
   Yeah, that was dumb on my part. I should have seen that. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] DomGarguilo commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r956210539


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   *
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      if (!tableIdsBefore.isEmpty() && tableIdsAfter.isEmpty()) {
+        throw new RuntimeException("ZK returned no table ids after scanning for references,"
+            + " maybe all the tables were deleted");
+      } else {
+        // we saw no table ids in ZK but did in the metadata table. This is unexpected.
+        throw new RuntimeException(
+            "Saw no table ids in ZK but did see table ids in metadata table: " + tableIdsSeen);
+      }
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);

Review Comment:
   ```suggestion
       tableIdsMustHaveSeen = Sets.difference(tableIdsMustHaveSeen, tableIdsSeen);
   ```
   Will need to use this instead of `removeAll` if you decide to use `Sets.intersection` to create `tableIdsMustHaveSeen` (as suggested above)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] milleruntime commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
milleruntime commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r956139980


##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {

Review Comment:
   ```suggestion
     public void testMissingTableIds() {
   ```



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan
+    gce.tableIds.addAll(Arrays.asList(TableId.of("b"), TableId.of("d")));
+
+    String msg = assertThrows(RuntimeException.class, () -> gca.collect(gce)).getMessage();
+    assertTrue((msg.contains("[b, d]") || msg.contains("[d, b]"))
+        && msg.contains("Saw table IDs in ZK that were not in metadata table:"), msg);
+  }
+
+  // below are tests for potential failure conditions of the GC process. Some of these cases were
+  // observed on clusters. Some were hypothesis based on observations. The result was that
+  // candidate entries were not removed when they should have been and therefore files were
+  // removed from HDFS that were actually still in use
+
+  private Set<TableId> makeUnmodifiableSet(String... args) {
+    Set<TableId> t = new HashSet<>();
+    for (String arg : args) {
+      t.add(TableId.of(arg));
+    }
+    return Collections.unmodifiableSet(t);
+  }
+
+  @Test
+  public void testNormalGCRun() {
+    // happy path, no tables added or removed during this portion and all the tables checked
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddle() {
+    // table was added during this portion and we don't see it, should be fine
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddleTwo() {
+    // table was added during this portion and we DO see it
+    // Means table was added after candidates were grabbed, so there should be nothing to remove
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddle() {
+    // table was deleted during this portion and we don't see it
+    // this mean any candidates from the deleted table wil stay on the candidate list
+    // and during the delete step they will try to removed
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddleTwo() {
+    // table was deleted during this portion and we DO see it
+    // this mean candidates from the deleted table may get removed from the candidate list
+    // which should be ok, as the delete table function should be responsible for removing those
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testMissEntireTable() {
+    // this test simulates missing an entire table when looking for what files are in use
+    // if you add custom splits to the metadata at able boundaries, this can happen with a failed
+    // scan
+    // recall the ~tab:~pr for this first entry of a new table is empty, so there is now way to
+    // check the prior row. If you split a couple of tables in the metadata the table boundary
+    // , say table ids 2,3,4, and then miss scanning table 3 but get 4, it is possible other
+    // consistency checks will miss this

Review Comment:
   Each of these comments describing what the tests do could be put in the javadoc of the test. I think this would help fix the formatting as well.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   *
+   */

Review Comment:
   Should put something here...
   
   ```suggestion
     /**
      * Double check no tables were missed during GC.
      */
   ```



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan

Review Comment:
   ```suggestion
       // add 2 more table references that will not be seen by the scan
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974650660


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +180,47 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          try {
+            byte[] state = zr.getData(statePath, null, null);
+            if (state == null) {
+              tableState = TableState.UNKNOWN;
+            } else {
+              tableState = TableState.valueOf(new String(state, UTF_8));
+            }
+          } catch (KeeperException e) {
+            if (e.code() == Code.NONODE) {
+              tableState = TableState.UNKNOWN;
+            }
+            throw e;

Review Comment:
   Yeah, I think I'm rushing.....



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r970717082


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   Implemented in a61969020f7ca916465811bac22b52874c3f02cd.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977502934


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   I added a comment in 18b0c28, right above where this error is thrown, explaining that throwing this error does not terminate the GC process. It only terminates this loop and then gets logged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975446459


##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -122,6 +158,28 @@ public void incrementInUseStat(long i) {}
     public Iterator<Entry<String,Status>> getReplicationNeededIterator() {
       return filesToReplicate.entrySet().iterator();
     }
+
+    @Override
+    public Set<TableId> getCandidateTableIDs() {
+      if (level == Ample.DataLevel.ROOT) {
+        return Set.of(RootTable.ID);
+      } else if (level == Ample.DataLevel.METADATA) {
+        return Collections.singleton(MetadataTable.ID);
+      } else if (level == Ample.DataLevel.USER) {
+        tableIds.remove(MetadataTable.ID);
+        tableIds.remove(RootTable.ID);
+        getTableIDs().forEach((k, v) -> {
+          if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+            // Don't return tables that are NEW, DELETING, or in an
+            // UNKNOWN state.
+            tableIds.add(k);
+          }
+        });
+        return tableIds;

Review Comment:
   It would be good to avoid modifying the instance var tableIds.
   
   ```suggestion
           Set<TableId> tableIds = new HashSet<>();
           getTableIDs().forEach((k, v) -> {
             if (v == TableState.ONLINE || v == TableState.OFFLINE) {
               // Don't return tables that are NEW, DELETING, or in an
               // UNKNOWN state.
               tableIds.add(k);
             }
           });
           tableIds.remove(MetadataTable.ID);
           tableIds.remove(RootTable.ID);
           return tableIds;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975761245


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +180,44 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          try {
+            byte[] state = zr.getData(statePath, null, null);
+            if (state == null) {
+              tableState = TableState.UNKNOWN;
+            } else {
+              tableState = TableState.valueOf(new String(state, UTF_8));
+            }
+          } catch (NoNodeException e) {
+            tableState = TableState.UNKNOWN;
+          }
+          tids.put(tableId, tableState);
+        }
+        return tids;
+      } catch (KeeperException e) {
+        retries++;
+        if (ioe == null) {
+          ioe = new UncheckedIOException(new IOException("Error getting table ids from ZooKeeper"));

Review Comment:
   Addressed in 0291190a8296327dd26af96b1299580fc29d54ad



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>();
+      getTableIDs().forEach((k, v) -> {
+        if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+          // Don't return tables that are NEW, DELETING, or in an
+          // UNKNOWN state.
+          tableIds.add(k);
+        }
+      });
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      return tableIds;
+    } else {
+      throw new IllegalArgumentException("Unexpected Table in GC Env: " + this.level.name());

Review Comment:
   Addressed in 0291190a8296327dd26af96b1299580fc29d54ad



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969964642


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   So, this PR is a port of https://github.com/apache/accumulo/pull/2293 to 2.1. Looking back through the comments there, this method is used solely to check that the tableIds before and after the scan are the same. Getting the candidates for removal is done in another method which I think has not changed. Looked at the other PR, you commented on it and said it looked ok.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r957281388


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -167,7 +172,20 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+        continue;

Review Comment:
   Addressed in c4e0ab918bf002c83d4200af67df0578aff44e73



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,29 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return true if this for the RootTable, usually setup in the constructor
+   *
+   * @return true if the GCE is for the RootTable
+   */
+  boolean isRootTable();
+
+  /**
+   * Return true if this for the MetadataTable, usually setup in the constructor
+   *
+   * @return true if the GCE is for the MetadataTable
+   */
+  boolean isMetadataTable();
+
+  /**
+   * Return a list of TableIDs for which we are considering deletes. For the root table this would
+   * be the metadata table. For the metadata table, this would be all of the other tables in the

Review Comment:
   Addressed in c4e0ab918bf002c83d4200af67df0578aff44e73



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969621782


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   This feels a bit off.  Think it might need to be the following.
   
   ```suggestion
       if (level == ROOT) {
          return Set.of(RootTable.ID);
       } else if(level == METADATA){
         return Set.of(MetadataTable.ID);
       } else if (level == USER) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] DomGarguilo commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r956209314


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   *
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);

Review Comment:
   ```suggestion
       Set<TableId> tableIdsMustHaveSeen = Sets.intersection(tableIdsBefore, tableIdsAfter);
   ```
   Could use this here to make it a bit more concise and match the comment.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   *
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      if (!tableIdsBefore.isEmpty() && tableIdsAfter.isEmpty()) {
+        throw new RuntimeException("ZK returned no table ids after scanning for references,"
+            + " maybe all the tables were deleted");
+      } else {
+        // we saw no table ids in ZK but did in the metadata table. This is unexpected.
+        throw new RuntimeException(
+            "Saw no table ids in ZK but did see table ids in metadata table: " + tableIdsSeen);
+      }
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);

Review Comment:
   ```suggestion
       tableIdsMustHaveSeen = Sets.difference(tableIdsMustHaveSeen, tableIdsSeen);
   ```
   Will need to use this instead of remove if you decide to use `Sets.intersection` to create `tableIdsMustHaveSeen` (as suggested above)



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan
+    gce.tableIds.addAll(Arrays.asList(TableId.of("b"), TableId.of("d")));

Review Comment:
   ```suggestion
       gce.tableIds.addAll(makeUnmodifiableSet("b", "d"));
   ```
   Could use this method here



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan
+    gce.tableIds.addAll(Arrays.asList(TableId.of("b"), TableId.of("d")));
+
+    String msg = assertThrows(RuntimeException.class, () -> gca.collect(gce)).getMessage();
+    assertTrue((msg.contains("[b, d]") || msg.contains("[d, b]"))
+        && msg.contains("Saw table IDs in ZK that were not in metadata table:"), msg);
+  }
+
+  // below are tests for potential failure conditions of the GC process. Some of these cases were
+  // observed on clusters. Some were hypothesis based on observations. The result was that
+  // candidate entries were not removed when they should have been and therefore files were
+  // removed from HDFS that were actually still in use
+
+  private Set<TableId> makeUnmodifiableSet(String... args) {
+    Set<TableId> t = new HashSet<>();
+    for (String arg : args) {
+      t.add(TableId.of(arg));
+    }
+    return Collections.unmodifiableSet(t);

Review Comment:
   ```suggestion
       return Arrays.stream(args).map(TableId::of).collect(Collectors.toUnmodifiableSet());
   ```



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan
+    gce.tableIds.addAll(Arrays.asList(TableId.of("b"), TableId.of("d")));
+
+    String msg = assertThrows(RuntimeException.class, () -> gca.collect(gce)).getMessage();
+    assertTrue((msg.contains("[b, d]") || msg.contains("[d, b]"))
+        && msg.contains("Saw table IDs in ZK that were not in metadata table:"), msg);
+  }
+
+  // below are tests for potential failure conditions of the GC process. Some of these cases were
+  // observed on clusters. Some were hypothesis based on observations. The result was that
+  // candidate entries were not removed when they should have been and therefore files were
+  // removed from HDFS that were actually still in use
+
+  private Set<TableId> makeUnmodifiableSet(String... args) {
+    Set<TableId> t = new HashSet<>();
+    for (String arg : args) {
+      t.add(TableId.of(arg));
+    }
+    return Collections.unmodifiableSet(t);
+  }
+
+  @Test
+  public void testNormalGCRun() {
+    // happy path, no tables added or removed during this portion and all the tables checked
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddle() {
+    // table was added during this portion and we don't see it, should be fine
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddleTwo() {
+    // table was added during this portion and we DO see it
+    // Means table was added after candidates were grabbed, so there should be nothing to remove
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddle() {
+    // table was deleted during this portion and we don't see it
+    // this mean any candidates from the deleted table wil stay on the candidate list
+    // and during the delete step they will try to removed
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddleTwo() {
+    // table was deleted during this portion and we DO see it
+    // this mean candidates from the deleted table may get removed from the candidate list
+    // which should be ok, as the delete table function should be responsible for removing those
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testMissEntireTable() {
+    // this test simulates missing an entire table when looking for what files are in use
+    // if you add custom splits to the metadata at able boundaries, this can happen with a failed
+    // scan
+    // recall the ~tab:~pr for this first entry of a new table is empty, so there is now way to
+    // check the prior row. If you split a couple of tables in the metadata the table boundary
+    // , say table ids 2,3,4, and then miss scanning table 3 but get 4, it is possible other
+    // consistency checks will miss this
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("1", "2", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "3", "4");
+
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+    String msg = assertThrows(RuntimeException.class,
+        () -> gca.ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter)).getMessage();
+    assertTrue(msg.startsWith("Saw table IDs in ZK that were not in metadata table:"));
+
+  }
+
+  @Test
+  public void testZKHadNoTables() {
+    // this test simulates getting nothing from ZK for table ids, which should not happen,
+    // but just in case let's test
+    Set<TableId> tablesBefore = makeUnmodifiableSet();
+    Set<TableId> tablesSeen = makeUnmodifiableSet("1", "2");
+    Set<TableId> tablesAfter = makeUnmodifiableSet();
+
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+    String msg = assertThrows(RuntimeException.class,
+        () -> gca.ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter)).getMessage();
+    assertTrue(msg.startsWith("Saw no table ids in ZK but did see table ids in metadata table:"));
+  }
+
+  @Test
+  public void testMissingTableAndTableAdd() {
+    // simulates missing a table when checking references, and a table being added
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("1", "2", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "3", "4");
+
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+    String msg = assertThrows(RuntimeException.class,
+        () -> gca.ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter)).getMessage();
+    assertTrue(msg.equals("Saw table IDs in ZK that were not in metadata table:  [3]"));
+  }
+
+  @Test
+  public void testMissingTableAndTableDeleted() {
+    // simulates missing a table when checking references, and a table being deleted
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("1", "2", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "3");
+
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+    String msg = assertThrows(RuntimeException.class,
+        () -> gca.ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter)).getMessage();
+    assertTrue(msg.equals("Saw table IDs in ZK that were not in metadata table:  [3]"));

Review Comment:
   ```suggestion
       assertEquals("Saw table IDs in ZK that were not in metadata table:  [3]", msg);
   ```
   Can be simplified to assertEquals



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan
+    gce.tableIds.addAll(Arrays.asList(TableId.of("b"), TableId.of("d")));
+
+    String msg = assertThrows(RuntimeException.class, () -> gca.collect(gce)).getMessage();
+    assertTrue((msg.contains("[b, d]") || msg.contains("[d, b]"))
+        && msg.contains("Saw table IDs in ZK that were not in metadata table:"), msg);
+  }
+
+  // below are tests for potential failure conditions of the GC process. Some of these cases were
+  // observed on clusters. Some were hypothesis based on observations. The result was that
+  // candidate entries were not removed when they should have been and therefore files were
+  // removed from HDFS that were actually still in use
+
+  private Set<TableId> makeUnmodifiableSet(String... args) {
+    Set<TableId> t = new HashSet<>();
+    for (String arg : args) {
+      t.add(TableId.of(arg));
+    }
+    return Collections.unmodifiableSet(t);
+  }
+
+  @Test
+  public void testNormalGCRun() {
+    // happy path, no tables added or removed during this portion and all the tables checked
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddle() {
+    // table was added during this portion and we don't see it, should be fine
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddleTwo() {
+    // table was added during this portion and we DO see it
+    // Means table was added after candidates were grabbed, so there should be nothing to remove
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddle() {
+    // table was deleted during this portion and we don't see it
+    // this mean any candidates from the deleted table wil stay on the candidate list
+    // and during the delete step they will try to removed
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddleTwo() {
+    // table was deleted during this portion and we DO see it
+    // this mean candidates from the deleted table may get removed from the candidate list
+    // which should be ok, as the delete table function should be responsible for removing those
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testMissEntireTable() {
+    // this test simulates missing an entire table when looking for what files are in use
+    // if you add custom splits to the metadata at able boundaries, this can happen with a failed
+    // scan
+    // recall the ~tab:~pr for this first entry of a new table is empty, so there is now way to
+    // check the prior row. If you split a couple of tables in the metadata the table boundary
+    // , say table ids 2,3,4, and then miss scanning table 3 but get 4, it is possible other
+    // consistency checks will miss this
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("1", "2", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "3", "4");
+
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+    String msg = assertThrows(RuntimeException.class,
+        () -> gca.ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter)).getMessage();
+    assertTrue(msg.startsWith("Saw table IDs in ZK that were not in metadata table:"));
+
+  }
+
+  @Test
+  public void testZKHadNoTables() {
+    // this test simulates getting nothing from ZK for table ids, which should not happen,
+    // but just in case let's test
+    Set<TableId> tablesBefore = makeUnmodifiableSet();
+    Set<TableId> tablesSeen = makeUnmodifiableSet("1", "2");
+    Set<TableId> tablesAfter = makeUnmodifiableSet();
+
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+    String msg = assertThrows(RuntimeException.class,
+        () -> gca.ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter)).getMessage();
+    assertTrue(msg.startsWith("Saw no table ids in ZK but did see table ids in metadata table:"));
+  }
+
+  @Test
+  public void testMissingTableAndTableAdd() {
+    // simulates missing a table when checking references, and a table being added
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("1", "2", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "3", "4");
+
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+    String msg = assertThrows(RuntimeException.class,
+        () -> gca.ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter)).getMessage();
+    assertTrue(msg.equals("Saw table IDs in ZK that were not in metadata table:  [3]"));

Review Comment:
   ```suggestion
       assertEquals("Saw table IDs in ZK that were not in metadata table:  [3]", msg);
   ```
   Can be simplified to assertEquals



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974623516


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   The ZK javadoc for getData could be improved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969790741


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      if (!tableIdsBefore.isEmpty() && tableIdsAfter.isEmpty()) {
+        throw new RuntimeException("ZK returned no table ids after scanning for references,"
+            + " maybe all the tables were deleted");
+      } else {
+        // we saw no table ids in ZK but did in the metadata table. This is unexpected.
+        throw new RuntimeException(
+            "Saw no table ids in ZK but did see table ids in metadata table: " + tableIdsSeen);
+      }
+    }

Review Comment:
   There are few possible interleavings that could make different empty sets possible under normal circumstances.
   
   The following would result in tablesIdsBefore and tableIdAfter being empty while tableIdsSeen is not empty.
   
    1. tableIdsBefore is read from ZK and is empty
    2. a new table is created 
    3. tableIdsSeen is read from metadata table and is not empty
    4. the table single is deleted
    5. tableIdsAfter is read from ZK and is empty
   
   The following will result in only tableIdsBefore being empty
   
    1. tableIdsBefore is read from ZK and is empty
    2. a new table is created 
    3. tableIdsSeen is read from metadata table and is not empty
    4. tableIdsAfter is read from ZK and is not empty
   
   The following will result in only tableIdsAfter being empty
   
    1. a new table is created 
    2. tableIdsBefore is read from ZK and is not empty
    4. tableIdsSeen is read from metadata table and is not empty
    5. the table single is deleted
    6. tableIdsAfter is read from ZK and is empty
   
   Given that there are multiple transient situations that lead to different empty sets I think we can have one error message like the following.  Its long, but I didn't attempt to wrap as I typed it up in GH.
   
   ```suggestion
       if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
           throw new RuntimeException("Garbage collection will not proceed because table ids were seen in the metadata table and none were seen Zookeeper.  This can have two causes.  First, total number of tables going to/from zero during a GC cycle will cause this. Second, it could be caused by corruption of the metadata table and/or Zookeeper. Only the second cause is problematic, but there is not way to distinguish between the two causes so this GC cycle will not proceed.  The first cause should be transient and one would not expect to see this message repeated in subsequent GC cycles. ");
       }
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969895280


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      if (!tableIdsBefore.isEmpty() && tableIdsAfter.isEmpty()) {
+        throw new RuntimeException("ZK returned no table ids after scanning for references,"
+            + " maybe all the tables were deleted");
+      } else {
+        // we saw no table ids in ZK but did in the metadata table. This is unexpected.
+        throw new RuntimeException(
+            "Saw no table ids in ZK but did see table ids in metadata table: " + tableIdsSeen);
+      }
+    }

Review Comment:
   Addressed in 8a21067b7e41e2d3d1be8c9e7b1e8ee4e7f6777e



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1246665001

   @keith-turner - I implemented your changes. I kind of expected a test failure, but that didn't happen.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972356614


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);
+      // maybe a scan failed?
+      throw new RuntimeException(
+          "Saw table IDs in ZK that were not in metadata table:  " + tableIdsMustHaveSeen);
+    }

Review Comment:
   > The canonical determination of whether a table exists or not is that it has an entry in ZK... this is created before metadata entries, and is the last thing removed when a table is deleted.
   
   Good catch, we need to consider table states to avoid this problem.  I mentioned table states in #1377, but its been so long I had completely forgotten about that edge case and I did not reread the issue until now.
   
   When a table is created the following is done.
   
    1. table is put in ZK w/ TableState.NEW
    2. metadata table is populated
    3. tables state is set to TableState.ONLINE or TableState.OFFLINE
   
   When a table is deleted the following is done.
   
    1. Table state is set to TableState.DELETING
    2. entries are removed from metadata table
    3. entries are removed from ZK
   
   So from the perspective of GC, if we see a table with a state of TableState.ONLINE or TableState.OFFLINE before and after scanning the metadata table, then it must be seen in the metadata table unless there is a problem.
   
   We need to get a `Map<TableId,TableState>` to properly do this check.
   
   
   
     



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977593844


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   > We specifically have the initial GC delay shorter (30s) than the cycle delay (5m), which makes this more likely to occur if the first table they create is around the 30 second mark when the first GC cycle is starting to run.
   
   I was thinking through this situation wondering how probable it was and in the process determined that this particular situation is probably not possible.  For this case there would be no GC candidates.  I wasn't sure what the garbage collector did when there are no candidates.  I looked and found the following code.
   
   https://github.com/apache/accumulo/blob/3e1225f0db469f8fa76be7cc4e72961696b57a2f/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L314-L333
   
   Looking at that, if `candidatesIter` does not initially have a next then it will not scan for references and run the validation added in this PR.  I was happy to find the GC does no unnecessary work when there are no candidates.
   
   A user would need to at least do the following to possibly bump into this message.
   
    * create a table
    * initiate activity that creates gc candidates like major compactions
    * delete the table
   
   The delete would have to overlap in time with the GC reading tables ids from ZK and reading references from the metadata table.  For a system with hardly any tables or tablets reading them will be extremely fast making the probability of a one off user operation overlapping extremely small.  A user has to delete a table within a 5ms to 10ms time window that happens after 30s and then every 5 min to see this message.  So it seems unlikely someone would bump into this.  It would  be good to run through a situation like the above a few times and see what happens, I will give that a try.
   
   Where this message will likely be seen is on a system that has lots of tables and tablets (makes GC cycle take longer) and frequently goes to/from zero total tables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977593844


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   > We specifically have the initial GC delay shorter (30s) than the cycle delay (5m), which makes this more likely to occur if the first table they create is around the 30 second mark when the first GC cycle is starting to run.
   
   I was thinking through this situation wondering how probable and in the process determined that this particular situation is probably not possible.  For this case there would be no GC candidates.  I wasn't sure what the garbage collector did when there are no candidates.  I looked and found the following code.
   
   https://github.com/apache/accumulo/blob/3e1225f0db469f8fa76be7cc4e72961696b57a2f/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L314-L333
   
   Looking at that, if `candidatesIter` does not initially have a next then it will not scan for references and run the validation added in this PR.  I was happy to find the GC does no unnecessary work when there are no candidates.
   
   A user would need to at least do the following to possibly bump into this message.
   
    * create a table
    * initiate activity that creates gc candidates like major compactions
    * delete the table
   
   The delete would have to overlap in time with the GC reading tables ids from ZK and reading references from the metadata table.  For a system with hardly any tables or tablets reading them will be extremely fast making the probability of a one off user operation overlapping extremely small.  A user has to delete a table within a 5ms to 10ms time window that happens after 30s and then every 5 min to see this message.  So it seems unlikely someone would bump into this.  It would  be good to run through a situation like the above a few times and see what happens, I will give that a try.
   
   Where this message will likely be seen is on a system that has lots of tables and tablets (makes GC cycle take longer) and frequently goes to/from zero total tables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974440649


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   That retry at the end seemed more like something for zookeeper errors.  Was thinking the nonode exception is not a zookeeper problem, but a concurrency issue that can  be ignored (no need to redo everything).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1248420788

   > So, I'm not convinced this is the right approach.
   
   What's the implication of this statement? Are you saying that we need to go back to the drawing board? If I implement your suggested changes, is this good to go?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969742234


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   This change does not match the intended behavior described in the javadoc for the method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969777428


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }

Review Comment:
   Addressed in c592f3d13eec0a2ca97c252d3621bcd190bab876



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r973226315


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);
+      // maybe a scan failed?
+      throw new RuntimeException(
+          "Saw table IDs in ZK that were not in metadata table:  " + tableIdsMustHaveSeen);
+    }

Review Comment:
   Addressed the generic RTE in 1f8d37305789fa1540158e32b54da86a7af3ae87. Implemented the table state checking in the same commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r973226626


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {

Review Comment:
   I think I got this covered in 1f8d37305789fa1540158e32b54da86a7af3ae87. Take a look if you can.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974625599


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +180,47 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          try {
+            byte[] state = zr.getData(statePath, null, null);
+            if (state == null) {
+              tableState = TableState.UNKNOWN;
+            } else {
+              tableState = TableState.valueOf(new String(state, UTF_8));
+            }
+          } catch (KeeperException e) {
+            if (e.code() == Code.NONODE) {
+              tableState = TableState.UNKNOWN;
+            }
+            throw e;

Review Comment:
   Should this be if/else?
   
   ```suggestion
               if (e.code() == Code.NONODE) {
                 tableState = TableState.UNKNOWN;
               } else {
                  throw e;
                }
   ```
   
   May not matter if we change to use NoNodeException that I commented on in other thread.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975509167


##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -122,6 +158,28 @@ public void incrementInUseStat(long i) {}
     public Iterator<Entry<String,Status>> getReplicationNeededIterator() {
       return filesToReplicate.entrySet().iterator();
     }
+
+    @Override
+    public Set<TableId> getCandidateTableIDs() {
+      if (level == Ample.DataLevel.ROOT) {
+        return Set.of(RootTable.ID);
+      } else if (level == Ample.DataLevel.METADATA) {
+        return Collections.singleton(MetadataTable.ID);
+      } else if (level == Ample.DataLevel.USER) {
+        tableIds.remove(MetadataTable.ID);
+        tableIds.remove(RootTable.ID);
+        getTableIDs().forEach((k, v) -> {
+          if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+            // Don't return tables that are NEW, DELETING, or in an
+            // UNKNOWN state.
+            tableIds.add(k);
+          }
+        });
+        return tableIds;

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1252518193

   @ctubbsii  - do these recent changes resolve the changes you requested?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975761459


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or

Review Comment:
   Addressed in 0291190a8296327dd26af96b1299580fc29d54ad



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);

Review Comment:
   Addressed in 0291190a8296327dd26af96b1299580fc29d54ad



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1249134891

    >  It feels like we're shooting in the dark, adding a sanity check against a problem that is not well-defined.
   
   I think the validation outlined in #1377 is very well defined: If table ids w/ a certain state were seen in ZK before and after metadata scan then we must  see them in metadata OR something is really really wrong.   The less the GC trust that the persisted data it reads is correct and the more it tries to verify what it can about that data the more robust GC will be.   The GC already does a good bit of checks on the metadata table as it reads it.  Checking the metadata table against ZK will make the verification more thorough.
   
   In addition to the check outlined in #1377, another check was added in #2293 for the case where table ids were seen in metadata table and none were seen in ZK.   This situation can legitimately happen when the total number of tables goes to/from zero concurrently with a GC cycle.  This situation could also happen when ZK read is silently failing and returning no data.  There is no way to distinguish between the two causes.  If the system is actually working correctly, then the legitimate case would be transient.   We could defer this second check and open an issue to do it another PR where we try to figure out how to avoid the false positive or make the check less noisy at first (like always skip GC cycle but log info at first and warn/error on subsequent cycles if seen again).  IMO it would be good to leave this second check as it is in the current PR because it can always be removed in bug fix release.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] milleruntime commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
milleruntime commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r956173857


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -167,7 +172,20 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+        continue;

Review Comment:
   ```suggestion
   ```



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,29 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return true if this for the RootTable, usually setup in the constructor
+   *
+   * @return true if the GCE is for the RootTable
+   */
+  boolean isRootTable();
+
+  /**
+   * Return true if this for the MetadataTable, usually setup in the constructor
+   *
+   * @return true if the GCE is for the MetadataTable
+   */
+  boolean isMetadataTable();
+
+  /**
+   * Return a list of TableIDs for which we are considering deletes. For the root table this would
+   * be the metadata table. For the metadata table, this would be all of the other tables in the

Review Comment:
   ```suggestion
      * be the metadata table. For the metadata table, this would be the other tables in the
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] milleruntime commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
milleruntime commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1228664340

   I think all the logic is sound for this improvement but there is just some minor cleanup that could be done, most likely due to cherry picking and/or multiple contributors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969621782


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   ```suggestion
       if (level == ROOT) {
          return Set.of(RootTable.ID);
       } else if(level == METADATA){
         return Set.of(MetadataTable.ID);
       } else if (level == USER) {
   ```



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }

Review Comment:
   It would be nice to remove these if possible, seems like they are only used in test.  Could just make test override/implement the method to get table ids.  If it is needed for test I would suggest adding a getDataLevel() method.
   
   ```suggestion
   ```



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {

Review Comment:
   Using level in the name seems better to me.
   
   ```suggestion
     /**
      * @return the tables id for the current data level
      */
     public Set<TableId> getLevelTableIDs() {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r973226980


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids. When operating
+   * on DataLevel.METADATA this will return the table id for the accumulo.metadata table. When
+   * operating on DataLevel.ROOT this will return the table id for the accumulo.root table.
+   *
+   * @return The table ids
+   */
+  Set<TableId> getCandidateTableIDs();

Review Comment:
   Moved javadoc around as suggested in  1f8d37305789fa1540158e32b54da86a7af3ae87



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the

Review Comment:
   Addressed in 1f8d37305789fa1540158e32b54da86a7af3ae87



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1247044628

   Nevermind, I grabbed the [diff](https://github.com/keith-turner/accumulo/commit/78cf05d9af90bb7ba218a17cda6755df80127702.diff) and applied manually
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969605447


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -172,7 +177,19 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+      }
+    }

Review Comment:
   Maybe the new `getCandidateTableIDs()` method is not needed and this method could be modified.  Looking around at the code there is only one place this is used currently and I think this change would be good for that code.   The existing code that uses it may not be correct for metadata and root  levels (but I think luckily it may not matter).
   
   ```suggestion
       switch(level){
       case ROOT:
           return Set.of(RootTable.ID);
       case METADATA:
           return Set.of(MetadataTable.ID);
       case USER:
         final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
         final ZooReader zr = context.getZooReader();
         final Set<TableId> tids = new HashSet<>();
         while (true) {
           try {
             zr.sync(tablesPath);
             zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
             return tids;
           } catch (KeeperException | InterruptedException e) {
             log.error("Error getting tables from ZooKeeper, retrying", e);
             UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
           }
         }
       default:
          throw new IllegalArgumentException("unknown level "+level);
       }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r970062282


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {

Review Comment:
   Looking back at the 1.10 PR this method is called getCandidateTableIDs... so would probably want to keep the same method name in this branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972860972


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids. When operating
+   * on DataLevel.METADATA this will return the table id for the accumulo.metadata table. When
+   * operating on DataLevel.ROOT this will return the table id for the accumulo.root table.
+   *
+   * @return The table ids
+   */
+  Set<TableId> getCandidateTableIDs();

Review Comment:
   Maybe its confusing because it goes more with `GCRun.getCandidateTableIDs()`.  Could do the following to make it less confusing.
   
   * Move the current javadoc on the interface to `GCRun.getCandidateTableIDs()` and replace `when operating on` with `the current data level is`
   * Update the javadoc on this interface to make it generic like : 'Return a list of all TableIDs that should be seen in {@link #getReferences()} at the current time.  Immediately after this method returns the information it produced may be out of date relative to {@link #getReferences()} ... {@see GCRun#getCandidateTableIDs}`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969617710


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -172,7 +177,19 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+      }
+    }

Review Comment:
   Thinking it makes sense to have two methods for clarity though, and resolving this comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r910110142


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -457,4 +459,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {
+      Set<TableId> tableIds = new HashSet<>(getTableIDs());

Review Comment:
   do you think this is a better implementation of `getTableIDs()`: 
   ```
     @Override
     public Set<TableId> getTableIDs() {
       final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
       final ZooReader zr = context.getZooReader();
       final Set<TableId> tids = new HashSet<>();
       while (true) {
         try {
           zr.sync(tablesPath);
           zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
           return tids;
         } catch (KeeperException | InterruptedException e) {
           log.error("Error getting tables from ZooKeeper, retrying", e);
           UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
           continue;
         }
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972434828


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {

Review Comment:
   Elsewhere there was a comment about using table states.  If we had tables states from ZK, then something like the following would avoid race conditions.
   
   ```suggestion
     protected void ensureAllTablesChecked(Map<TableId,TableState> tableIdsAndStatesBefore, Set<TableId> tableIdsSeen,
         Map<TableId,TableState> tableIdsAndStatesAfter) {
         
         //TODO maybe copy maps before deleting from them
         //only consider table ids with states ONLINE and OFFLINE because entries must exist in the metadata table when a table is in these states.  Look at how table states are set relative to metadata updates in table creation and deletion for more details.
         tableIdsAndStatesBefore.values().removeIf(tabState - > tabState != ONLINE && tabState != OFFLINE);
         tableIdsAndStatesAfter.values().removeIf(tabState - > tabState != ONLINE && tabState != OFFLINE);
         
         var tableIdsBefore = tableIdsAndStatesBefore.keySet();
         var tableIdsAfter = tableIdsAndStatesAfter.keySet();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r973779278


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +502,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>();
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      getTableIDs().forEach((k, v) -> {
+        if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+          // Don't return tables that are NEW, DELETING, or in an
+          // UNKNOWN state.
+          tableIds.add(k);
+        }
+      });

Review Comment:
   ```suggestion
         getTableIDs().forEach((k, v) -> {
           if (v == TableState.ONLINE || v == TableState.OFFLINE) {
             // Don't return tables that are NEW, DELETING, or in an
             // UNKNOWN state.
             tableIds.add(k);
           }
         });
         tableIds.remove(MetadataTable.ID);
         tableIds.remove(RootTable.ID);
   
   ```



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   Not completely sure what getData will do if the node does not exist.  If it returns null then ignore the following.  If it throws a NNE then could do the following.
   
   ```suggestion
             byte[] state = null;
             try {
                state = zr.getData(statePath, null, null);
              } catch (NoNodeException e) {
                  // table was probably deleted after getting list of children... so let it end up as an UNKNOWN state
              }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ctubbsii commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972251517


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the

Review Comment:
   It doesn't return a list. It returns a set.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids. When operating
+   * on DataLevel.METADATA this will return the table id for the accumulo.metadata table. When
+   * operating on DataLevel.ROOT this will return the table id for the accumulo.root table.
+   *
+   * @return The table ids
+   */
+  Set<TableId> getCandidateTableIDs();

Review Comment:
   The term "operating on" used throughout this javadoc is very confusing. What does it mean to be "operating on" a particular level? I think we could choose a better verb to understand what's going on. Maybe "when gathering file candidates for deletion for DataLevel.USER, this will return a list of user table IDs".



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);
+      // maybe a scan failed?
+      throw new RuntimeException(
+          "Saw table IDs in ZK that were not in metadata table:  " + tableIdsMustHaveSeen);
+    }

Review Comment:
   Please don't introduce new code with generic RTE being thrown. A more specific RTE is always available and likely more appropriate.
   
   Also, this situation is likely to be common if creating and deleting a lot of tables. The canonical determination of whether a table exists or not is that it has an entry in ZK... this is created before metadata entries, and is the last thing removed when a table is deleted. So, it's quite normal and routine for a table ID not to exist in the metadata table if it was just created or deleted. This is not necessarily a problem and shouldn't throw an exception... or we're going to see this a lot. We can probably start over gathering candidates... but we shouldn't just crash.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,20 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>(getTableIDs());
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      return tableIds;
+    } else {
+      throw new RuntimeException("Unexpected Table in GC Env: " + this.level.name());

Review Comment:
   Choose more specific RTE



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -135,10 +137,14 @@ private SortedMap<String,String> makeRelative(Collection<String> candidates) {
 
   private void removeCandidatesInUse(GarbageCollectionEnvironment gce,
       SortedMap<String,String> candidateMap) {
+
+    Set<TableId> tableIdsBefore = gce.getCandidateTableIDs();
+    Set<TableId> tableIdsSeen = new HashSet<>();
     var refStream = gce.getReferences();
     Iterator<Reference> iter = refStream.iterator();
     while (iter.hasNext()) {
       Reference ref = iter.next();
+      tableIdsSeen.add(ref.getTableId());

Review Comment:
   This `var` obscures the type and variable name `refStream` implies it's a Stream. But, then we call `.iterator()` on it. If it's actually a Stream, then there are better ways to manipulate it than doing `while (iter.hasNext()) { iter.next(); }`.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -172,7 +177,19 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+      }

Review Comment:
   First, this shouldn't loop if it's explicitly interrupted. The process should probably die in that case. Second, if this does loop because of a KeeperException, the set of table IDs that was constructed prior to the `while (true)` loop could include deleted tables. Is that okay/intended? If so, a small inline comment explaining why that's okay might be useful for future onlookers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ctubbsii commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1248800663

   > > So, I'm not convinced this is the right approach.
   > 
   > What's the implication of this statement? Are you saying that we need to go back to the drawing board? If I implement your suggested changes, is this good to go?
   
   Unless the edge cases for table creation/deletion are addressed (like paying attention to table states, like Keith mentioned), then this should definitely not be merged as is. And, I wouldn't recommend its inclusion in any forked version of 1.10 either, because of the potential harmful side-effects I mentioned.
   
   If the table creation/deletion edge cases are addressed, I would have to re-review. I've only reviewed what's there, not what code changes have yet to be proposed.
   
   I'm not sure what "go[ing] back to the drawing board" would entail. It feels like we're shooting in the dark, adding a sanity check against a problem that is not well-defined.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r973227199


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -172,7 +177,19 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+      }

Review Comment:
   Addressed in 1f8d37305789fa1540158e32b54da86a7af3ae87



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975662704


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>();
+      getTableIDs().forEach((k, v) -> {
+        if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+          // Don't return tables that are NEW, DELETING, or in an
+          // UNKNOWN state.
+          tableIds.add(k);
+        }
+      });
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      return tableIds;

Review Comment:
   I think we have had this discussion before. I avoid streams on purpose.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974537032


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   From what I see, it says: `A KeeperException with error code KeeperException.NoNode will be thrown if no node with the given path exists.` This is a little different than what you have above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974662993


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   Updated in f2cee32fb6117515747872e0b310dde4060702e3 based on your latest information



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r957281677


##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan

Review Comment:
   Addressed in c4e0ab918bf002c83d4200af67df0578aff44e73



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {

Review Comment:
   Addressed in c4e0ab918bf002c83d4200af67df0578aff44e73



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972434828


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {

Review Comment:
   Elsewhere there was a comment about using table states.  If we had tables states from ZK, then something like the following would avoid race conditions.
   
   ```suggestion
     protected void ensureAllTablesChecked(Map<TableId,TableState> tableIdsAndStatesBefore, Set<TableId> tableIdsSeen,
         Map<TableId,TableState> tableIdsAndStatesAfter) {
         
         //TODO maybe copy maps before deleting from them
         //only consider table ids with states ONLINE and OFFLINE because entries must exist in the metadata table when a table is in these states.  Look at how table states are set relative to metadata updates in table creation and deletion for more details.  For the other table states, its expected that entries may or may not exist in the metadata table.
         tableIdsAndStatesBefore.values().removeIf(tabState - > tabState != ONLINE && tabState != OFFLINE);
         tableIdsAndStatesAfter.values().removeIf(tabState - > tabState != ONLINE && tabState != OFFLINE);
         
         var tableIdsBefore = tableIdsAndStatesBefore.keySet();
         var tableIdsAfter = tableIdsAndStatesAfter.keySet();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972434828


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {

Review Comment:
   Elsewhere there was a comment about using table states.  If we had tables states from ZK, then something like the following would avoid race conditions.
   
   ```suggestion
     protected void ensureAllTablesChecked(Map<TableId,TableState> tableIdsAndStatesBefore, Set<TableId> tableIdsSeen,
         Map<TableId,TableState> tableIdsAndStatesAfter) {
         
         //TODO maybe copy maps before deleting from them
         //only consider tables states ONLINE and OFFLINE because entries must exist in the metadata table when a table is in these states.  Look at how table states are set relative to metadata updates in table creation and deletion for more details.
         tableIdsAndStatesBefore.values().removeIf(tabState - > tabState != ONLINE && tabState != OFFLINE);
         tableIdsAndStatesAfter.values().removeIf(tabState - > tabState != ONLINE && tabState != OFFLINE);
         
         var tableIdsBefore = tableIdsAndStatesBefore.keySet();
         var tableIdsAfter = tableIdsAndStatesAfter.keySet();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ctubbsii commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975587422


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>();
+      getTableIDs().forEach((k, v) -> {
+        if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+          // Don't return tables that are NEW, DELETING, or in an
+          // UNKNOWN state.
+          tableIds.add(k);
+        }
+      });
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      return tableIds;

Review Comment:
   Instead of forEach, this should use a predicate on the stream and use a stream collector. Something like the following, which is shorter and cleaner. Using forEach to populate a collection constructed prior to the loop is rarely best practice.
   
   ```suggestion
       return getTableIDs().entrySet().stream()
         .filter(e -> e.getValue() == TableState.ONLINE || e.getValue() == TableState.OFFLINE)
         .map(Entry::getKey)
         .filter(k -> !k.equals(MetadataTable.ID) && !k.equals(RootTable.ID))
         .collect(Collectors.toSet());
   ```
   



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);

Review Comment:
   Why does one return a Set.of(), and the other return a Collections.singleton()? Seems unnecessarily inconsistent.
   
   ```suggestion
         return Set.of(RootTable.ID);
       } else if (level == DataLevel.METADATA) {
         return Set.of(MetadataTable.ID);
   ```



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -80,8 +95,10 @@ public Stream<Reference> getReferences() {
     }
 
     @Override
-    public Set<TableId> getTableIDs() {
-      return tableIds;
+    public Map<TableId,TableState> getTableIDs() {
+      HashMap<TableId,TableState> results = new HashMap<>();
+      tableIds.forEach((t) -> results.put(t, TableState.ONLINE));
+      return results;

Review Comment:
   Again, forEach to populate a collection constructed prior to the loop is rarely best:
   
   ```suggestion
         return tableIds.stream().collect(Collectors.toMap(id -> id, id -> TableState.ONLINE));
   ```



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>();
+      getTableIDs().forEach((k, v) -> {
+        if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+          // Don't return tables that are NEW, DELETING, or in an
+          // UNKNOWN state.
+          tableIds.add(k);
+        }
+      });
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      return tableIds;
+    } else {
+      throw new IllegalArgumentException("Unexpected Table in GC Env: " + this.level.name());

Review Comment:
   Unexpected level, not unexpected table. Also, since DataLevel is an enum, it might make more sense to use a switch statement, and have this case be the default case, rather than use a series of if else statements.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +180,44 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          try {
+            byte[] state = zr.getData(statePath, null, null);
+            if (state == null) {
+              tableState = TableState.UNKNOWN;
+            } else {
+              tableState = TableState.valueOf(new String(state, UTF_8));
+            }
+          } catch (NoNodeException e) {
+            tableState = TableState.UNKNOWN;
+          }
+          tids.put(tableId, tableState);
+        }
+        return tids;
+      } catch (KeeperException e) {
+        retries++;
+        if (ioe == null) {
+          ioe = new UncheckedIOException(new IOException("Error getting table ids from ZooKeeper"));

Review Comment:
   IOException seems a bit forced here. IllegalStateException might be more appropriate, and wouldn't require this nesting of objects.



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -122,6 +158,29 @@ public void incrementInUseStat(long i) {}
     public Iterator<Entry<String,Status>> getReplicationNeededIterator() {
       return filesToReplicate.entrySet().iterator();
     }
+
+    @Override
+    public Set<TableId> getCandidateTableIDs() {
+      if (level == Ample.DataLevel.ROOT) {
+        return Set.of(RootTable.ID);
+      } else if (level == Ample.DataLevel.METADATA) {
+        return Collections.singleton(MetadataTable.ID);
+      } else if (level == Ample.DataLevel.USER) {
+        Set<TableId> tableIds = new HashSet<>();
+        getTableIDs().forEach((k, v) -> {
+          if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+            // Don't return tables that are NEW, DELETING, or in an
+            // UNKNOWN state.
+            tableIds.add(k);
+          }
+        });
+        tableIds.remove(MetadataTable.ID);
+        tableIds.remove(RootTable.ID);
+        return tableIds;
+      } else {
+        throw new IllegalArgumentException("unknown level " + level);
+      }

Review Comment:
   Not going to rewrite this at the moment, but it should be improved similar to my previous comments.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +507,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or

Review Comment:
   I still think this wording is vague. What does it mean to be "operating on" a particular DataLevel?



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   This case shouldn't kill the garbage collector. This is a normal case when a user recently initializes a cluster. We specifically have the initial GC delay shorter (30s) than the cycle delay (5m), which makes this more likely to occur if the first table they create is around the 30 second mark when the first GC cycle is starting to run. We don't want to make it more likely that first-time users are going to see the GC crash just because they created their first table in Accumulo!
   
   We should be able to skip this cycle with a warning rather than killing the GC with a RTE.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);
+      // maybe a scan failed?
+      throw new IllegalStateException(
+          "Saw table IDs in ZK that were not in metadata table:  " + tableIdsMustHaveSeen);

Review Comment:
   I feel like we should either log or throw, but not both. If we throw, then we can rely on the caller to log the message. If we log, and throw, then we're probably getting duplicate error messages in the logs, that are not obviously the same problem.



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);

Review Comment:
   One formatted error message should be enough. There's no need for 4 separate log entries. Having 4 will age out other recently seen errors on the monitor page that shows recent logs faster than necessary, and in a multi-threaded system, these can be split up by logs from other threads, making it harder to parse and display these using automated tooling.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974133668


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   This should get picked up by the try-catch block that is already there, and then will retry. Do you think that is not sufficient?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] milleruntime commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
milleruntime commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r909832662


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -457,4 +459,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {
+      Set<TableId> tableIds = new HashSet<>(getTableIDs());

Review Comment:
   This is OK but just calling `getTableIDs()` is probably not enough on a highly active cluster. This just calls the API and gets whatever is cached at the time. 
   
   @ctubbsii comment from the original PR: "It can't detect tables that existed at the start but skipped, if the ZooCache wasn't up-to-date, and it can't detect skipping over any tables that were created or deleted during the scan. This is especially problematic if a new table was created as the result of a clone operation, which will duplicate all the file references for the new table."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969615888


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -172,7 +177,19 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+      }
+    }

Review Comment:
   > The existing code that uses it may not be correct for metadata and root levels 
   
   Actually the existing code that uses it is probably correct if the current impl returns all table ids including root and metadata tables which I think it would as long as those are in ZK.  Limiting it to returning just the table ids for the current level should not harm the current usage. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1254852766

   Full IT passed except for the test failure identified in #2948


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977689648


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   > It would be good to run through a situation like the above a few times and see what happens, I will give that a try.
   
   @keith-turner - did you want me to wait to merge for this testing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ctubbsii commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977751451


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   While creating the first table might not trigger this because of the lack of gc candidate files, it's still true that users are likely creating tables soon after initialization, and my point is that going from zero (user) tables to non-zero is more likely to occur soon after initialization than at other times, even if that initial activity would need to be more involved in order to trigger this (such as creating a test table, flushing, deleting, then creating the first table that the first GC run sees).
   
   However, if this doesn't crash the GC, but merely triggers the current cycle to be skipped, then my concern goes away.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977751497


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   ok, great. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969617710


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -172,7 +177,19 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+      }
+    }

Review Comment:
   Maybe it makes sense to have two methods for clarity though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969832993


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   I missed that javadoc, looking at it now it would need to be adjusted.  There are three data levels and we run GC for each level.  They are created here
   
   https://github.com/apache/accumulo/blob/1aeffcf0dca3e9797d3bfd75dd05b821885b1dd4/server/gc/src/main/java/org/apache/accumulo/gc/SimpleGarbageCollector.java#L207-L209
   
   So for each level we need to return the table IDs that would be seen in the metadata for that level.  Like the USER level stores its metadata in the metadata table, but we see user table ids in the metadata for that level.  For the METADATA level its stores its metadata in the root table and we see the metadata table id.  For the ROOT level it stores its metadata in ZK and we see the root table id in the metadata for that level.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r970593464


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {

Review Comment:
   I looked how the references are obtained and I think the suggestion is correct.  In GcRun when the DataLevel is USER it will read references from the metadata table which contains user table ids.  When the DataLevel is METADATA it will read references from the root table which will contain metadata table id.  When the DataLevel is ROOT it will read the single tablet metadata entry for the root table from ZK (this is new after 1.x) which will contain root table id.
   
   In 1.1 things operated a bit differently, there were two metadata tables to read from.  In 1.1 we would look at the table being read from an infer what kinds of table ids it contained.  Now there are two metadata tables and a single tablet metadata entry in ZK.  DataLevel is used  to specify to ample which type of metadata to read for these three data levels.
   
   Unrelated to this PR, looking at the GCRun ref code I would like to clean it up to remove the conditional on DataLevel.Root and just pass DataLevel to ample.  I think that is an indication of the a problem in the internal ample API.
   
   I made suggestion to update the javadoc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1247029109

   @keith-turner  - I like it. Can you create a PR against my branch and I will merge it in?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974622546


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   I think Zookeeper will create the more specific versions of the exception based on the error code.   So you can catch those.  Looked at the ZK source a bit and found the following.
   
   https://github.com/apache/zookeeper/blob/9b6ec9060604347fc996fe69dc33a21222fbd0c4/zookeeper-server/src/main/java/org/apache/zookeeper/ZooKeeper.java#L2028
   
   
   https://github.com/apache/zookeeper/blob/9b6ec9060604347fc996fe69dc33a21222fbd0c4/zookeeper-server/src/main/java/org/apache/zookeeper/KeeperException.java#L93



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974051291


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -171,8 +179,40 @@ public Stream<Reference> getReferences() {
   }
 
   @Override
-  public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+  public Map<TableId,TableState> getTableIDs() throws InterruptedException {
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    int retries = 1;
+    UncheckedIOException ioe = null;
+    while (retries <= 10) {
+      try {
+        zr.sync(tablesPath);
+        final Map<TableId,TableState> tids = new HashMap<>();
+        for (String table : zr.getChildren(tablesPath)) {
+          TableId tableId = TableId.of(table);
+          TableState tableState = null;
+          String statePath = context.getZooKeeperRoot() + Constants.ZTABLES + "/"
+              + tableId.canonical() + Constants.ZTABLE_STATE;
+          byte[] state = zr.getData(statePath, null, null);

Review Comment:
   I looked at the Accumulo code in ZooReader and the Zookeeper code its calling.  It seems like a KeeperException.NoNode will be thrown when the data node does not exists.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] milleruntime commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
milleruntime commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r924447272


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -457,4 +459,28 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public boolean isRootTable() {
+    return level == DataLevel.ROOT;
+  }
+
+  @Override
+  public boolean isMetadataTable() {
+    return level == DataLevel.METADATA;
+  }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (isRootTable()) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (isMetadataTable()) {
+      Set<TableId> tableIds = new HashSet<>(getTableIDs());

Review Comment:
   Yeah, we need something like this. There is a retryLoop() method in `ZooReader` that I think could be used here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975761015


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);
+      // maybe a scan failed?
+      throw new IllegalStateException(
+          "Saw table IDs in ZK that were not in metadata table:  " + tableIdsMustHaveSeen);

Review Comment:
   Addressed in 0291190a8296327dd26af96b1299580fc29d54ad



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }
+
+    // From that intersection, remove all the table ids that were seen.
+    tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+    // If anything is left then we missed a table and may not have removed rfiles references
+    // from the candidates list that are acutally still in use, which would
+    // result in the rfiles being deleted in the next step of the GC process
+    if (!tableIdsMustHaveSeen.isEmpty()) {
+      log.error("TableIDs before: " + tableIdsBefore);
+      log.error("TableIDs after : " + tableIdsAfter);
+      log.error("TableIDs seen  : " + tableIdsSeen);
+      log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);

Review Comment:
   Addressed in 0291190a8296327dd26af96b1299580fc29d54ad



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972860972


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids. When operating
+   * on DataLevel.METADATA this will return the table id for the accumulo.metadata table. When
+   * operating on DataLevel.ROOT this will return the table id for the accumulo.root table.
+   *
+   * @return The table ids
+   */
+  Set<TableId> getCandidateTableIDs();

Review Comment:
   Maybe its confusing because it goes more with `GCRun.getCandidateTableIDs()`.  Could do the following to make it less confusing.
   
   * Move the current javadoc on the interface to `GCRun.getCandidateTableIDs()` and replace `when operating on` with `the current data level is`
   * Update the javadoc on this interface to make it generic like : `Return a set of all TableIDs that should be seen in {@link #getReferences()} at the current time.  Immediately after this method returns the information it produced may be out of date relative to {@link #getReferences()} ... {@see GCRun#getCandidateTableIDs}`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1170045020

   This is a cherry-pick of @mjwall 's GC fix on the 1.10 branch applied to 2.1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977716754


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   > @keith-turner - did you want me to wait to merge for this testing?
   
   No, if I were to find anything it could be handled in a new PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r980267845


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   I ran the following little script for a bit.  Let it create/delete around 80 tables.  Looked in the GC log and did not see anything.
   
   ```bash
   #!/bin/bash
   
   
   while true
   do
   
     cat << EOF | accumulo shell
   createtable foo
   insert r1 f1 q1 vi
   compact -w
   scan -t accumulo.metadata -np
   deletetable -f foo
   EOF
     sleep 10
   done
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion merged pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion merged PR #2792:
URL: https://github.com/apache/accumulo/pull/2792


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r973225860


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +479,20 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  @Override
+  public Set<TableId> getCandidateTableIDs() {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>(getTableIDs());
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      return tableIds;
+    } else {
+      throw new RuntimeException("Unexpected Table in GC Env: " + this.level.name());

Review Comment:
   Addressed in 1f8d37305789fa1540158e32b54da86a7af3ae87



##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -135,10 +137,14 @@ private SortedMap<String,String> makeRelative(Collection<String> candidates) {
 
   private void removeCandidatesInUse(GarbageCollectionEnvironment gce,
       SortedMap<String,String> candidateMap) {
+
+    Set<TableId> tableIdsBefore = gce.getCandidateTableIDs();
+    Set<TableId> tableIdsSeen = new HashSet<>();
     var refStream = gce.getReferences();
     Iterator<Reference> iter = refStream.iterator();
     while (iter.hasNext()) {
       Reference ref = iter.next();
+      tableIdsSeen.add(ref.getTableId());

Review Comment:
   Addressed in 1f8d37305789fa1540158e32b54da86a7af3ae87



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r975734250


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen Zookeeper. "
+          + "This can have two causes. First, total number of tables going to/from "
+          + "zero during a GC cycle will cause this. Second, it could be caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+          + "is problematic, but there is no way to distinguish between the two causes "
+          + "so this GC cycle will not proceed. The first cause should be transient "
+          + "and one would not expect to see this message repeated in subsequent GC cycles.");
+    }

Review Comment:
   It doesn't kill the GC process. The exception bubbles up to SimpleGarbageCollector and gets logged. SimpleGarbageCollector is in a loop, so it will retry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r974140440


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -462,4 +502,39 @@ public long getErrorsStat() {
   public long getCandidatesStat() {
     return candidates;
   }
+
+  /**
+   * Return a set of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids in an ONLINE or
+   * OFFLINE state. When operating on DataLevel.METADATA this will return the table id for the
+   * accumulo.metadata table. When operating on DataLevel.ROOT this will return the table id for the
+   * accumulo.root table.
+   *
+   * @return The table ids
+   * @throws InterruptedException
+   *           if interrupted when calling ZooKeeper
+   */
+  @Override
+  public Set<TableId> getCandidateTableIDs() throws InterruptedException {
+    if (level == DataLevel.ROOT) {
+      return Set.of(RootTable.ID);
+    } else if (level == DataLevel.METADATA) {
+      return Collections.singleton(MetadataTable.ID);
+    } else if (level == DataLevel.USER) {
+      Set<TableId> tableIds = new HashSet<>();
+      tableIds.remove(MetadataTable.ID);
+      tableIds.remove(RootTable.ID);
+      getTableIDs().forEach((k, v) -> {
+        if (v == TableState.ONLINE || v == TableState.OFFLINE) {
+          // Don't return tables that are NEW, DELETING, or in an
+          // UNKNOWN state.
+          tableIds.add(k);
+        }
+      });

Review Comment:
   Fixed in cdaf408



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972860972


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids. When operating
+   * on DataLevel.METADATA this will return the table id for the accumulo.metadata table. When
+   * operating on DataLevel.ROOT this will return the table id for the accumulo.root table.
+   *
+   * @return The table ids
+   */
+  Set<TableId> getCandidateTableIDs();

Review Comment:
   Maybe its confusing because it goes more with `GCRun.getCandidateTableIDs()`.  Could do the following to make it less confusing.
   
   * Move the current javadoc on the interface to `GCRun.getCandidateTableIDs()` and replace `when operating on` with `when the current data level is`
   * Update the javadoc on this interface to make it generic like : `Return a set of all TableIDs that should be seen in {@link #getReferences()} at the current time.  Immediately after this method returns the information it produced may be out of date relative to {@link #getReferences()} ... {@see GCRun#getCandidateTableIDs}`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972865275


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {

Review Comment:
   After looking at the javadoc that @ctubbsii  said was confusing and trying to improve it, i think the filtering on table state should be done in `getCandidateTableIDs()`.  So `getCandidateTableIDs()` only returns table ids where the state is ONLINE or OFFLINE in its impl.  Then ensureAllTablesChecked would continue to take sets and stay the same, not taking the maps as I suggested earlier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972860972


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,17 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of all TableIDs in the
+   * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
+   * deletes. When operating on DataLevel.USER this will return all user table ids. When operating
+   * on DataLevel.METADATA this will return the table id for the accumulo.metadata table. When
+   * operating on DataLevel.ROOT this will return the table id for the accumulo.root table.
+   *
+   * @return The table ids
+   */
+  Set<TableId> getCandidateTableIDs();

Review Comment:
   Maybe its confusing because it goes more with `GCRun.getCandidateTableIDs()`.  Could do the following to make it less confusing.
   
   * Move the current javadoc on this interface to `GCRun.getCandidateTableIDs()` and replace `when operating on` with `when the current data level is` when moving
   * Update the javadoc on this interface to make it generic like : `Return a set of all TableIDs that should be seen in {@link #getReferences()} at the current time.  Immediately after this method returns the information it produced may be out of date relative to {@link #getReferences()} ... {@see GCRun#getCandidateTableIDs}`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r957281922


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   *
+   */

Review Comment:
   Addressed in c4e0ab918bf002c83d4200af67df0578aff44e73



##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan
+    gce.tableIds.addAll(Arrays.asList(TableId.of("b"), TableId.of("d")));
+
+    String msg = assertThrows(RuntimeException.class, () -> gca.collect(gce)).getMessage();
+    assertTrue((msg.contains("[b, d]") || msg.contains("[d, b]"))
+        && msg.contains("Saw table IDs in ZK that were not in metadata table:"), msg);
+  }
+
+  // below are tests for potential failure conditions of the GC process. Some of these cases were
+  // observed on clusters. Some were hypothesis based on observations. The result was that
+  // candidate entries were not removed when they should have been and therefore files were
+  // removed from HDFS that were actually still in use
+
+  private Set<TableId> makeUnmodifiableSet(String... args) {
+    Set<TableId> t = new HashSet<>();
+    for (String arg : args) {
+      t.add(TableId.of(arg));
+    }
+    return Collections.unmodifiableSet(t);
+  }
+
+  @Test
+  public void testNormalGCRun() {
+    // happy path, no tables added or removed during this portion and all the tables checked
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddle() {
+    // table was added during this portion and we don't see it, should be fine
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableAddedInMiddleTwo() {
+    // table was added during this portion and we DO see it
+    // Means table was added after candidates were grabbed, so there should be nothing to remove
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "3", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "3", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddle() {
+    // table was deleted during this portion and we don't see it
+    // this mean any candidates from the deleted table wil stay on the candidate list
+    // and during the delete step they will try to removed
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testTableDeletedInMiddleTwo() {
+    // table was deleted during this portion and we DO see it
+    // this mean candidates from the deleted table may get removed from the candidate list
+    // which should be ok, as the delete table function should be responsible for removing those
+    Set<TableId> tablesBefore = makeUnmodifiableSet("1", "2", "3", "4");
+    Set<TableId> tablesSeen = makeUnmodifiableSet("2", "1", "4", "3");
+    Set<TableId> tablesAfter = makeUnmodifiableSet("1", "2", "4");
+
+    new GarbageCollectionAlgorithm().ensureAllTablesChecked(tablesBefore, tablesSeen, tablesAfter);
+  }
+
+  @Test
+  public void testMissEntireTable() {
+    // this test simulates missing an entire table when looking for what files are in use
+    // if you add custom splits to the metadata at able boundaries, this can happen with a failed
+    // scan
+    // recall the ~tab:~pr for this first entry of a new table is empty, so there is now way to
+    // check the prior row. If you split a couple of tables in the metadata the table boundary
+    // , say table ids 2,3,4, and then miss scanning table 3 but get 4, it is possible other
+    // consistency checks will miss this

Review Comment:
   Addressed in c4e0ab918bf002c83d4200af67df0578aff44e73



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r957280963


##########
server/gc/src/test/java/org/apache/accumulo/gc/GarbageCollectionTest.java:
##########
@@ -778,4 +838,152 @@ public void bulkImportReplicationRecordsPreventDeletion() throws Exception {
     assertEquals(1, gce.deletes.size());
     assertEquals("hdfs://foo.com:6000/accumulo/tables/2/t-00002/A000002.rf", gce.deletes.get(0));
   }
+
+  @Test
+  public void testMissingTableIds() throws Exception {
+    GarbageCollectionAlgorithm gca = new GarbageCollectionAlgorithm();
+
+    TestGCE gce = new TestGCE();
+
+    gce.candidates.add("hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+
+    gce.addFileReference("a", null, "hdfs://foo.com:6000/user/foo/tables/a/t-0/F00.rf");
+    gce.addFileReference("c", null, "hdfs://foo.com:6000/user/foo/tables/c/t-0/F00.rf");
+
+    // add 2 more table references that will not be seen by in the scan
+    gce.tableIds.addAll(Arrays.asList(TableId.of("b"), TableId.of("d")));

Review Comment:
   Addressed in c4e0ab918bf002c83d4200af67df0578aff44e73



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
dlmarion commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r970716893


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,14 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of TableIDs for which we are considering deletes. For the root table this would
+   * be the metadata table. For the metadata table, this would be the other tables in the system.

Review Comment:
   Implemented in a61969020f7ca916465811bac22b52874c3f02cd



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#issuecomment-1246898102

   @dlmarion  I made some updatesto the test  in this commit keith-turner/accumulo@78cf05d9af90bb7ba218a17cda6755df80127702 to use DataLevel and to not have to add the metadata refs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r970577888


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionEnvironment.java:
##########
@@ -72,6 +72,14 @@ public interface GarbageCollectionEnvironment {
    */
   Stream<Reference> getReferences();
 
+  /**
+   * Return a list of TableIDs for which we are considering deletes. For the root table this would
+   * be the metadata table. For the metadata table, this would be the other tables in the system.

Review Comment:
   ```suggestion
      * Return a list of all TableIDs in the
      * {@link org.apache.accumulo.core.metadata.schema.Ample.DataLevel} for which we are considering
      * deletes. When operating on DataLevel.USER this will return all user table ids. When operating
      * on DataLevel.METADATA this will return the table id for the accumulo.metadata table. When
      * operating on DataLevel.ROOT this will return the table id for the accumulo.root table.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #2792: closes #1377 - ensure all tables are checked ...

Posted by GitBox <gi...@apache.org>.
keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r969605447


##########
server/gc/src/main/java/org/apache/accumulo/gc/GCRun.java:
##########
@@ -172,7 +177,19 @@ public Stream<Reference> getReferences() {
 
   @Override
   public Set<TableId> getTableIDs() {
-    return context.getTableIdToNameMap().keySet();
+    final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
+    final ZooReader zr = context.getZooReader();
+    final Set<TableId> tids = new HashSet<>();
+    while (true) {
+      try {
+        zr.sync(tablesPath);
+        zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
+        return tids;
+      } catch (KeeperException | InterruptedException e) {
+        log.error("Error getting tables from ZooKeeper, retrying", e);
+        UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
+      }
+    }

Review Comment:
   Maybe the new `getCandidateTableIDs()` method is not needed and this method could be modified.  Looking around at the code there is only one place this is used currently and I think this change would be good for that code.   The existing code that uses it may not be correct for metadata and root  levels (but I think luckily it may not matter).
   
   ```suggestion
       switch(level){
       case ROOT:
           return Set.of(RootTable.ID);
       case METADATA:
           return Set.of(MetadataTable.ID);
       case USER:
         final String tablesPath = context.getZooKeeperRoot() + Constants.ZTABLES;
         final ZooReader zr = context.getZooReader();
         final Set<TableId> tids = new HashSet<>();
         while (true) {
           try {
             zr.sync(tablesPath);
             zr.getChildren(tablesPath).forEach(t -> tids.add(TableId.of(t)));
             tids.remove(MetadataTable.ID);
             tids.remove(RootTable.ID);
             return tids;
           } catch (KeeperException | InterruptedException e) {
             log.error("Error getting tables from ZooKeeper, retrying", e);
             UtilWaitThread.sleepUninterruptibly(1, TimeUnit.SECONDS);
           }
         }
       default:
          throw new IllegalArgumentException("unknown level "+level);
       }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org