You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by "dlmarion (via GitHub)" <gi...@apache.org> on 2023/08/03 12:52:49 UTC

[GitHub] [accumulo] dlmarion opened a new pull request, #3677: Attempt to prevent half-closed Tablet due to failing minc

dlmarion opened a new pull request, #3677:
URL: https://github.com/apache/accumulo/pull/3677

   Prior to this change Tablet.initiateClose would close the compactable, wait for the current minor compaction to finish, and then kick off another minor compaction. In the case where minor compactions are failing, the compactable will get closed and the call to wait for the current minor compaction to finish will hang indefinitely leaving the Tablet in a half-closed state. Tablet.initiateClose is called when either the TabletServer is closing the Tablet (due to migration or shutdown) or on a Tablet split. Therefore, a failing minor compaction will prevent Tablet migration and split or normal TabletServer shutdown.
   
   This change adds some logic at the start of Tablet.initiateClose to try and detect a currently or previously failing minor compaction. In these cases an exception is thrown before the compactable is closed to prevent the Tablet from being in a half-closed state. This prevents the Tablet from being closed until the cause for the failing minor compaction is corrected. Causes could be a bad iterator applied at minor compaction or a bad table configuration option (like classloader context). When the cause of the failing minor compaction is corrected, then the Tablet should be able to be closed normally allowing normal migration, split, and TabletServer shutdown.
   
   Fixes #3674


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1664229833

   Kicked off full IT build


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1284404562


##########
server/base/src/main/java/org/apache/accumulo/server/conf/TableConfiguration.java:
##########
@@ -178,6 +178,14 @@ private static CompactionDispatcher createCompactionDispatcher(AccumuloConfigura
     CompactionDispatcher newDispatcher = Property.createTableInstanceFromPropertyName(conf,
         Property.TABLE_COMPACTION_DISPATCHER, CompactionDispatcher.class, null);
 
+    if (newDispatcher == null) {
+      // return early to prevent NPE
+      log.error(

Review Comment:
   GIven the changes in #3678 , this message should hopefully not appear that much anymore. I think it's fine to leave this at ERROR in this case because that would mean a failure of the check in #3678 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1283497937


##########
server/base/src/main/java/org/apache/accumulo/server/conf/TableConfiguration.java:
##########
@@ -178,6 +178,14 @@ private static CompactionDispatcher createCompactionDispatcher(AccumuloConfigura
     CompactionDispatcher newDispatcher = Property.createTableInstanceFromPropertyName(conf,
         Property.TABLE_COMPACTION_DISPATCHER, CompactionDispatcher.class, null);
 
+    if (newDispatcher == null) {
+      // return early to prevent NPE
+      log.error(

Review Comment:
   In the case where an invalid classloader context is set, then `Property.createTableInstanceFromPropertyName` cannot even return the *default* CompactionDispatcher specified in `Property.TABLE_COMPACTION_DISPATCHER`. This being null is a symptom of a misconfiguration of the context class loader. I think it's correct to point out the error here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1284395630


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +903,47 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then thrown
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now.
+          if (!initiateMinorCompaction(MinorCompactionReason.CLOSE)) {
+            String msg = "Unable to initiate minc for close on " + extent
+                + ". Tablet might be closed or deleting.";
+            log.warn(msg);
+            throw new RuntimeException(msg);
+          }
+          // Not calling getTabletMemory().waitForMinC(); as it will wait
+          // indefinitely for a successful minor compaction. It's possible
+          // that won't happen. We just want to know if the current minc
+          // fails.
+          while (isMinorCompactionRunning()) {
+            UtilWaitThread.sleepUninterruptibly(50, TimeUnit.MILLISECONDS);

Review Comment:
   This was removed in 925e26e



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291714696


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // remove the bad iterator. The failing minc task is in a backoff retry loop
+      // and should pick up this change on the next try
+      c.tableOperations().removeIterator(tableName, BadIterator.class.getSimpleName(),
+          EnumSet.of(IteratorScope.minc));
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+    }
+  }
+
+  public static void setInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,
+      TableId tableId) {
+    TablePropKey key = TablePropKey.of(context, tableId);
+    context.getPropStore().putAll(key,
+        Map.of(Property.TABLE_CLASSLOADER_CONTEXT.getKey(), "invalid"));
+  }
+
+  public static void removeInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,

Review Comment:
   This comment is just a general question. Wondering if we need this method, it validation performed when a prop is deleted that prevents removal of a bad context config?  If so it seems like that could be a general problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1675394663

   > It would be good in the future to add some metrics or something that can be checked in the test to confirm that failures do happen on the minc before the configuration is corrected.
   
   Wondering if a metric that counts failed minor compactions would be useful.  Should normally be zero.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion merged pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion merged PR #3677:
URL: https://github.com/apache/accumulo/pull/3677


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1284393844


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +903,47 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then thrown
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now.

Review Comment:
   This was reworked in 925e26e



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1284394400


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +903,47 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then thrown

Review Comment:
   This was fixed in 925e26e



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1283499099


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java:
##########
@@ -1473,6 +1473,13 @@ public CompactionServiceId getConfiguredService(CompactionKind kind) {
     try {
       var dispatcher = tablet.getTableConfiguration().getCompactionDispatcher();
 
+      if (dispatcher == null) {
+        log.error(

Review Comment:
   Again, this points to an error in the classloader context configuration.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dtspence commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dtspence (via GitHub)" <gi...@apache.org>.
dtspence commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1664516280

   @dlmarion We believe we found a code path that produces a test timeout. It appeared to be related to a minc thread exiting due to not being able to obtain a mapfile (i.e. volume chooser seeing invalid context). The code-path in the integration test saw an error within the `MinorCompactor` and logged a retrying minc operation (possibly leaving minc thread active).
   
   The calling path that causes the thread to end (i.e. context error w/volume-chooser) logged the tablets were unable to unload.
   
   The logs report the following (w/minc thread exit path) when attempting to shutdown:
   ```
   2023-08-03T16:03:32,503 [tserver.UnloadTabletHandler] INFO : Tablet unload for extent 1<< requested.
   2023-08-03T16:03:32,503 [tablet.Tablet] DEBUG: 1<< closeState OPEN tabletMemory.memoryReservedForMinC() true tabletMemory.getMemTable().getNumEntries() 0 updatingFlushID false
   2023-08-03T16:03:32,503 [tablet.Tablet] WARN : Unable to initiate minc for close on 1<<. Tablet might be closed or deleting.
   2023-08-03T16:03:32,503 [tserver.UnloadTabletHandler] ERROR: Failed to close tablet 1<<... Aborting migration
   java.lang.RuntimeException: Unable to initiate minc for close on 1<<. Tablet might be closed or deleting.
           at org.apache.accumulo.tserver.tablet.Tablet.initiateClose(Tablet.java:925) ~[classes/:?]
           at org.apache.accumulo.tserver.tablet.Tablet.close(Tablet.java:898) ~[classes/:?]
           at org.apache.accumulo.tserver.UnloadTabletHandler.run(UnloadTabletHandler.java:92) [classes/:?]
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) [classes/:?]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) [classes/:?]
           at java.lang.Thread.run(Thread.java:829) [?:?]
   ```
   
   To replicate, the following is needed as base configuration:
   
   ```java
       @Override
       public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
         cfg.setNumTservers(1);
         cfg.setProperty("general.volume.chooser", "org.apache.accumulo.core.spi.fs.DelegatingChooser");
         cfg.setProperty("general.custom.volume.chooser.default",
                 "org.apache.accumulo.core.spi.fs.PreferredVolumeChooser");
         cfg.setProperty("general.custom.volume.preferred.default", "file:/home/dtspen2/dev/git/dlmarion/accumulo/test/target/mini-tests/org.apache.accumulo.test.functional.HalfClosedTabletIT_SharedMiniClusterBase/accumulo");
       }
   ```
   
   Then a `flush()` operation which caused the minc thread to end:
   ```java
         tops.setProperty(tableName, Property.TABLE_CLASSLOADER_CONTEXT.getKey(), "invalid");
   
         tops.flush(tableName);
   
         Thread.sleep(500);
   
         // This should fail to split, but not leave the tablets in a state where they can't
         // be unloaded
         assertThrows(AccumuloServerException.class,
             () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));
   
         tops.removeProperty(tableName, Property.TABLE_CLASSLOADER_CONTEXT.getKey());
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289962566


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletClientHandler.java:
##########
@@ -1009,6 +1009,9 @@ public void splitTablet(TInfo tinfo, TCredentials credentials, TKeyExtent tkeyEx
       } catch (IOException e) {
         log.warn("Failed to split " + keyExtent, e);
         throw new RuntimeException(e);
+      } catch (RuntimeException re) {
+        log.warn("Failed to split " + keyExtent, re);

Review Comment:
   Yes, otherwise there is nothing in the log on the server side IIRC. The RuntimeException gets passed back to the client, but when that happens most of the detail is lost.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1290426667


##########
server/base/src/main/java/org/apache/accumulo/server/conf/TableConfiguration.java:
##########
@@ -178,6 +178,14 @@ private static CompactionDispatcher createCompactionDispatcher(AccumuloConfigura
     CompactionDispatcher newDispatcher = Property.createTableInstanceFromPropertyName(conf,
         Property.TABLE_COMPACTION_DISPATCHER, CompactionDispatcher.class, null);
 
+    if (newDispatcher == null) {
+      // return early to prevent NPE
+      log.error(

Review Comment:
   Changed log level of the ConfigurationTypeHelper.getClassInstance message in 31260e0



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291725916


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // remove the bad iterator. The failing minc task is in a backoff retry loop
+      // and should pick up this change on the next try
+      c.tableOperations().removeIterator(tableName, BadIterator.class.getSimpleName(),
+          EnumSet.of(IteratorScope.minc));
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+    }
+  }
+
+  public static void setInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,
+      TableId tableId) {
+    TablePropKey key = TablePropKey.of(context, tableId);
+    context.getPropStore().putAll(key,
+        Map.of(Property.TABLE_CLASSLOADER_CONTEXT.getKey(), "invalid"));
+  }
+
+  public static void removeInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,
+      TableId tableId) {
+    TablePropKey key = TablePropKey.of(context, tableId);
+    context.getPropStore().removeProperties(key,
+        List.of(Property.TABLE_CLASSLOADER_CONTEXT.getKey()));
+  }

Review Comment:
   I tried replacing this method locally w/ the following and changing the places where it was called and everything worked.
   
   ```suggestion
     public static void removeInvalidClassLoaderContextProperty(AccumuloClient client,
         String tableName) {
       try {
         client.tableOperations().removeProperty(tableName, Property.TABLE_CLASSLOADER_CONTEXT.getKey());
       } catch (AccumuloException | AccumuloSecurityException e) {
         throw new RuntimeException(e);
       }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ctubbsii commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1286666547


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTablet2IT.java:
##########
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import java.io.File;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.spi.fs.DelegatingChooser;
+import org.apache.accumulo.core.spi.fs.PreferredVolumeChooser;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+//This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+//where a failing minor compaction causes a Tablet.initiateClose to leave the
+//Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+//the TabletServer cannot be shutdown normally. Because the minor compaction has
+//been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//The test in this class sets up a VolumeChooser configuration on MiniAccumuloCluster
+//to test a specific code path that caused the minor compaction thread to die.

Review Comment:
   My suggestion to make this description slightly more concise, and adding spaces after the double-slashes to conform to our normal comment style.
   
   ```suggestion
   // This covers issues like that reported in https://github.com/apache/accumulo/issues/3674
   // where a failing minor compaction leaves the Tablet in a half-closed state that prevents it
   // from unloading and the TServer from shutting down normally.
   // This test recreates that scenario by setting an invalid context and verifies that the
   // tablet can recover and unload after the context is set correctly.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291739396


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}

Review Comment:
   I updated the java comment in my first comment on this thread



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1675400077

   > Yeah, that's what I'm thinking.
   
   Ok, that is a good idea.  Are you going to open an issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289194367


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java:
##########
@@ -1473,6 +1473,13 @@ public CompactionServiceId getConfiguredService(CompactionKind kind) {
     try {
       var dispatcher = tablet.getTableConfiguration().getCompactionDispatcher();
 
+      if (dispatcher == null) {
+        log.error(
+            "Failed to dispatch compaction {} kind:{} hints:{}, falling back to {} service. Check server log.",

Review Comment:
   ```suggestion
               "Failed to dispatch compaction {} kind:{} hints:{}, falling back to {} service. Unable to instantiate dispatcher plugin. Check server log.",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1285070313


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTablet2IT.java:
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import java.io.File;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.spi.fs.DelegatingChooser;
+import org.apache.accumulo.core.spi.fs.PreferredVolumeChooser;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+public class HalfClosedTablet2IT extends SharedMiniClusterBase {

Review Comment:
   comments added in 05cd496



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291730070


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);

Review Comment:
   I opted for a larger time because sometimes tests run at different speeds on different machines. Lowering could cause intermittent failures.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291727885


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);

Review Comment:
   I think it would be ok to lower this time
   
   ```suggestion
         UtilWaitThread.sleepUninterruptibly(5, TimeUnit.SECONDS);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291728469


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}

Review Comment:
   I think it would ok to lower this time and simplify it a bit.
   
   ```suggestion
         // The offine operation should not be able to complete because the tablet can not minor compact, give the offline operation a bit of time to attempt to complete even though it should never be able to complete.
         UtilWaitThread.sleepUninterruptibly(5, TimeUnit.SECONDS);
   
         assertTrue(countHostedTablets(c, tid) > 0);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ivakegg commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1283430858


##########
server/base/src/main/java/org/apache/accumulo/server/conf/TableConfiguration.java:
##########
@@ -178,6 +178,14 @@ private static CompactionDispatcher createCompactionDispatcher(AccumuloConfigura
     CompactionDispatcher newDispatcher = Property.createTableInstanceFromPropertyName(conf,
         Property.TABLE_COMPACTION_DISPATCHER, CompactionDispatcher.class, null);
 
+    if (newDispatcher == null) {
+      // return early to prevent NPE
+      log.error(

Review Comment:
   The NPE was one of the things spamming our logs when the context was incorrect.  This at least avoid having a stack trace being sent with each one.  Thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289975185


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.clientImpl.AccumuloServerException;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      Thread.sleep(500);
+
+      // This should fail to split, but not leave the tablets in a state where they can't
+      // be unloaded
+      assertThrows(AccumuloServerException.class,
+          () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));
+
+      removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);

Review Comment:
   My guess is that it works because I changed the catch clause in MinorCompactionTask from IOException to Exception. If you were to revert that single change, then I think this would fail and leave the tablet in a bad state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289984526


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +902,39 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then throw
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now. It's possible
+          // that a minc will not be initiated (if there is no data in the table for example)
+          if (initiateMinorCompaction(MinorCompactionReason.CLOSE)) {
+            getTabletMemory().waitForMinC();
+          }
+        }
+      }
+      // We don't want to initiate the close process on the tablet if the last minor compaction
+      // failed. Let the user resolve whatever the issue may be so that we get a successful
+      // minor compaction on the tablet before closing it.
+      if (!isLastMinCSuccessful()) {
+        String msg = "Aborting close on " + extent
+            + " because last minor compaction was not successful. Check the server log.";
+        log.warn(msg);
+        throw new RuntimeException(msg);
+      }
+
+    }
+

Review Comment:
   > If I understand correctly, the observed issue was fixed by catching Exception instead of IOException in MinorCompaction task. If that is correct then I think it may be best to remove this code for the following reasons.
   
   I believe that is true, I think that solved this specific issue. The code I added in `Tablet.initiateClose` was to try and prevent the tablet from getting into a bad state for some other unknown failure scenario. My concern is that a user could miss the fact that minor compactions are failing. Then, either due to a call to `split`, Tablet migration, or TabletServer close the Tablet gets put into a half-closed state because `compactable.close` is called. The only remedy for this situation currently is a hard shutdown of the TabletServer. The user does not have the option of fixing the issue and getting a successful minor compaction. Depending on how long this has been occurring, the recovery time could be lengthy. It this issue affects multiple TabletServers, then you could be looking at an entire system restart with a long recovery period.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ivakegg commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1283438529


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +903,47 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then thrown
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now.
+          if (!initiateMinorCompaction(MinorCompactionReason.CLOSE)) {
+            String msg = "Unable to initiate minc for close on " + extent
+                + ". Tablet might be closed or deleting.";
+            log.warn(msg);
+            throw new RuntimeException(msg);
+          }
+          // Not calling getTabletMemory().waitForMinC(); as it will wait
+          // indefinitely for a successful minor compaction. It's possible
+          // that won't happen. We just want to know if the current minc
+          // fails.
+          while (isMinorCompactionRunning()) {
+            UtilWaitThread.sleepUninterruptibly(50, TimeUnit.MILLISECONDS);

Review Comment:
   This could wait forever ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289956468


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +902,39 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then throw
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now. It's possible
+          // that a minc will not be initiated (if there is no data in the table for example)
+          if (initiateMinorCompaction(MinorCompactionReason.CLOSE)) {
+            getTabletMemory().waitForMinC();
+          }
+        }
+      }
+      // We don't want to initiate the close process on the tablet if the last minor compaction
+      // failed. Let the user resolve whatever the issue may be so that we get a successful
+      // minor compaction on the tablet before closing it.
+      if (!isLastMinCSuccessful()) {
+        String msg = "Aborting close on " + extent
+            + " because last minor compaction was not successful. Check the server log.";
+        log.warn(msg);
+        throw new RuntimeException(msg);
+      }
+
+    }
+

Review Comment:
   > If we find more places where the minor compaction code could have retried and it did not, then I think we should modify that minor compaction code to make it retry there.
   
   If we do that, and remove this check or something like it, then until we found and fixed all of the places where this happens Tablets could still get into a bad state where the only remedy is a restart of all TabletServers.
   
   > This code has some race conditions, for example a minor compaction could fail in another thread right after the check in this function. So the checks here do not completely solve the problem.
   
   Agreed, it's best effort. Maybe there is a way to fix the race condition and prevent another minc from starting? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289963518


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java:
##########
@@ -1473,6 +1473,13 @@ public CompactionServiceId getConfiguredService(CompactionKind kind) {
     try {
       var dispatcher = tablet.getTableConfiguration().getCompactionDispatcher();
 
+      if (dispatcher == null) {

Review Comment:
   Yes, when the context was invalid the logs were being polluted with stack traces.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1290544626


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.clientImpl.AccumuloServerException;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      Thread.sleep(500);
+
+      // This should fail to split, but not leave the tablets in a state where they can't
+      // be unloaded
+      assertThrows(AccumuloServerException.class,
+          () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));
+
+      removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);

Review Comment:
   I made these changes in 9f84ed4, however 2/3 of these new tests no longer work. I believe that is because I merged 3678 into 2.1 and into this branch. The tests that are failing are setting an invalid context, and when ClassLoaderUtil.getClassLoader is called, the context is not valid, and the non-context aware classloader is returned instead of the code returning null or throwing an error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291727612


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));

Review Comment:
   I know this is the default, feel like it documents intent in the test.
   
   ```suggestion
         c.tableOperations().compact(tableName, new CompactionConfig().setWait(true).setFlush(true));
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291728469


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}

Review Comment:
   I think it would ok to lower this time and simplify it a bit.
   
   ```suggestion
         // the offine operation should not be able to complete because the tablet can not minor compact
         UtilWaitThread.sleepUninterruptibly(5, TimeUnit.SECONDS);
   
         assertTrue(countHostedTablets(c, tid) > 0);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ivakegg commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1283436504


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +903,47 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then thrown

Review Comment:
   thrown -> throw



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1290542699


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -190,6 +190,7 @@ enum CompactionState {
   private CompactableImpl compactable;
 
   private volatile CompactionState minorCompactionState = null;
+  private volatile boolean lastMinCSuccessful = true;

Review Comment:
   Removed in 31260e0



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291736481


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}

Review Comment:
   No matter how long we sleep it's expected that there will be a non-zero tablets, so thinking we do not need to sleep that long. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291717460


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // remove the bad iterator. The failing minc task is in a backoff retry loop
+      // and should pick up this change on the next try
+      c.tableOperations().removeIterator(tableName, BadIterator.class.getSimpleName(),
+          EnumSet.of(IteratorScope.minc));
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+    }
+  }
+
+  public static void setInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,
+      TableId tableId) {
+    TablePropKey key = TablePropKey.of(context, tableId);
+    context.getPropStore().putAll(key,
+        Map.of(Property.TABLE_CLASSLOADER_CONTEXT.getKey(), "invalid"));
+  }
+
+  public static void removeInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,

Review Comment:
   I could try running it without this method. I think you might be correct, that it's not needed. It could also be removed and tested in a follow-on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1665542495

   I looked into this path where the minor compaction thread was dying because an invalid classloader context could not load the volume chooser to create the minor compaction output file.  The issue is more serious that what I initially thought, but thankfully the fix is easy.
   
   TLDR - A failed minor compaction thread might likely mean that subsequent minor compactions for a Tablet will **not** occur.  The fix is to catch `Exception` rather than `IOException` in `MinorCompactionTask` when creating the output file as in this scenario a `RuntimeException` is thrown.
   
   
   Longer analysis
   ---
   
   `TableOperationsImpl.flush` calls `ManagerClientServiceHandler.initiateFlush` which increments
   the value of the tables `ZTABLE_FLUSH_ID` node in ZooKeeper, and returns the new value. Then
   `TableOperationsImpl.flush` calls `ManagerClientServiceHandler.waitForFlush` passing in the flushId.
   `ManagerClientServiceHandler.waitForFlush` initially calls `TabletClientHandler.flush` on *every tablet server in the cluster*. If the client is not waiting for the flush to return then this returns and `TableOperationsImpl.flush` is done. If the client is waiting, then `ManagerClientServiceHandler` looks for tablets that are hosted or has walogs and whose flushId in the tablet metadata is less than the flushId returned from `initiateFlush`. Then `waitForFlush` continues to call `TabletClientHandler.flush` on the hosts that have tablets that meet this criteria for Long.MAX_VALUE iterations.
   
   `TabletClientHandler.flush` does nothing if the TabletServer does not have any tablets for the table. Otherwise it gets the current value of the tables `ZTABLE_FLUSH_ID`, which should be the same as the flushId returned at the beginning of this process. `Tablet.flush` is called on each of the Tablets for the Table on this TabletServer.
   
   If the Tablet has no entries in memory, then the tablet metadata is updated with the flushId and the variable `lastFlushID` is set to the flushId. If `Tablet.flush()` was called again on this Tablet with the same flushID, then it would do nothing.
   
   If the Tablet has entries in memory, then `Tablet.initiateMinorCompaction` is called, which may or may not create a MinorCompactionTask. If one is created, then it is executed in the minorCompactionThreadPool. `Tablet.initiateMinorCompaction` will not return a MinorCompactionTask if the tablet is closing, is closed, if there are no entries in memory, if there is another thread updating `lastFlushID`, or if the Tablets memory is already reserved for a minor compaction (meaning if one is already running).
   
   MinorCompactionTask when executed will create a new output file and call `TabletServer.minorCompactionStarted`, then it calls `Tablet.minorCompact()`. `Tablet.minorCompact` performs the minor compaction, then brings the minor compaction online by calling `DatafileManager.bringMinorCompactionOnline`, then calls `TabletMemory.finalizeMinC`. `DatafileManager.bringMinorCompactionOnline` sets `lastFlushID` and calls `Tablet.flushComplete` which sets `lastFlushID` and calls `TabletMemory.finishedMinC`.
   
   If `TabletMemory.finalizeMinC` and `TabletMemory.finishedMinC` are not called, then it appears that the tablet memory is already reserved for a minor compaction and a subsequent minor compaction will *never* start. This appears to be what is happening when there is an invalid context on the table as the call to create a new output file returns a RuntimeException. Only IOExceptions are caught at this location and the Thread dies due to the uncaught exception.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ctubbsii commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1285001891


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTablet2IT.java:
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import java.io.File;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.spi.fs.DelegatingChooser;
+import org.apache.accumulo.core.spi.fs.PreferredVolumeChooser;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+public class HalfClosedTablet2IT extends SharedMiniClusterBase {

Review Comment:
   The other IT got a javadoc, but it's not clear why this similarly named one exists in a separate file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1290423771


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +902,39 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then throw
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now. It's possible
+          // that a minc will not be initiated (if there is no data in the table for example)
+          if (initiateMinorCompaction(MinorCompactionReason.CLOSE)) {
+            getTabletMemory().waitForMinC();
+          }
+        }
+      }
+      // We don't want to initiate the close process on the tablet if the last minor compaction
+      // failed. Let the user resolve whatever the issue may be so that we get a successful
+      // minor compaction on the tablet before closing it.
+      if (!isLastMinCSuccessful()) {
+        String msg = "Aborting close on " + extent
+            + " because last minor compaction was not successful. Check the server log.";
+        log.warn(msg);
+        throw new RuntimeException(msg);
+      }
+
+    }
+

Review Comment:
   @EdColeman , @keith-turner , and I had a discussion about this. I'm going to remove this code from this PR and @keith-turner is going to evalutate doing something similar, but in a different manner.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289153602


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +902,39 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then throw
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now. It's possible
+          // that a minc will not be initiated (if there is no data in the table for example)
+          if (initiateMinorCompaction(MinorCompactionReason.CLOSE)) {
+            getTabletMemory().waitForMinC();
+          }
+        }
+      }
+      // We don't want to initiate the close process on the tablet if the last minor compaction
+      // failed. Let the user resolve whatever the issue may be so that we get a successful
+      // minor compaction on the tablet before closing it.
+      if (!isLastMinCSuccessful()) {
+        String msg = "Aborting close on " + extent
+            + " because last minor compaction was not successful. Check the server log.";
+        log.warn(msg);
+        throw new RuntimeException(msg);
+      }
+
+    }
+

Review Comment:
   If I understand correctly, the observed issue was fixed by catching Exception instead of IOException in MinorCompaction task.  If that is correct then I think it may be best to remove this code for the following reasons.
   
    * Think its better to make the minor compaction code aggressively retry (as was done in other parts of this PR) instead of attempting to handle it not retrying here.  If we find more places where the minor compaction code could have retried and it did not, then I think we should modify that minor compaction code to make it retry there.
    * This code has some race conditions, for example a minor compaction could fail in another thread right after the check in this function.  So the checks here do not completely solve the problem.
    * I think there is a bug in the code.  The call to waitForMinC() must be called with the tablet lock held or it will throw an exception, I do not think the tablet lock is held when its called.
    * This function is already initiating a minor compaction.  If we are going to initiate another minor compaction in the function it may be best to remove the later one.  However that is tricky because that later minc is coordinated with updating the flush id by the function.
   
   ```suggestion
   ```



##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java:
##########
@@ -1473,6 +1473,13 @@ public CompactionServiceId getConfiguredService(CompactionKind kind) {
     try {
       var dispatcher = tablet.getTableConfiguration().getCompactionDispatcher();
 
+      if (dispatcher == null) {

Review Comment:
   Was this if stmt added to avoid logging an NPE later in the code?



##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java:
##########
@@ -1473,6 +1473,13 @@ public CompactionServiceId getConfiguredService(CompactionKind kind) {
     try {
       var dispatcher = tablet.getTableConfiguration().getCompactionDispatcher();
 
+      if (dispatcher == null) {
+        log.error(
+            "Failed to dispatch compaction {} kind:{} hints:{}, falling back to {} service. Check server log.",

Review Comment:
   ```suggestion
               "Failed to dispatch compaction {} kind:{} hints:{}, falling back to {} service. Unable to instantiate dispatcher plugin.  Check server log.",
   ```



##########
server/base/src/main/java/org/apache/accumulo/server/conf/TableConfiguration.java:
##########
@@ -178,6 +178,14 @@ private static CompactionDispatcher createCompactionDispatcher(AccumuloConfigura
     CompactionDispatcher newDispatcher = Property.createTableInstanceFromPropertyName(conf,
         Property.TABLE_COMPACTION_DISPATCHER, CompactionDispatcher.class, null);
 
+    if (newDispatcher == null) {
+      // return early to prevent NPE
+      log.error(

Review Comment:
   Not sure if this matters.  The log level of error here does not match the log level of the message in `ConfigurationTypeHelper.getClassInstance()` which is warn.  The log message from `ConfigurationTypeHelper.getClassInstance()` will probably contain the information the user needs to determine why the compaction dispatcher plugin did not load.



##########
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletClientHandler.java:
##########
@@ -1009,6 +1009,9 @@ public void splitTablet(TInfo tinfo, TCredentials credentials, TKeyExtent tkeyEx
       } catch (IOException e) {
         log.warn("Failed to split " + keyExtent, e);
         throw new RuntimeException(e);
+      } catch (RuntimeException re) {
+        log.warn("Failed to split " + keyExtent, re);

Review Comment:
   Was this code only added to log additional information? 



##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -190,6 +190,7 @@ enum CompactionState {
   private CompactableImpl compactable;
 
   private volatile CompactionState minorCompactionState = null;
+  private volatile boolean lastMinCSuccessful = true;

Review Comment:
   Could remove this if the code that checks it in close is removed.



##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.clientImpl.AccumuloServerException;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      Thread.sleep(500);
+
+      // This should fail to split, but not leave the tablets in a state where they can't
+      // be unloaded
+      assertThrows(AccumuloServerException.class,
+          () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));
+
+      removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);

Review Comment:
   I tried removing the code in initiateClose and running the test locally.  Made the following changes to this test to get it running.
   
   ```suggestion
         Thread configFixer = new Thread(() -> {
           UtilWaitThread.sleep(3000);
           removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
                   tableId);
         });
   
         // grab this time before starting the thread starts and before it sleeps.
         long t1 = System.nanoTime();
         configFixer.start();
   
         // The split will probably start running w/ bad config that will cause it to get stuck.  However once the config is fixed by the background thread it should continue.
         tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
         long t2 = System.nanoTime();
   
         // expect that split took at least 3 seconds because that is the time it takes to fix the config
         assertTrue(TimeUnit.NANOSECONDS.toMillis(t2-t1) > 3000);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1675257235

   I re-ran the tests after #3683 and #3685 were merged into 2.1. Everything appears to be working. However, it's possible that the tests could pass without the minor compactions failing. This could happen if, for example, the minor compaction kicks off before the tablet server gets the property update to set an invalid context. I verified via log inspection that failures were happening and that it recovered. It would be good in the future to add some metrics or something that can be checked in the test to confirm that failures do happen on the minc before the configuration is corrected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291728223


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // remove the bad iterator. The failing minc task is in a backoff retry loop
+      // and should pick up this change on the next try
+      c.tableOperations().removeIterator(tableName, BadIterator.class.getSimpleName(),
+          EnumSet.of(IteratorScope.minc));
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+    }
+  }
+
+  public static void setInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,
+      TableId tableId) {
+    TablePropKey key = TablePropKey.of(context, tableId);
+    context.getPropStore().putAll(key,
+        Map.of(Property.TABLE_CLASSLOADER_CONTEXT.getKey(), "invalid"));
+  }
+
+  public static void removeInvalidClassLoaderContextPropertyWithoutValidation(ServerContext context,
+      TableId tableId) {
+    TablePropKey key = TablePropKey.of(context, tableId);
+    context.getPropStore().removeProperties(key,
+        List.of(Property.TABLE_CLASSLOADER_CONTEXT.getKey()));
+  }

Review Comment:
   Yeah, I guess we don't need to check the logs. If the property removal did not get picked up, then tests would not have passed. I'll apply.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291729318


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // remove the bad iterator. The failing minc task is in a backoff retry loop
+      // and should pick up this change on the next try
+      c.tableOperations().removeIterator(tableName, BadIterator.class.getSimpleName(),
+          EnumSet.of(IteratorScope.minc));
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);

Review Comment:
   ```suggestion
         // The previous operation to offline the table should be able to succeed after the minor compaction completed
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291730911


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);
+
+      // minc should fail, so there should be no files
+      FunctionalTestUtils.checkRFiles(c, tableName, 1, 1, 0, 0);
+
+      // tell the server to take the table offline
+      tops.offline(tableName);
+
+      try {
+        // We are expecting this to fail, the table should not be taken offline
+        // because the bad iterator will prevent the minc from completing.
+        Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 120_000);
+        fail("Zero hosted tablets for table.");
+      } catch (IllegalStateException e) {}

Review Comment:
   This too could cause intermittent test failures. The `waitFor` still has an upper limit, but will wait longer than 5 seconds if needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ivakegg commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1283439240


##########
server/base/src/main/java/org/apache/accumulo/server/conf/TableConfiguration.java:
##########
@@ -178,6 +178,14 @@ private static CompactionDispatcher createCompactionDispatcher(AccumuloConfigura
     CompactionDispatcher newDispatcher = Property.createTableInstanceFromPropertyName(conf,
         Property.TABLE_COMPACTION_DISPATCHER, CompactionDispatcher.class, null);
 
+    if (newDispatcher == null) {
+      // return early to prevent NPE
+      log.error(

Review Comment:
   Does it need to be an error however?  Could it be a debug or info?



##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java:
##########
@@ -1473,6 +1473,13 @@ public CompactionServiceId getConfiguredService(CompactionKind kind) {
     try {
       var dispatcher = tablet.getTableConfiguration().getCompactionDispatcher();
 
+      if (dispatcher == null) {
+        log.error(

Review Comment:
   debug or info ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1283499765


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +903,47 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then thrown
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now.

Review Comment:
   I need to re-work this as calling `initiateMinorCompaction` when closing will always return false.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291226865


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.clientImpl.AccumuloServerException;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      Thread.sleep(500);
+
+      // This should fail to split, but not leave the tablets in a state where they can't
+      // be unloaded
+      assertThrows(AccumuloServerException.class,
+          () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));
+
+      removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);

Review Comment:
   @keith-turner - If #3683 is merged, then I think that these tests will work again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1289961638


##########
server/base/src/main/java/org/apache/accumulo/server/conf/TableConfiguration.java:
##########
@@ -178,6 +178,14 @@ private static CompactionDispatcher createCompactionDispatcher(AccumuloConfigura
     CompactionDispatcher newDispatcher = Property.createTableInstanceFromPropertyName(conf,
         Property.TABLE_COMPACTION_DISPATCHER, CompactionDispatcher.class, null);
 
+    if (newDispatcher == null) {
+      // return early to prevent NPE
+      log.error(

Review Comment:
   Before this change the log file was getting peppered with NPE stack traces from below. This change was to make a single line log entry. I can't tell if you are making a suggestion with your comment on changing the log level or removing this, or just stating that the user should be able to find the information in the log as suggested in this log entry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1290542418


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java:
##########
@@ -901,6 +902,39 @@ public void close(boolean saveState) throws IOException {
   void initiateClose(boolean saveState) {
     log.trace("initiateClose(saveState={}) {}", saveState, getExtent());
 
+    // Check to see if last or current minc is failing. If so, then throw
+    // an exception before closing the compactable and leaving the tablet
+    // in a half-closed state. Don't throw IllegalStateException because
+    // calling code will just continue to retry.
+    if (saveState) {
+      if (!isLastMinCSuccessful()) {
+        if (isMinorCompactionRunning()) {
+          // Then current minor compaction is retrying and failure is being
+          // reported.
+          String msg = "Aborting close on " + extent
+              + " because last minor compaction was not successful. Check the server log.";
+          log.warn(msg);
+          throw new RuntimeException(msg);
+        } else {
+          // We don't know when the last minc occurred. Kick one off now. It's possible
+          // that a minc will not be initiated (if there is no data in the table for example)
+          if (initiateMinorCompaction(MinorCompactionReason.CLOSE)) {
+            getTabletMemory().waitForMinC();
+          }
+        }
+      }
+      // We don't want to initiate the close process on the tablet if the last minor compaction
+      // failed. Let the user resolve whatever the issue may be so that we get a successful
+      // minor compaction on the tablet before closing it.
+      if (!isLastMinCSuccessful()) {
+        String msg = "Aborting close on " + extent
+            + " because last minor compaction was not successful. Check the server log.";
+        log.warn(msg);
+        throw new RuntimeException(msg);
+      }
+
+    }
+

Review Comment:
   Removed this code in 31260e0.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1674035784

   #3685 moves the first minor compaction in tablet close to happen before any variables are set that begin the close process.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on a diff in pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on code in PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#discussion_r1291734516


##########
test/src/main/java/org/apache/accumulo/test/functional/HalfClosedTabletIT.java:
##########
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   https://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.accumulo.test.functional;
+
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.EnumSet;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.accumulo.core.client.Accumulo;
+import org.apache.accumulo.core.client.AccumuloClient;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.IteratorSetting;
+import org.apache.accumulo.core.client.admin.CompactionConfig;
+import org.apache.accumulo.core.clientImpl.ClientContext;
+import org.apache.accumulo.core.conf.Property;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.data.TableId;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
+import org.apache.accumulo.core.metadata.schema.TabletMetadata.ColumnType;
+import org.apache.accumulo.core.metadata.schema.TabletsMetadata;
+import org.apache.accumulo.core.util.UtilWaitThread;
+import org.apache.accumulo.harness.MiniClusterConfigurationCallback;
+import org.apache.accumulo.harness.SharedMiniClusterBase;
+import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.conf.store.TablePropKey;
+import org.apache.accumulo.test.util.Wait;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import com.google.common.collect.Sets;
+
+// This IT tests the cases seen in https://github.com/apache/accumulo/issues/3674
+// where a failing minor compaction causes a Tablet.initiateClose to leave the
+// Tablet in a half-closed state. The half-closed Tablet cannot be unloaded and
+// the TabletServer cannot be shutdown normally. Because the minor compaction has
+// been failing the Tablet needs to be recovered when it's ultimately re-assigned.
+//
+public class HalfClosedTabletIT extends SharedMiniClusterBase {
+
+  public static class HalfClosedTabletITConfiguration implements MiniClusterConfigurationCallback {
+
+    @Override
+    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
+      cfg.setNumTservers(1);
+    }
+
+  }
+
+  @BeforeAll
+  public static void startup() throws Exception {
+    SharedMiniClusterBase.startMiniClusterWithConfig(new HalfClosedTabletITConfiguration());
+  }
+
+  @AfterAll
+  public static void shutdown() throws Exception {
+    SharedMiniClusterBase.stopMiniCluster();
+  }
+
+  @Test
+  public void testSplitWithInvalidContext() throws Exception {
+
+    // In this scenario the table has been mis-configured with an invalid context name.
+    // The minor compaction task is failing because classes like the volume chooser or
+    // user iterators cannot be loaded. The user calls Tablet.split which calls initiateClose.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the split call with an invalid context. The context property is removed after the
+    // split call below to get the minor compaction task to succeed on a subsequent run. Because
+    // the minor compaction task backs off when retrying, this could take some time.
+
+    String tableName = getUniqueNames(1)[0];
+
+    try (final var client = Accumulo.newClient().from(getClientProps()).build()) {
+      final var tops = client.tableOperations();
+      tops.create(tableName);
+      TableId tableId = TableId.of(tops.tableIdMap().get(tableName));
+      try (final var bw = client.createBatchWriter(tableName)) {
+        final var m1 = new Mutation("a");
+        final var m2 = new Mutation("b");
+        m1.put(new Text("cf"), new Text(), new Value());
+        m2.put(new Text("cf"), new Text(), new Value());
+        bw.addMutation(m1);
+        bw.addMutation(m2);
+      }
+
+      setInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+          tableId);
+
+      // Need to wait for TabletServer to pickup configuration change
+      Thread.sleep(3000);
+
+      Thread configFixer = new Thread(() -> {
+        UtilWaitThread.sleep(3000);
+        removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
+            tableId);
+      });
+
+      long t1 = System.nanoTime();
+      configFixer.start();
+
+      // The split will probably start running w/ bad config that will cause it to get stuck.
+      // However once the config is fixed by the background thread it should continue.
+      tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
+
+      long t2 = System.nanoTime();
+      // expect that split took at least 3 seconds because that is the time it takes to fix the
+      // config
+      assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) >= 3000);
+
+      // offline the table which will unload the tablets. If the context property is not
+      // removed above, then this test will fail because the tablets will not be able to be
+      // unloaded
+      tops.offline(tableName);
+
+      Wait.waitFor(() -> countHostedTablets(client, tableId) == 0L, 340_000);
+    }
+  }
+
+  @Test
+  public void testIteratorThrowingTransientError() throws Exception {
+
+    // In this scenario a minc iterator throws an error some number of time, then
+    // succeeds. We want to verify that the minc is being retried and the tablet
+    // can be closed.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      IteratorSetting setting = new IteratorSetting(50, "error", ErrorThrowingIterator.class);
+      setting.addOption(ErrorThrowingIterator.TIMES, "3");
+      c.tableOperations().attachIterator(tableName, setting, EnumSet.of(IteratorScope.minc));
+      c.tableOperations().compact(tableName, new CompactionConfig().setWait(true));
+
+      // Taking the table offline should succeed normally
+      tops.offline(tableName);
+
+      // Minc should have completed successfully
+      Wait.waitFor(() -> tabletHasExpectedRFiles(c, tableName, 1, 1, 1, 1), 340_000);
+
+      Wait.waitFor(() -> countHostedTablets(c, tid) == 0L, 340_000);
+
+    }
+  }
+
+  // Note that these tests can talk several minutes each because by the time the test
+  // code changes the configuration, the minc has failed so many times that the minc
+  // is waiting for a few minutes before trying again. For example, I saw this backoff
+  // timing:
+  //
+  // DEBUG: MinC failed sleeping 169 ms before retrying
+  // DEBUG: MinC failed sleeping 601 ms before retrying
+  // DEBUG: MinC failed sleeping 2004 ms before retrying
+  // DEBUG: MinC failed sleeping 11891 ms before retrying
+  // DEBUG: MinC failed sleeping 43156 ms before retrying
+  // DEBUG: MinC failed sleeping 179779 ms before retrying
+  @Test
+  public void testBadIteratorOnStack() throws Exception {
+
+    // In this scenario the table is using an iterator for minc that is throwing an exception.
+    // This test ensures that the Tablet can still be unloaded normally by taking if offline
+    // after the bad iterator has been removed from the minc configuration.
+
+    try (AccumuloClient c = Accumulo.newClient().from(getClientProps()).build()) {
+
+      String tableName = getUniqueNames(1)[0];
+      final var tops = c.tableOperations();
+
+      tops.create(tableName);
+      final var tid = TableId.of(tops.tableIdMap().get(tableName));
+
+      IteratorSetting is = new IteratorSetting(30, BadIterator.class);
+      c.tableOperations().attachIterator(tableName, is, EnumSet.of(IteratorScope.minc));
+
+      try (BatchWriter bw = c.createBatchWriter(tableName)) {
+        Mutation m = new Mutation(new Text("r1"));
+        m.put("acf", tableName, "1");
+        bw.addMutation(m);
+      }
+
+      c.tableOperations().flush(tableName, null, null, false);
+
+      UtilWaitThread.sleepUninterruptibly(30, TimeUnit.SECONDS);

Review Comment:
   Its not expected that the assertion after the sleep will fail no matter how long we sleep.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on pull request #3677: Attempt to prevent half-closed Tablet due to failing minc

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.
dlmarion commented on PR #3677:
URL: https://github.com/apache/accumulo/pull/3677#issuecomment-1675396938

   > > It would be good in the future to add some metrics or something that can be checked in the test to confirm that failures do happen on the minc before the configuration is corrected.
   > 
   > Wondering if a metric that counts failed minor compactions would be useful. Should normally be zero.
   
   Yeah, that's what I'm thinking. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org