You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/11/01 14:36:00 UTC

[GitHub] [accumulo] keith-turner commented on a change in pull request #2329: Attempt at failing FaTE transactions that are blocking this FaTE transaction from the write lock

keith-turner commented on a change in pull request #2329:
URL: https://github.com/apache/accumulo/pull/2329#discussion_r740260185



##########
File path: core/src/main/java/org/apache/accumulo/fate/zookeeper/DistributedReadWriteLock.java
##########
@@ -218,22 +237,54 @@ public boolean tryLock() {
       }
       SortedMap<Long,byte[]> entries = qlock.getEarlierEntries(entry);
       Iterator<Entry<Long,byte[]>> iterator = entries.entrySet().iterator();
-      if (!iterator.hasNext())
+      if (!iterator.hasNext()) {
         throw new IllegalStateException("Did not find our own lock in the queue: " + this.entry
             + " userData " + new String(this.userData, UTF_8) + " lockType " + lockType());
-      return iterator.next().getKey().equals(entry);
+      }
+      if (!failBlockers) {
+        return iterator.next().getKey().equals(entry);
+      } else {
+        ZooStore<DistributedReadWriteLock> zs;
+        try {
+          zs = new ZooStore<>(zooPath, zrw);
+        } catch (KeeperException | InterruptedException e1) {
+          log.error("Error creating zoo store", e1);
+          return false;
+        }
+        final AdminUtil<DistributedReadWriteLock> util = new AdminUtil<>();
+        boolean result = true;
+        while (iterator.hasNext()) {
+          Entry<Long,byte[]> e = iterator.next();
+          if (!e.getKey().equals(entry)) {
+            result &= util.prepFail(zs, zrw, zooManagerPath, Long.toString(e.getKey(), 16));

Review comment:
       I see two problems with the call here.  First prepFail() calls  checkGlobalLock() which calls System.exit() when the constuctor AdminUtil<>() is called.  The reason it calles checkGlobalLock is because I think this code was only intended to be called when the manager was not running and no repos were running.  The Second problem is that prepFail only initiates failure, it does not wait for any operations that may currently be actively executing and holding the lock to finish.  This is why prepFail checks that the Manager is not running, because it does not deal with this concurrency issue any way.  So maybe something could get the lock by failing other operations, but those failed operation could still be running and mutating the metadata table and zookeeper after the lock is acquired.
   
   The first problem is easy to address by calling a diff constructor, but the 2nd problem would require some new mechanism to initiate failure and wait for anything that may be running. Since all fate ops are run in the manager could have something that does the following.
   
   - Prevents repos related to failed txids from starting to run (like if it was sitting a thread pool queue)
   - Waits until repos related to failed txids that are currently running complete. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org