You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2020/11/24 14:55:20 UTC

[GitHub] [accumulo] tynyttie opened a new pull request #1803: Accumulo ticket#1303 - Explore creating unique directories for sorted WAL recovery

tynyttie opened a new pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803


   Added UniqueNameAllocator() to walPath in RecoveryPath.java. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755140353


   FWIW, on the main branch, the following command to run the single RestartIT by itself completes in about 10 minutes on my laptop. It seems to hang indefinitely with the unique name stuff added, possibly continuously looking for recovery tasks to work on, but I'm not sure yet.
   
   ```bash
   mvn clean verify -PskipQA -DskipTests=false -DskipITs=false -Dit.test=RestartIT -Dtest=nosuchtest
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r561354068



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -41,15 +47,14 @@ public static Path getRecoveryPath(Path walPath) {
 
       // drop wal
       walPath = walPath.getParent();
+      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + '-' + context.getInstanceID());

Review comment:
       You're not missing anything. You're correct. This does not make it unique per tserver process.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755141790


   Another reason it might pass with a fixed string, but not a unique string generator, is that the unique string might need to be consistent (at least, within a process), if different parts of the recovery code are getting the path at different times.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r561354068



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -41,15 +47,14 @@ public static Path getRecoveryPath(Path walPath) {
 
       // drop wal
       walPath = walPath.getParent();
+      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + '-' + context.getInstanceID());

Review comment:
       You're not missing anything. You're correct. This does not make it unique per tserver process.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-738149636


   > @ctubbsii On my side, I am not seeing the test fail when I run RestartIT.
   
   Try this command in your branch: `mvn clean verify -Dtest=blah -Dit.test=RestartIT -PskipQA -DskipTests=false -DskipITs=false` and then watch the logs in `test/target/mini-tests/org.apache.accumulo.test.functional.RestartIT_restartMasterRecovery/logs/*` for the `ERROR` line with `LogSorter` while that is running.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-738044326


   Looking into it now.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r568160420



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -50,7 +47,8 @@ public static Path getRecoveryPath(Path walPath, String portHost) {
 
       // drop wal
       walPath = walPath.getParent();
-      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + "-" + convertToSha1(portHost));
+      walPath =
+          new Path(walPath, FileType.RECOVERY.getDirectory() + "-" + DigestUtils.sha1Hex(portHost));

Review comment:
       I believe there's a variant of the sha256 method in DigestUtils that returns a hex-encoded string, that can simply be truncated directly, rather than using BigInteger, which shouldn't be necessary.
   
   However, I'm not sure that this hashing of a location strategy is going to work at all, based on the conversation above in https://github.com/apache/accumulo/pull/1803#discussion_r568151689




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-754800211


   Converting to draft for now, because it seems like the test failure issue is a bit difficult to track down.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r567991390



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -41,15 +50,46 @@ public static Path getRecoveryPath(Path walPath) {
 
       // drop wal
       walPath = walPath.getParent();
+      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + "-" + convertToSha1(portHost));

Review comment:
       Got it. Thank you. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756701622


   I like that solution @ctubbsii . It should be pretty simple to use the name of the tserver as the unique id. 
   
   I was also thinking of making `RecoveryPath.java` store all the recovery directories it creates so the other classes trying to get the recovery path wouldn't have too. Though if we go with your solution then that wouldn;t be necessary. I will leave that up to @tynyttie. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756239449


   I can foresee another issue possibly happening in the recover function of [TabletServer.java](https://github.com/apache/accumulo/blob/e510639958f442cd579b5e7e7fd2971b025dfae0/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L1166). It also calls the same getRecoveryPath function and it will return a new directory each time. Thoughts?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756216008


   I think I have found the issue, currently working on trying to come up with a solution. It appears everything does recover as expected (each recovery file contains the finished marker) but it doesn't properly remove it from the queue in `RecoveryManager`. This is because the destination path gets overridden with a new unique name so when it checks for that marker, it doesn't find it. At least I think that is how everything is supposed to work.
   
   What I see now in hdfs are multiple recovery directories with the same walog UUID and all with the finished markers. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] EdColeman commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
EdColeman commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r560252183



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -171,6 +171,8 @@ public boolean recoverLogs(KeyExtent extent, Collection<Collection<String>> walo
         String dest =
             RecoveryPath.getRecoveryPath(new Path(filename), master.getContext()).toString();
 
+        log.debug("This is walPath in RecoveryManager: " + filename);
+

Review comment:
       You could use parameter replacement in log statements
   `log.debug("This is walPath in RecoveryManager: {}", filename);`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r568031398



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -148,6 +149,18 @@ private boolean exists(final Path path) throws IOException {
     }
   }
 
+  public String getHostPort() {
+
+    String hostPort = "";
+    Set<TServerInstance> tserverInstances = master.onlineTabletServers();
+
+    if (tserverInstances.isEmpty() && tserverInstances.size() < 0) {
+      return hostPort = null;
+
+    } else
+      return hostPort = tserverInstances.stream().findFirst().get().toString();
+  }
+

Review comment:
       Do you have an idea on how to properly implement this? `RecoveryManager.recoverLogs` gets called in `tabletGroupWatcher.run` when the tablets are not hosted and there are walogs. From testing, `tls.current` is null at this point. We could use `tls.last` but I am unsure if last will always be set and is prone to change.  Thoughts?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755036001


   Process I used for debugging this issue. Note doing this process I tailed the logs for new info. 
   1. In RecoveryPath.java, I hard coded a string and notice that RestartIt passed with no issues. 
   Example: walPath = Path(walPath, FileType.RECOVERY.getDirectory() + '-' + 'addedFileName');
   
   2. In RecoveryPath.java., I used StringBuffer, StringWriter, and normal String creation to build a new string. 
   Example: testSting = FileType.RECOVERY.getDirectory() + '-' 'namer.getNextName();
                   Path newPath = new Path(walPath, testString); ------ RestartIT failed with this approach. 
   3. In RecoveryPath.java, instead of using UniqueNameAllocator to generate the name, I create a new name generate in order to make sure the name was present when needed. ---  RestartIT failed with this approach.
   4. I used Jstack to help track down the cause of threads stuck in wait state. With this approach, I couldn't pinpoint the true origin.
   5. I connected to the server and added break statement to walk through the code. This approach was unsuccessful. None of my break statements was picked up by the Tserver. 
   6. In LogSorter, I added a sleep where the issue is to make sure the process doesn't end. This approach was unsuccessful. 
       The issues are on: (LogSorter.java:197) and (LogSorter.java:167)
    
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-771595094


   > Since the original issue in #1303 was to "explore" the idea of creating unique directories, and a solution hasn't been identified in over 2 months, I'm inclined to think that this has been "explored" enough, and we should probably just close this, and the original issue #1303 , rather than continue to pursue it. ... unless somebody has some new strategy ready-to-try. Thoughts? Is #1303 worth continuing to pursue or no?
   
   FWIW, I do think this is possible to implement. Either with the current mechanism, someting similar, or a rewrite of how we do the recovery process currently. Just depends on much this feature is desired and beneficial. Especially on a large scale cluster. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755388636


   > Based on what I am seeing, the scan function in `ThriftScanner` will always error with "Failed to locate tablet for table : 1 row : 
   
   The failure in the scanner (client side) and retry is expected if the tablet isn't loaded yet because it has not yet completed recovery. What needs to be determined is what is holding up the recovery on the server-side. Once the tablet is recovered and loaded on the server side, the scanner will be able to find it without issue, and read from it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r568151689



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -148,6 +149,18 @@ private boolean exists(final Path path) throws IOException {
     }
   }
 
+  public String getHostPort() {
+
+    String hostPort = "";
+    Set<TServerInstance> tserverInstances = master.onlineTabletServers();
+
+    if (tserverInstances.isEmpty() && tserverInstances.size() < 0) {
+      return hostPort = null;
+
+    } else
+      return hostPort = tserverInstances.stream().findFirst().get().toString();
+  }
+

Review comment:
       @Manno15 If the question is directed at me: I don't, not without digging into this to come up with a solution myself. If `tls.last` isn't guaranteed to be the tserver that is going to be assigned this, and performing the recovery, then that would not be appropriate either.
   
   It seems to me that the basic requirement is that the manager must decide the destination path, then somehow communicate that to the tserver that was assigned to do the sorting work. If that tserver doesn't finish, then the manager must decide on a new destination path, and communicate that to the next tserver requested to do the work (possibly the same tserver, if it came back online). For any given work order to sort the logs, the tserver must use the directory specified, and presumably, the manager must be able to know which ones have succeeded and which have failed. But, I'm really not that familiar with the recovery process, and would have to do a lot of digging to understand this more.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r533688173



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -42,14 +48,14 @@ public static Path getRecoveryPath(Path walPath) {
       // drop wal
       walPath = walPath.getParent();
 
-      walPath = new Path(walPath, FileType.RECOVERY.getDirectory());
+      walPath = new Path(walPath,
+          FileType.RECOVERY.getDirectory() + '-' + context.getUniqueNameAllocator().getNextName());
+      log.debug("Unique Name Allocated:  " + walPath);

Review comment:
       ```suggestion
         log.debug("Directory selected for WAL recovery:  " + walPath);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r568151689



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -148,6 +149,18 @@ private boolean exists(final Path path) throws IOException {
     }
   }
 
+  public String getHostPort() {
+
+    String hostPort = "";
+    Set<TServerInstance> tserverInstances = master.onlineTabletServers();
+
+    if (tserverInstances.isEmpty() && tserverInstances.size() < 0) {
+      return hostPort = null;
+
+    } else
+      return hostPort = tserverInstances.stream().findFirst().get().toString();
+  }
+

Review comment:
       @Manno15 If the question is directed at me: I don't, not without digging into this to come up with a solution myself.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-771595094


   > Since the original issue in #1303 was to "explore" the idea of creating unique directories, and a solution hasn't been identified in over 2 months, I'm inclined to think that this has been "explored" enough, and we should probably just close this, and the original issue #1303 , rather than continue to pursue it. ... unless somebody has some new strategy ready-to-try. Thoughts? Is #1303 worth continuing to pursue or no?
   
   FWIW, I do think this is possible to implement. Either with the current mechanism, something similar, or a rewrite of how we do the recovery process currently. Just depends on much this feature is desired and beneficial. Especially on a large scale cluster. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-733098104


   Thanks for the PR @tynyttie ; It looks like the build is failing due to formatting issues, so the tests aren't being run. Please run `mvn clean package -DskipTests` locally on your branch to format the files, and commit the changes to your branch. Then, we will be able to see the test results from in the PR checks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r529776670



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -18,13 +18,19 @@
  */
 package org.apache.accumulo.server.master.recovery;
 
+import org.apache.accumulo.server.ServerContext;
+import org.apache.accumulo.server.cli.ServerUtilOpts;
 import org.apache.accumulo.server.fs.VolumeManager.FileType;
 import org.apache.hadoop.fs.Path;
 
 public class RecoveryPath {
 
   // given a wal path, transform it to a recovery path
   public static Path getRecoveryPath(Path walPath) {
+
+    ServerUtilOpts opts = new ServerUtilOpts();
+    ServerContext context = opts.getServerContext();

Review comment:
       This isn't the right way to get the ServerContext. The caller should pass in the context reference that they have, as an additional parameter to this method. I see two callers for this method. `TabletServer` should have a `getContext()` method on it, and `RecoveryManager` can use `master.getContext()`.

##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -42,7 +48,8 @@ public static Path getRecoveryPath(Path walPath) {
       // drop wal
       walPath = walPath.getParent();
 
-      walPath = new Path(walPath, FileType.RECOVERY.getDirectory());
+      walPath = new Path(walPath,
+          FileType.RECOVERY.getDirectory() + '/' + context.getUniqueNameAllocator().getNextName());

Review comment:
       It would be good to log a message about the name that was chosen, before being used here, as a debug message, so that a an admin can trace the specific recovery directory back to a specific server from the server logs if necessary.
   
   Instead of using `'/'`, you can use `'-'` so that way the directories look like `recovery-<uniquePart>`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-738044326


   Looking into now.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-761071982


   That is correct @milleruntime. Some ideas have been thrown around on how to fix it but implementing it has been more difficult since `TabletServer.java` also uses `RecoveryPath.java` for recovering tablets. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-771595094


   > Since the original issue in #1303 was to "explore" the idea of creating unique directories, and a solution hasn't been identified in over 2 months, I'm inclined to think that this has been "explored" enough, and we should probably just close this, and the original issue #1303 , rather than continue to pursue it. ... unless somebody has some new strategy ready-to-try. Thoughts? Is #1303 worth continuing to pursue or no?
   
   FWIW, I do think this is possible to implement. Either with the current mechanism, something similar, or a rewrite of how we do the recovery process currently. Just depends on much this feature is desired and beneficial. Especially on a large scale cluster. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] DomGarguilo commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r560268377



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -171,6 +171,8 @@ public boolean recoverLogs(KeyExtent extent, Collection<Collection<String>> walo
         String dest =
             RecoveryPath.getRecoveryPath(new Path(filename), master.getContext()).toString();
 
+        log.debug("This is walPath in RecoveryManager: " + filename);
+

Review comment:
       I see you updated this log statement in the latest commit. Looks like there may be a couple others spots that could be changed to use this as well for the sake of consistency.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie closed pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie closed pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756216008


   I think I have found the issue and I am currently working on trying to come up with a solution. It appears everything does recover as expected (each recovery file contains the finished marker) but it doesn't properly remove it from the queue in `RecoveryManager`. This is because the destination path gets overridden with a new unique name so by the time it checks for that marker, it doesn't find it. At least I think that is how everything is supposed to work.
   
   What I see now in hdfs are multiple recovery directories with the same walog UUID and all with the finished markers. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756405644


   Another option for having each tserver use a unique directory, is to make it unique and non-random, so it is reproducible. So, if the tserver had a unique id for its lifecycle, it could just append that and it would be consistent on multiple calls to the method.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ivakegg commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ivakegg commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r561293553



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -41,15 +47,14 @@ public static Path getRecoveryPath(Path walPath) {
 
       // drop wal
       walPath = walPath.getParent();
+      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + '-' + context.getInstanceID());

Review comment:
       Perhaps I am missing something here, but isn't context.getInstanceID() the accumulo instance id?  How does this make the directory any more unique?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] EdColeman commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
EdColeman commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756262056


   Part of this reminds me of https://github.com/apache/accumulo/issues/961 - I don't think it is exactly the same, but maybe that description / solution provides pointers for things to look for? Or not....


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-771595094


   > Since the original issue in #1303 was to "explore" the idea of creating unique directories, and a solution hasn't been identified in over 2 months, I'm inclined to think that this has been "explored" enough, and we should probably just close this, and the original issue #1303 , rather than continue to pursue it. ... unless somebody has some new strategy ready-to-try. Thoughts? Is #1303 worth continuing to pursue or no?
   
   FWIW, I do think this is possible to implement. Either with the current mechanism, someting similar, or a rewrite of how we do the recovery process currently. Just depends on much this feature is desired and beneficial. Especially on a large scale cluster. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r567992140



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -148,6 +149,18 @@ private boolean exists(final Path path) throws IOException {
     }
   }
 
+  public String getHostPort() {
+
+    String hostPort = "";
+    Set<TServerInstance> tserverInstances = master.onlineTabletServers();
+
+    if (tserverInstances.isEmpty() && tserverInstances.size() < 0) {
+      return hostPort = null;
+
+    } else
+      return hostPort = tserverInstances.stream().findFirst().get().toString();
+  }
+

Review comment:
       Size can't be less than 0.
   
   ```suggestion
     public String getHostPort() {
       Set<TServerInstance> tserverInstances = master.onlineTabletServers();
       return tserverInstances.isEmpty() ? null : tserverInstances.stream().findFirst().get().toString();
     }
   
   ```
   
   However, this won't work either, because this doesn't return the current tserver's host and port, it returns whichever the master.onlineTabletServers() returns first.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-771883813


   It occurs to me that even if we could get this working, during recovery it would be possible that it could repeatedly fail and each new attempt would fill DFS even more as it kept trying. Given the increasing complexity of this issue, and the potential for downsides, with the upside being not entirely clear, I'm thinking that this issue has been explored enough, and this, along with #1303 should be closed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756701622


   I like that solution @ctubbsii . It should be pretty simple to use the name of the tserver as the unique id. 
   
   I was also thinking of making `RecoveryPath.java` store all the recovery directories it creates so the other classes trying to get the recovery path wouldn't have too. Though if we go with your solution then that wouldn;t be necessary. I will leave that up to @tynyttie. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r567992140



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -148,6 +149,18 @@ private boolean exists(final Path path) throws IOException {
     }
   }
 
+  public String getHostPort() {
+
+    String hostPort = "";
+    Set<TServerInstance> tserverInstances = master.onlineTabletServers();
+
+    if (tserverInstances.isEmpty() && tserverInstances.size() < 0) {
+      return hostPort = null;
+
+    } else
+      return hostPort = tserverInstances.stream().findFirst().get().toString();
+  }
+

Review comment:
       Size can't be less than 0.
   
   ```suggestion
     public String getHostPort() {
       Set<TServerInstance> tserverInstances = master.onlineTabletServers();
       return tserverInstances.isEmpty() ? null : tserverInstances.stream().findFirst().get().toString();
     }
   
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756216008


   I think I have found the issue and I am currently working on trying to come up with a solution. It appears everything does recover as expected (each recovery file contains the finished marker) but it doesn't properly remove it from the queue in `RecoveryManager`. This is because the destination path gets overridden with a new unique name so when it checks for that marker, it doesn't find it. At least I think that is how everything is supposed to work.
   
   What I see now in hdfs are multiple recovery directories with the same walog UUID and all with the finished markers. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-736788610


   I'm going to run the full ITs on this change, but I don't anticipate any issues with this. Thanks, @tynyttie !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r560260663



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -171,6 +171,8 @@ public boolean recoverLogs(KeyExtent extent, Collection<Collection<String>> walo
         String dest =
             RecoveryPath.getRecoveryPath(new Path(filename), master.getContext()).toString();
 
+        log.debug("This is walPath in RecoveryManager: " + filename);
+

Review comment:
       Thank you EdColeman, I am making the change now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755374097


   Based on what I am seeing, the scan function in `ThriftScanner` will always error with "Failed to locate tablet for table : 1 row : row_0000000000". Then it pauses and will ultimately retry. [Here is the scan function that I am seeing the breakpoints hit](https://github.com/apache/accumulo/blob/44325585c2bd8aed0a3ac87a84f44e7b6755d435/core/src/main/java/org/apache/accumulo/core/clientImpl/ThriftScanner.java#L239)
   
   I will hardcode the path like @tynyttie did and see if I notice different behavior in that function. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
milleruntime edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-745345124


   FYI the error that @ctubbsii referenced `java.io.IOException: Stream is closed!` I think is fairly "normal" behavior for recovery.  There is no way to close the WAL properly if the tserver is killed so during recovery, when we read the WAL and try to close it, this is the error that is seen.  This error is also seen when running RestartIT in the main branch.  There must be something else going on for the test to time out.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-771883813


   It occurs to me that even if we could get this working, during recovery it would be possible that it could repeatedly fail and each new attempt would fill DFS even more as it kept trying. Given the increasing complexity of this issue, and the potential for downsides, with the upside being not entirely clear, I'm thinking that this issue has been explored enough, and this, along with #1303 should be closed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756701622


   I like that solution @ctubbsii. It should be pretty simple to use the name of the tserver as the unique id. It also seems that your solution is more inlined with what the original issue is discussing. 
   
   I was also thinking of making `RecoveryPath.java` store all the recovery directories it creates so the other classes trying to get the recovery path wouldn't have too. Though if we go with your solution then that wouldn;t be necessary. I will leave that up to @tynyttie. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-745345124


   FYI the error that @ctubbsii referenced `java.io.IOException: Stream is closed!` I think is fairly "normal" behavior for recovery.  There is usually no way to close the WAL properly if the tserver is killed so when we read the WAL and try to close it, this is the error that is seen.  This error is also seen when running RestartIT in the main branch.  There must be something else going on for the test to time out.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755139190


   Thanks, @tynyttie ; I spent some time looking into this, and realized that the "Stream is closed" error messages are a bit of a red herring. I've created a PR to fix those in #1856 .
   
   Although I still haven't figured out why the changes in this PR are causing RestartIT to hang, I suspect it has something to do with the fact that it's doing recovery using a RawLocalFileSystem, which might behave differently than the default implementation of the local FileSystem.
   
   In addition, I found a deficiency in this PR's code. It does not update the SimpleGarbageCollector code that looks for recovery files to clean up. The following patch is my attempt to fix that, but because of the hanging issue of RestartIT, I'm not sure if it will be sufficient.
   
   <details>
   <summary>Click to expand/collapse patch for SimpleGarbageCollector</summary>
   
   ```patch
   diff --git a/server/base/src/main/java/org/apache/accumulo/server/ServerConstants.java b/server/base/src/main/java/org/apache/accumulo/server/ServerConstants.java
   index ae6c0da8c7..1e357e5af5 100644
   --- a/server/base/src/main/java/org/apache/accumulo/server/ServerConstants.java
   +++ b/server/base/src/main/java/org/apache/accumulo/server/ServerConstants.java
   @@ -166,8 +166,8 @@ public class ServerConstants {
        return prefix(getBaseUris(context), TABLE_DIR);
      }
    
   -  public static Set<String> getRecoveryDirs(ServerContext context) {
   -    return prefix(getBaseUris(context), RECOVERY_DIR);
   +  public static Set<String> getRecoveryDirPatterns(ServerContext context) {
   +    return prefix(getBaseUris(context), RECOVERY_DIR + "-*");
      }
    
      public static Path getInstanceIdLocation(Volume v) {
   diff --git a/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java b/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java
   index 40cd9d180d..da432854b0 100644
   --- a/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java
   +++ b/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java
   @@ -393,10 +393,9 @@ public class GarbageCollectWriteAheadLogs {
      protected Map<UUID,Path> getSortedWALogs() throws IOException {
        Map<UUID,Path> result = new HashMap<>();
    
   -    for (String dir : ServerConstants.getRecoveryDirs(context)) {
   -      Path recoveryDir = new Path(dir);
   -      if (fs.exists(recoveryDir)) {
   -        for (FileStatus status : fs.listStatus(recoveryDir)) {
   +    for (String dir : ServerConstants.getRecoveryDirPatterns(context)) {
   +      for (FileStatus recoveryDir : fs.globStatus(new Path(dir))) {
   +        for (FileStatus status : fs.listStatus(recoveryDir.getPath())) {
              try {
                UUID logId = path2uuid(status.getPath());
                result.put(logId, status.getPath());
   ```
   
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-771183971


   Since the original issue in #1303 was to "explore" the idea of creating unique directories, and a solution hasn't been identified in over 2 months, I'm inclined to think that this has been "explored" enough, and we should probably just close this, and the original issue #1303 , rather than continue to pursue it. ... unless somebody has some new strategy ready-to-try. Thoughts? Is #1303 worth continuing to pursue or no?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] EdColeman commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
EdColeman commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r568086238



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -50,7 +47,8 @@ public static Path getRecoveryPath(Path walPath, String portHost) {
 
       // drop wal
       walPath = walPath.getParent();
-      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + "-" + convertToSha1(portHost));
+      walPath =
+          new Path(walPath, FileType.RECOVERY.getDirectory() + "-" + DigestUtils.sha1Hex(portHost));

Review comment:
       Recently it was noted that using sha1 can be flagged by some code security scan utilities - would using sha256 or even sha512 eliminate those utilities from flagging "unsecure usages"?  This is not a security related hash so it does not matter, but those utilities don't take that into account.
   
   The length of the string could also be shortened using something like:
   
   ```String s = new BigInteger(DigestUtils.sha256(portHost).toString(MAX_RADIX).substring(0,40)```
   
   This would have the same number of characters as the sha1 digest, but more "bits" using base36 vs base16 - or the string length could be shortened so that it was equivalent to the 40 digit sha1 without increasing the (small) risk of a collision. This also would allow using sha512 without a ridiculous string length.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756701622


   I like that solution @ctubbsii. It should be pretty simple to use the name of the tserver as the unique id. It also seems that your solution is more inlined with what the original issue is discussing. 
   
   I was also thinking of making `RecoveryPath.java` store all the recovery directories it creates so the other classes trying to get the recovery path wouldn't have too. Though if we go with your solution then that wouldn;t be necessary. I will leave that up to @tynyttie. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 edited a comment on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-756216008


   I think I have found the issue and I am currently working on trying to come up with a solution. It appears everything does recover as expected (each recovery file contains the finished marker) but it doesn't properly remove it from the queue in [RecoveryManager.java](https://github.com/apache/accumulo/blob/e510639958f442cd579b5e7e7fd2971b025dfae0/server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L172). This is because the destination path gets overridden with a new unique name so by the time it checks for that marker, it doesn't find it. At least I think that is how everything is supposed to work.
   
   What I see now in hdfs are multiple recovery directories with the same walog UUID and all with the finished markers. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755391595


   That does make a lot of sense.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-738064714


   @ctubbsii On my side, I am not seeing the test fail when I run RestartIT. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-761087325


   I think using the name of the tserver as the unique sub directory under the WAL directory would work.  As an added benefit, when the RecoveryManager sees that dir at the beginning of `recoveryLogs` it could also check if that tserver is still alive (coming from the master so this may be as simple as looking in a list of tservers).  It would still be possible that the tserver is alive and the thread died/hung or it was restarted so that sort process is gone.  But we could print warnings for this case and start another sort either way.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] tynyttie commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
tynyttie commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-738158991


   Ok.Thank you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-761136440


   > I think using the name of the tserver as the unique sub directory under the WAL directory would work. As an added benefit, when the RecoveryManager sees that dir at the beginning of `recoveryLogs` it could also check if that tserver is still alive (coming from the master so this may be as simple as looking in a list of tservers). It would still be possible that the tserver is alive and the thread died/hung or it was restarted so that sort process is gone. But we could print warnings for this case and start another sort either way.
   
   The hiccup I can see with this is that I don't know what "name of the tserver" is. The most common identifier for a tserver is the hostname and port, but these are not necessarily unique (tservers fail and restart on the same host/port). I was actually just thinking about the tserver generating a unique ID internally when it starts, that can be used to make unique directories. The unique ID could be logged, or could be sent in hello messages to the manager server, or stored in Zookeeper locks, for additional uses, but for this issue, it would be enough to just log the directory that it created to be used for recovery.
   
   Once the uniqueness issue is ironed out, with unique names, the SimpleGarbageCollector still needs to be updated and tested to ensure it works properly, cleaning up the old directories. I mocked up a patch above for that, but it has not been tested.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r567989242



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -41,15 +50,46 @@ public static Path getRecoveryPath(Path walPath) {
 
       // drop wal
       walPath = walPath.getParent();
+      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + "-" + convertToSha1(portHost));

Review comment:
       commons-codec, which is one of our dependencies, has a method, `DigestUtils.sha1Hex(string)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-755374097


   Based on what I am seeing, the scan function in `ThriftScanner` will always error with "Failed to locate tablet for table : 1 row : row_0000000000". Then it pauses and will ultimately retry. [Here is the scan function that I am seeing the breakpoints hit](https://github.com/apache/accumulo/blob/44325585c2bd8aed0a3ac87a84f44e7b6755d435/core/src/main/java/org/apache/accumulo/core/clientImpl/ThriftScanner.java#L239)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r568523204



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -148,6 +149,18 @@ private boolean exists(final Path path) throws IOException {
     }
   }
 
+  public String getHostPort() {
+
+    String hostPort = "";
+    Set<TServerInstance> tserverInstances = master.onlineTabletServers();
+
+    if (tserverInstances.isEmpty() && tserverInstances.size() < 0) {
+      return hostPort = null;
+
+    } else
+      return hostPort = tserverInstances.stream().findFirst().get().toString();
+  }
+

Review comment:
       > @Manno15 If the question is directed at me: I don't, not without digging into this to come up with a solution myself. If tls.last isn't guaranteed to be the tserver that is going to be assigned this, and performing the recovery, then that would not be appropriate either.
   
   I am not sure yet if tls.last is guranteed. I or @tynyttie  would have to do more research on a bigger cluster to get to the bottom of this. 
   
   > it seems to me that the basic requirement is that the manager must decide the destination path, then somehow communicate that to the tserver that was assigned to do the sorting work. 
   
   Based on what I am reading, the Tserver side of recover is only called in `Tablet.java`. This is after `RecoveryManager` is finishing do its side of te work. Not sure how everything fits in to get the full picture yet. 
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-761068173


   I think I figured out the issue with this change.  The problem is with how `RecoveryManager.recoverLogs()` works.  The first time the Master will call `RecoveryManager.recoverLogs()` it will generate `dest` (call it dest1) using `RecoveryPath.getRecoveryPath()`.  RecoveryManager will check for a "finished" file at this location, which won't exist yet and kick off the log sorting process.  The LogSorter job will complete and put a "finished" marker at dest1.  The next time the Master tries to assign the Tablet, it will call `RecoveryManager.recoverLogs()` and generate a second `dest` (call it dest2).  RecoveryManager will check for a "finished" file at dest2, won't see it and start another log sort process.  The "finished" file is never seen by the next call of `RecoveryManager.recoverLogs()` and the log sorting process will continue indefinitely. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ivakegg commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
ivakegg commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r561293553



##########
File path: server/base/src/main/java/org/apache/accumulo/server/master/recovery/RecoveryPath.java
##########
@@ -41,15 +47,14 @@ public static Path getRecoveryPath(Path walPath) {
 
       // drop wal
       walPath = walPath.getParent();
+      walPath = new Path(walPath, FileType.RECOVERY.getDirectory() + '-' + context.getInstanceID());

Review comment:
       Perhaps I am missing something here, but isn't context.getInstanceID() the accumulo instance id?  How does this make the directory any more unique?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#issuecomment-763612129


   > The hiccup I can see with this is that I don't know what "name of the tserver" is. The most common identifier for a tserver is the hostname and port, but these are not necessarily unique (tservers fail and restart on the same host/port).
   
   I think that is fine for recovery sorting.  If a tserver fails or restarts, then I don't think we want to restart the sort anyway.  I don't know if it is possible or worth the trouble of trying to recover the sort recovery.  I was thinking hostname and port was enough to indicate a sort was kicked off and by which server.  We also wouldn't need additional code to store the unique ID if we use hostname and port.
   
   Good point about updating the GC.  I think we may also need to introduce an age off property (or reuse another) to know when we should give up waiting for a sort to finish.  I am not sure how long we would want to wait for the sort to finish.  Waiting too long will prevent Tablet's from loading but larger clusters can take hours to recovery.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] Manno15 commented on a change in pull request #1803: Fix #1303 Create unique dirs for sorted WALs

Posted by GitBox <gi...@apache.org>.
Manno15 commented on a change in pull request #1803:
URL: https://github.com/apache/accumulo/pull/1803#discussion_r568523204



##########
File path: server/manager/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java
##########
@@ -148,6 +149,18 @@ private boolean exists(final Path path) throws IOException {
     }
   }
 
+  public String getHostPort() {
+
+    String hostPort = "";
+    Set<TServerInstance> tserverInstances = master.onlineTabletServers();
+
+    if (tserverInstances.isEmpty() && tserverInstances.size() < 0) {
+      return hostPort = null;
+
+    } else
+      return hostPort = tserverInstances.stream().findFirst().get().toString();
+  }
+

Review comment:
       > @Manno15 If the question is directed at me: I don't, not without digging into this to come up with a solution myself. If tls.last isn't guaranteed to be the tserver that is going to be assigned this, and performing the recovery, then that would not be appropriate either.
   
   I am not sure yet if tls.last is guranteed. I or @tynyttie  would have to do more research on a bigger cluster to get to the bottom of this. 
   
   > it seems to me that the basic requirement is that the manager must decide the destination path, then somehow communicate that to the tserver that was assigned to do the sorting work. 
   
   Based on what I am reading, the Tserver side of recover is only called in `Tablet.java`. This is after `RecoveryManager` is finishing do its side of te work. Not sure how everything fits in to get the full picture yet. 
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org