You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by GitBox <gi...@apache.org> on 2023/01/03 15:18:27 UTC

[GitHub] [ratis] SzyWilliam opened a new pull request, #802: RATIS-1764. [GrpcLogAppender] Automatic recover from install snapshot failure

SzyWilliam opened a new pull request, #802:
URL: https://github.com/apache/ratis/pull/802

   ## What changes were proposed in this pull request?
   
   In grpcLogAppender, when installing snapshot from leader and fails (corrupted MD5 or other reason), the log-appender can't automatically recover from this failure in the subsequent retry attempts. 
   The retry attempt will fail because the `createDirectories` [here](https://github.com/apache/ratis/blob/1b3c61d5dbcdb2b1ac5fd49a624a6cb88ab04744/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java#L105) will throw a file already exist exception. This PR add a if-clause to create directory only when the directory does not exists.
   
   
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/RATIS-1764
   
   ## How was this patch tested?
   1. unit tests
   2. manual test by simulating a failure scene and to see if the retry result is successful.
   
   
   ## Appendix
   exception stack
   ```java
   2022-12-19 20:26:15,071 [grpc-default-executor-4936] ERROR o.a.r.s.i.SnapshotInstallationHandler:96 - 5@group-00010000001E: installSnapshot failed
   org.apache.ratis.io.CorruptedFileException: File /data/liuzhen_test/master_1216_d426f7a/data/datanode/data/snapshot/.tmp.group-00010000001E/snapshot-33b65238-075a-4ee4-8723-014e04f80da8/66_158230/sequence/root.test.g_3/30/2538/1671443330163-38-0-0.tsfile.resource (exist? false, length=0) is corrupted: MD5 mismatch for snapshot-158230 installation.  Renamed temporary snapshot file /data/liuzhen_test/master_1216_d426f7a/data/datanode/data/snapshot/.tmp.group-00010000001E/snapshot-33b65238-075a-4ee4-8723-014e04f80da8/66_158230/sequence/root.test.g_3/30/2538/1671443330163-38-0-0.tsfile.resource to /data/liuzhen_test/master_1216_d426f7a/data/datanode/data/snapshot/.tmp.group-00010000001E/snapshot-33b65238-075a-4ee4-8723-014e04f80da8/66_158230/sequence/root.test.g_3/30/2538/1671443330163-38-0-0.tsfile.resource.corrupt20221219-202615_070
           at org.apache.ratis.server.storage.SnapshotManager.installSnapshot(SnapshotManager.java:155)
           at org.apache.ratis.server.impl.ServerState.installSnapshot(ServerState.java:480)
           at org.apache.ratis.server.impl.SnapshotInstallationHandler.checkAndInstallSnapshot(SnapshotInstallationHandler.java:181)
           at org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:120)
           at org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:94)
           at org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1517)
           at org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:640)
           at org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:242)
           at org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:239)
           at org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:124)
           at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
           at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
           at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:332)
           at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:315)
           at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
           at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
           at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
           2022-12-19 20:26:50,344 [grpc-default-executor-4939] ERROR o.a.r.s.i.SnapshotInstallationHandler:96 - 5@group-00010000001E: installSnapshot failed
   java.nio.file.FileAlreadyExistsException: /data/liuzhen_test/master_1216_d426f7a/data/datanode/data/snapshot/.tmp.group-00010000001E/snapshot-70bc5e65-4052-44c5-8aa8-f9104bbd5f7d/66_158230/sequence/root.test.g_3/30/2538/1671440136737-9-0-0.tsfile
           at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
           at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
           at java.nio.file.Files.createDirectory(Files.java:674)
           at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
           at java.nio.file.Files.createDirectories(Files.java:727)
           at org.apache.ratis.util.FileUtils.lambda$createDirectories$4(FileUtils.java:70)
           at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
           at org.apache.ratis.util.FileUtils.createDirectories(FileUtils.java:69)
           at org.apache.ratis.util.FileUtils.createDirectories(FileUtils.java:65)
           at org.apache.ratis.server.storage.SnapshotManager.installSnapshot(SnapshotManager.java:106)
           at org.apache.ratis.server.impl.ServerState.installSnapshot(ServerState.java:480)
           at org.apache.ratis.server.impl.SnapshotInstallationHandler.checkAndInstallSnapshot(SnapshotInstallationHandler.java:181)
           at org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:120)
           at org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:94)
           at org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1517)
           at org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:640)
           at org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:242)
           at org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:239)
           at org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:124)
           at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
           at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
           at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:332)
           at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:315)
           at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
           at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
           at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] szetszwo commented on pull request #802: RATIS-1764. [GrpcLogAppender] Automatic recover from install snapshot failure

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #802:
URL: https://github.com/apache/ratis/pull/802#issuecomment-1370567614

   @SzyWilliam , the javadoc https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#createDirectories-java.nio.file.Path-java.nio.file.attribute.FileAttribute...- said that it won't throw an exception if the path already exists as a directory.  It will throw FileAlreadyExistsException when the path exists but is not a directory.  So, I think we need to delete it in this case.  Let's add `createDirectoriesDeleteExistingNonDirectory` to `FileUtils`.
   
   BTW, the `dir` parameter becomes unused after RATIS-1734.  Could you also remove it?
   
   ```java
   diff --git a/ratis-common/src/main/java/org/apache/ratis/util/FileUtils.java b/ratis-common/src/main/java/org/apache/ratis/util/FileUtils.java
   index 15ec6ea9..be736b09 100644
   --- a/ratis-common/src/main/java/org/apache/ratis/util/FileUtils.java
   +++ b/ratis-common/src/main/java/org/apache/ratis/util/FileUtils.java
   @@ -71,6 +71,21 @@ public interface FileUtils {
            () -> "Files.createDirectories " + dir);
      }
    
   +  static void createDirectoriesDeleteExistingNonDirectory(File dir) throws IOException {
   +    createDirectoriesDeleteExistingNonDirectory(dir.toPath());
   +  }
   +
   +  static void createDirectoriesDeleteExistingNonDirectory(Path dir) throws IOException {
   +    try {
   +      createDirectories(dir);
   +    } catch (FileAlreadyExistsException e) {
   +      LOG.warn("Failed to create directory " + dir
   +          + " since it already exists as a non-directory.  Trying to delete it ...", e);
   +      delete(dir);
   +      createDirectories(dir);
   +    }
   +  }
   +
      static void move(File src, File dst) throws IOException {
        move(src.toPath(), dst.toPath());
      }
   diff --git a/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java b/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java
   index 826c4cff..e0832c9b 100644
   --- a/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java
   +++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java
   @@ -441,7 +441,7 @@ class ServerState {
        // TODO: verify that we need to install the snapshot
        StateMachine sm = server.getStateMachine();
        sm.pause(); // pause the SM to prepare for install snapshot
   -    snapshotManager.installSnapshot(request, sm, getStorage().getStorageDir());
   +    snapshotManager.installSnapshot(request, sm);
        updateInstalledSnapshotIndex(TermIndex.valueOf(request.getSnapshotChunk().getTermIndex()));
      }
    
   diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
   index dbde8694..4ee94bf6 100644
   --- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
   +++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
   @@ -77,8 +77,7 @@ public class SnapshotManager {
            new File(dir.get().getRoot(), c.getFilename()).toPath()).toString();
      }
    
   -  public void installSnapshot(InstallSnapshotRequestProto request, StateMachine stateMachine, RaftStorageDirectory dir)
   -      throws IOException {
   +  public void installSnapshot(InstallSnapshotRequestProto request, StateMachine stateMachine) throws IOException {
        final InstallSnapshotRequestProto.SnapshotChunkProto snapshotChunkRequest = request.getSnapshotChunk();
        final long lastIncludedIndex = snapshotChunkRequest.getTermIndex().getIndex();
    
   @@ -102,7 +101,7 @@ public class SnapshotManager {
          }
    
          final File tmpSnapshotFile = new File(tmpDir, getRelativePath.apply(chunk));
   -      FileUtils.createDirectories(tmpSnapshotFile);
   +      FileUtils.createDirectoriesDeleteExistingNonDirectory(tmpSnapshotFile);
    
          FileOutputStream out = null;
          try {
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] SzyWilliam commented on pull request #802: RATIS-1764. [GrpcLogAppender] Automatic recover from install snapshot failure

Posted by GitBox <gi...@apache.org>.
SzyWilliam commented on PR #802:
URL: https://github.com/apache/ratis/pull/802#issuecomment-1370540424

   @szetszwo Thanks for reviewing this! 
   
   After digging deep into the error today, I found the problem is not caused by what I stated previous as 
   
   > The retry attempt will fail because the createDirectories [here](https://github.com/apache/ratis/blob/1b3c61d5dbcdb2b1ac5fd49a624a6cb88ab04744/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java#L105) will throw a file already exist exception. This PR add a if-clause to create directory only when the directory does not exists.
   
   The actual cause is, one large snapshot file may require **multiple installSnapshot RPCs** to transfer. The first RPC will create `tmpSnapshotFile`, and the subsequent RPCs will append chunk data to this existed `tmpSnapshotFile`.  These subsequent RPCs will fail when calling `createDirectories` as the `tmpSnapshotFile` already exists.
   
   Therefore, I think we shall call `createDirectories` once when `tmpSnapshotFile` does not exist. What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] SzyWilliam commented on pull request #802: RATIS-1764. [GrpcLogAppender] Automatic recover from install snapshot failure

Posted by GitBox <gi...@apache.org>.
SzyWilliam commented on PR #802:
URL: https://github.com/apache/ratis/pull/802#issuecomment-1370750681

   There is a streaming-md5 bug related to multi-chunk scene. I'll bring another PR to fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] SzyWilliam merged pull request #802: RATIS-1764. [GrpcLogAppender] Automatic recover from install snapshot failure

Posted by GitBox <gi...@apache.org>.
SzyWilliam merged PR #802:
URL: https://github.com/apache/ratis/pull/802


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] SzyWilliam commented on pull request #802: RATIS-1764. [GrpcLogAppender] Automatic recover from install snapshot failure

Posted by GitBox <gi...@apache.org>.
SzyWilliam commented on PR #802:
URL: https://github.com/apache/ratis/pull/802#issuecomment-1370748187

   @szetszwo Thanks for supplying the patch. I'll remove the unused dir param. 
   I've read the javadoc and found that `createDirectories` is misused here. The original purpose was to create all non-existing **parent** directories to `tmpSnapshotFile`. However, calling `createDirectories(tmpSnapshot)` will make `tmpSnapshotFile` itself a directory instead of creating parental directories. This mistake somehow passed unit tests and works well, probably due to logic here https://github.com/apache/ratis/blob/1b3c61d5dbcdb2b1ac5fd49a624a6cb88ab04744/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java#L112-L114
   
   Therefore, I add `tmpSnapshotFile.getParentFile()` in 
   ```java
   FileUtils.createDirectoriesDeleteExistingNonDirectory(tmpSnapshotFile.getParentFile());
   ```
   
   PTAL, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org