Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/08/16 14:16:24 UTC

[GitHub] [ozone] bshashikant opened a new pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant opened a new pull request #2538:
URL: https://github.com/apache/ozone/pull/2538


   ## What changes were proposed in this pull request?
   The patch ensures that readStateMachineData executes on the same thread on which the writeChunk for the same chunk is executing, if that write is still in progress. Chunk writes are protected with range locks.
   
   The patch is still a work in progress.
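   Conceptually, the ordering part of the fix chains a readStateMachineData call for a given log index behind any in-flight writeChunk future for the same index. A minimal sketch of just that idea (not the actual patch; writeFutures and readChunkFromDisk are hypothetical placeholder names):
   ```
     import java.util.concurrent.CompletableFuture;
     import java.util.concurrent.ConcurrentHashMap;
     import java.util.concurrent.ConcurrentMap;

     class ReadAfterWriteSketch {
       // Tracks the in-flight write future for each Raft log index.
       private final ConcurrentMap<Long, CompletableFuture<Void>> writeFutures =
           new ConcurrentHashMap<>();

       CompletableFuture<byte[]> readStateMachineData(long index) {
         // Chain the read behind an in-flight write for the same index;
         // if no write is pending, start from a completed future so the
         // read is issued immediately.
         CompletableFuture<Void> write = writeFutures.getOrDefault(
             index, CompletableFuture.completedFuture(null));
         return write.thenCompose(v -> readChunkFromDisk(index));
       }

       private CompletableFuture<byte[]> readChunkFromDisk(long index) {
         // Placeholder for the real chunk read on a chunk executor thread.
         return CompletableFuture.supplyAsync(() -> new byte[0]);
       }
     }
   ```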
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-5619
   ## How was this patch tested?
   Enabled a few tests.
   


[GitHub] [ozone] ChenSammi commented on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-903594803


   > The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable and leads to block file sparseness on disk.
   
   @bshashikant, for the ContainerStateMachine#read function, I'm not clear about one thing: when stateMachineDataCache.get(entry.getIndex()) returns null, why can't we assume that the data has already been written to disk?


[GitHub] [ozone] bshashikant commented on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-904271996


   > > > The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable and leads to block file sparseness on disk.
   > > 
   > > @bshashikant, for the ContainerStateMachine#read function, I'm not clear about one thing: when stateMachineDataCache.get(entry.getIndex()) returns null, why can't we assume that the data has already been written to disk?
   > > 
   > > Thanks @ChenSammi. Usually, the entry will stay in the stateMachine cache until the write chunk is applied, which means that if a readStateMachineData call comes in before the write chunk has finished, the entry should be in the cache. But the entry is added to the cache only on the leader. Now assume a case like this: 3 servers s1, s2 and s3, with S1 the leader, S2 a fast follower and S3 a slow follower. S1 sends entries 1 - 100 to S2 and S3 and then S1 dies. S2 has received all the log entries (but has yet to complete the write-to-disk part) and, being a follower, doesn't update the data cache. Now S2 will be sending appendEntries to S3. S3 might ask S2 for entries for which the write to disk has not completed, and these entries are not in the cache either, as this node was a follower when it received them.
   > > BTW, if a read chunk gets executed on a thread different from the one that did the write chunk, it doesn't get the same view of the underlying file, as the file handle is not closed on write. So even if the read chunk executes after the write chunk completes, it may not necessarily see the same data unless it's the same thread. That's why it's important to remove the concurrency across multiple chunk writes/reads for the same block.
   > 
   > I wasn't aware that stateMachineDataCache is only for the leader. Now I get the point. I'm wondering if we can have such a cache for followers too. It seems the data is still in memory before the write finishes; the cache just holds a reference to this data. What do you think?
   > 
   > Besides, several changes in your previous commit are also good improvements to the current code, such as the way of chunkExecutors allocation, and putting raftFuture instead of writeChunkFuture in the writeChunkFutureMap.
   
   Well, from my point of view, having followers maintain the cache as well, just to avoid a few read calls in failure scenarios, may not be the right thing to do. Performance improvement efforts can instead focus on streaming.
   
   > Besides, several changes in your previous commit are also good improvements to the current code, such as the way of chunkExecutors allocation, and putting raftFuture instead of writeChunkFuture in the writeChunkFutureMap.
   
   The idea is to go back and study the perf characteristics with this change, as well as the file channel concurrency model. All other changes can follow from that. For now, I would like to stick to solving the data integrity issue.
   


[GitHub] [ozone] ChenSammi commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r692827259



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/ChunkUtils.java
##########
@@ -183,17 +180,39 @@ private static long writeDataToChannel(FileChannel channel, ChunkBuffer data,
   public static void readData(File file, ByteBuffer[] buffers,
       long offset, long len, HddsVolume volume)
       throws StorageContainerException {
+    readData(file, null, buffers, offset, len, volume);
+  }
+
+  public static void readData(File file, FileChannel channel,
+      ByteBuffer[] buffers, long offset, long len, HddsVolume volume)
+      throws StorageContainerException {
 
     final Path path = file.toPath();
     final long startTime = Time.monotonicNow();
     final long bytesRead;
+    final long endOffsetToRead = offset + len;
 
     try {
+      if (file.length() > 0) {
+        Preconditions.checkArgument(file.length() >= endOffsetToRead,
+            "file length should atleast match the last offset to read");
+      }
       bytesRead = processFileExclusively(path, () -> {
-        try (FileChannel channel = open(path, READ_OPTIONS, NO_ATTRIBUTES);
-             FileLock ignored = channel.lock(offset, len, true)) {
+        // if the file handle is already cached, just use that for reads.
+        if (channel != null) {
+          try (FileLock lock = channel.lock(offset, len, true)) {
+            channel.position(offset);
+            return channel.position(offset).read(buffers);

Review comment:
       channel.position is called twice. 
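       A possible cleanup, keeping the existing shared lock (sketch only, not part of the patch):
   ```
     try (FileLock lock = channel.lock(offset, len, true)) {
       channel.position(offset);
       return channel.read(buffers);
     }
   ```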




[GitHub] [ozone] szetszwo commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

szetszwo commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r690075053



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/ChunkUtils.java
##########
@@ -183,17 +194,45 @@ private static long writeDataToChannel(FileChannel channel, ChunkBuffer data,
   public static void readData(File file, ByteBuffer[] buffers,
       long offset, long len, HddsVolume volume)
       throws StorageContainerException {
+    readData(file, null, buffers, offset, len, volume);
+  }
+
+  public static void readData(File file, FileChannel channel,
+      ByteBuffer[] buffers, long offset, long len, HddsVolume volume)
+      throws StorageContainerException {
 
     final Path path = file.toPath();
     final long startTime = Time.monotonicNow();
     final long bytesRead;
+    final long endOffsetToRead = offset + len;
 
     try {
+      if (file.length() > 0) {
+        Preconditions.checkArgument(file.length() >= endOffsetToRead,
+            "file length should atleast match the last offset to read");
+      }
       bytesRead = processFileExclusively(path, () -> {
-        try (FileChannel channel = open(path, READ_OPTIONS, NO_ATTRIBUTES);
-             FileLock ignored = channel.lock(offset, len, true)) {
+        // if the file handle is already cached, just use that for reads.
+        if (channel != null) {
+          FileLock lock = null;
+          try {
+            lock = channel.lock(offset, len, true);

Review comment:
       Similar to above, we may use try-with-resource here.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/ChunkUtils.java
##########
@@ -163,14 +161,27 @@ private static long writeDataToFile(File file, ChunkBuffer data,
 
   private static long writeDataToChannel(FileChannel channel, ChunkBuffer data,
       long offset) {
+    FileLock lock = null;
     try {
+      lock = channel.lock(offset, data.remaining(), false);

Review comment:
       We may use try-with-resource as below.  Then, we can remove the releaseLock(..) method.
   ```
     private static long writeDataToChannel(FileChannel channel, ChunkBuffer data,
         long offset) {
       try(FileLock lock = channel.lock(offset, data.remaining(), false)) {
         channel.position(offset);
         return data.writeTo(channel);
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
     }
   ```
   

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
##########
@@ -648,25 +651,38 @@ private ByteString readStateMachineData(
       // readStateMachineData should only be called for "write" to Ratis.
       Preconditions.checkArgument(!HddsUtils.isReadOnly(requestProto));
       if (requestProto.getCmdType() == Type.WriteChunk) {
-        final CompletableFuture<ByteString> future = new CompletableFuture<>();
-        ByteString data = stateMachineDataCache.get(entry.getIndex());
+        final CompletableFuture<ByteString> readStateMachineFuture =
+            new CompletableFuture<>();
+        long index = entry.getIndex();
+        ByteString data = stateMachineDataCache.get(index);
         if (data != null) {
-          future.complete(data);
-          return future;
+          readStateMachineFuture.complete(data);
+          return readStateMachineFuture;
         }
 
-        CompletableFuture.supplyAsync(() -> {
-          try {
-            future.complete(
-                readStateMachineData(requestProto, entry.getTerm(),
-                    entry.getIndex()));
-          } catch (IOException e) {
-            metrics.incNumReadStateMachineFails();
-            future.completeExceptionally(e);
-          }
-          return future;
-        }, getChunkExecutor(requestProto.getWriteChunk()));
-        return future;
+        // Queue the readStateMachine data readStateMachineFuture behind the
+        // write chunk for the same log index if it's under progress.
+        // Otherwise, issue the read
+        // chunk call over any chunk executor thread. Idea here is to ensure,
+        // read chunk always starts once the write chunk for the same log index
+        // finishes.
+        CompletableFuture<Message> response =
+            writeChunkFutureMap.computeIfPresent(index, (key, val) -> {

Review comment:
       computeIfPresent(..) will put a read future into the writeChunkFutureMap, which should only contain write futures. We may change the code as below:
   ```
         if (requestProto.getCmdType() == Type.WriteChunk) {
           final long index = entry.getIndex();
           final ByteString data = stateMachineDataCache.get(index);
           if (data != null) {
             return CompletableFuture.completedFuture(data);
           }
   
           final CompletableFuture<Message> writeFuture
               = Optional.ofNullable(writeChunkFutureMap.get(index))
               .orElse(CompletableFuture.completedFuture(null));
   
           // Queue the readStateMachine data readStateMachineFuture behind the
           // write chunk for the same log index if it's under progress.
           // Otherwise, issue the read
           // chunk call over any chunk executor thread. Idea here is to ensure,
           // read chunk always starts once the write chunk for the same log index
           // finishes.
           return writeFuture.thenCompose(
               w -> getReadStateMachineFuture(requestProto, entry));
         } else {
   ```
   

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
##########
@@ -678,6 +694,19 @@ private ByteString readStateMachineData(
     }
   }
 
+  private void getReadStateMachineFuture(
+      CompletableFuture<ByteString> readStateMachineFuture,
+      ContainerCommandRequestProto requestProto, LogEntryProto entry) {
+    try {
+      readStateMachineFuture.complete(
+          readStateMachineData(requestProto, entry.getTerm(),
+              entry.getIndex()));
+    } catch (IOException e) {
+      metrics.incNumReadStateMachineFails();
+      readStateMachineFuture.completeExceptionally(e);
+    }
+  }

Review comment:
       We should make this async as below:
   ```
     private CompletableFuture<ByteString> getReadStateMachineFuture(
         ContainerCommandRequestProto request, LogEntryProto entry) {
       return CompletableFuture.supplyAsync(() -> {
         try {
           return readStateMachineData(request, entry.getTerm(),
               entry.getIndex());
         } catch (IOException e) {
           metrics.incNumReadStateMachineFails();
           throw new CompletionException(
               "Failed to request=" + request + ", entry=" + entry, e);
         }
       }, chunkExecutor);
     }
   ```
   




[GitHub] [ozone] ChenSammi commented on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-903742887


   > > The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable and leads to block file sparseness on disk.
   > 
   > @bshashikant, for the ContainerStateMachine#read function, I'm not clear about one thing: when stateMachineDataCache.get(entry.getIndex()) returns null, why can't we assume that the data has already been written to disk?
   > 
   > Thanks @ChenSammi. Usually, the entry will stay in the stateMachine cache until the write chunk is applied, which means that if a readStateMachineData call comes in before the write chunk has finished, the entry should be in the cache. But the entry is added to the cache only on the leader. Now assume a case like this: 3 servers s1, s2 and s3, with S1 the leader, S2 a fast follower and S3 a slow follower. S1 sends entries 1 - 100 to S2 and S3 and then S1 dies. S2 has received all the log entries (but has yet to complete the write-to-disk part) and, being a follower, doesn't update the data cache. Now S2 will be sending appendEntries to S3. S3 might ask S2 for entries for which the write to disk has not completed, and these entries are not in the cache either, as this node was a follower when it received them.
   > 
   > BTW, if a read chunk gets executed on a thread different from the one that did the write chunk, it doesn't get the same view of the underlying file, as the file handle is not closed on write. So even if the read chunk executes after the write chunk completes, it may not necessarily see the same data unless it's the same thread. That's why it's important to remove the concurrency across multiple chunk writes/reads for the same block.
   
   I wasn't aware that stateMachineDataCache is only for the leader. Now I get the point. I'm wondering if we can have such a cache for followers too. It seems the data is still in memory before the write finishes; the cache just holds a reference to this data. What do you think?
   
   Besides, several changes in your previous commit are also good improvements to the current code, such as the way of chunkExecutors allocation, and putting raftFuture instead of writeChunkFuture in the writeChunkFutureMap.
   
   


[GitHub] [ozone] mukul1987 commented on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

mukul1987 commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-904545986


   Thanks for the reviews, everyone. Merging this change.


[GitHub] [ozone] ChenSammi commented on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-904274915


   > > > > The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable and leads to block file sparseness on disk.
   > > > 
   > > > @bshashikant, for the ContainerStateMachine#read function, I'm not clear about one thing: when stateMachineDataCache.get(entry.getIndex()) returns null, why can't we assume that the data has already been written to disk?
   > > > 
   > > > Thanks @ChenSammi. Usually, the entry will stay in the stateMachine cache until the write chunk is applied, which means that if a readStateMachineData call comes in before the write chunk has finished, the entry should be in the cache. But the entry is added to the cache only on the leader. Now assume a case like this: 3 servers s1, s2 and s3, with S1 the leader, S2 a fast follower and S3 a slow follower. S1 sends entries 1 - 100 to S2 and S3 and then S1 dies. S2 has received all the log entries (but has yet to complete the write-to-disk part) and, being a follower, doesn't update the data cache. Now S2 will be sending appendEntries to S3. S3 might ask S2 for entries for which the write to disk has not completed, and these entries are not in the cache either, as this node was a follower when it received them.
   > > > BTW, if a read chunk gets executed on a thread different from the one that did the write chunk, it doesn't get the same view of the underlying file, as the file handle is not closed on write. So even if the read chunk executes after the write chunk completes, it may not necessarily see the same data unless it's the same thread. That's why it's important to remove the concurrency across multiple chunk writes/reads for the same block.
   > > 
   > > I wasn't aware that stateMachineDataCache is only for the leader. Now I get the point. I'm wondering if we can have such a cache for followers too. It seems the data is still in memory before the write finishes; the cache just holds a reference to this data. What do you think?
   > > Besides, several changes in your previous commit are also good improvements to the current code, such as the way of chunkExecutors allocation, and putting raftFuture instead of writeChunkFuture in the writeChunkFutureMap.
   > 
   > Well, from my point of view, having followers maintain the cache as well, just to avoid a few read calls in failure scenarios, may not be the right thing to do. Performance improvement efforts can instead focus on streaming.
   > 
   > > Besides, several changes in your previous commit are also good improvements to the current code, such as the way of chunkExecutors allocation, and putting raftFuture instead of writeChunkFuture in the writeChunkFutureMap.
   > 
   > The idea is to go back and study the perf characteristics with this change, as well as the file channel concurrency model. All other changes can follow from that. For now, I would like to stick to solving the data integrity issue.
   
   Agreed, the improvements are good to have but not urgent; we can bring them back later.
   
   


[GitHub] [ozone] JacksonYao287 commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

JacksonYao287 commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r693706461



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/FilePerBlockStrategy.java
##########
@@ -95,15 +94,15 @@ public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
       throws StorageContainerException {
 
     checkLayoutVersion(container);
-
     Preconditions.checkNotNull(dispatcherContext);
     DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
 
     if (info.getLen() <= 0) {
       if (LOG.isDebugEnabled()) {
         LOG.debug("Skip writing empty chunk {} in stage {}", info, stage);
       }
-      return;
+      ExitUtils.terminate(1, "Empty Write chunk received " + info,

Review comment:
       Why do we terminate here? Is there something wrong with just returning?




[GitHub] [ozone] bshashikant edited a comment on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant edited a comment on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-900011687


   As per @szetszwo's suggestion, I divided the patch into two separate parts: the actual bug fix here, and data validation on the server side.
   
   https://github.com/apache/ozone/pull/2542 is the data validation PR.


[GitHub] [ozone] ChenSammi commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r693689237



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
##########
@@ -25,14 +25,8 @@
 import java.io.OutputStream;
 import java.util.List;
 import java.util.Map;
-import java.util.Objects;
-import java.util.concurrent.CompletableFuture;
-import java.util.concurrent.ConcurrentMap;
-import java.util.concurrent.ConcurrentHashMap;
-import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Executors;
-import java.util.concurrent.Semaphore;
-import java.util.concurrent.ThreadPoolExecutor;
+import java.util.Optional;
+import java.util.concurrent.*;

Review comment:
       If I remember correctly, wildcard imports ("import *") are not recommended by the hadoop/ozone checkstyle.




[GitHub] [ozone] mukul1987 merged pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

mukul1987 merged pull request #2538:
URL: https://github.com/apache/ozone/pull/2538


   


[GitHub] [ozone] bshashikant commented on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-903509096


   The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable.
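   To illustrate the hazard (an illustrative sketch only, not this patch's fix): a FileChannel keeps a single shared file position, so split position()/read()/write() calls from concurrent threads can interleave, while positional I/O sidesteps the shared position entirely.
   ```
     import java.io.IOException;
     import java.nio.ByteBuffer;
     import java.nio.channels.FileChannel;

     final class PositionalIoSketch {
       // Illustrative only: FileChannel has one shared position, so the
       // split sequence "channel.position(off); channel.write(buf);" races
       // when two threads interleave; one thread may end up writing at the
       // other thread's offset. Positional calls never touch the shared
       // position:
       static long writeAt(FileChannel channel, ByteBuffer buf, long offset)
           throws IOException {
         return channel.write(buf, offset); // pwrite-style, position untouched
       }

       static long readAt(FileChannel channel, ByteBuffer buf, long offset)
           throws IOException {
         return channel.read(buf, offset); // pread-style, position untouched
       }
     }
   ```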
   


[GitHub] [ozone] avijayanhwx commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

avijayanhwx commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r689883026



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
##########
@@ -595,6 +603,12 @@ ContainerCommandResponseProto handleReadChunk(
 
       data = chunkManager.readChunk(kvContainer, blockID, chunkInfo,
           dispatcherContext);
+      // Validate data only if the read chunk issued as a result of
+      // ratis call internally. For client reads, validation happens
+      // on the client
+      if (dispatcherContext.isReadFromTmpFile()) {
+        validateChunkData(data, chunkInfo);

Review comment:
       Why do we have checksum verification in chunkManager.readChunk as well as here?




[GitHub] [ozone] bshashikant commented on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-900011687


   As per @szetszwo's suggestion, I divided the patch into two separate parts: the actual bug fix here, and data validation on the server side.


[GitHub] [ozone] bshashikant commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r693719496



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/FilePerBlockStrategy.java
##########
@@ -95,15 +94,15 @@ public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
       throws StorageContainerException {
 
     checkLayoutVersion(container);
-
     Preconditions.checkNotNull(dispatcherContext);
     DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
 
     if (info.getLen() <= 0) {
       if (LOG.isDebugEnabled()) {
         LOG.debug("Skip writing empty chunk {} in stage {}", info, stage);
       }
-      return;
+      ExitUtils.terminate(1, "Empty Write chunk received " + info,

Review comment:
       Removed the code.




[GitHub] [ozone] bshashikant commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r689607616



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
##########
@@ -678,6 +694,19 @@ private ByteString readStateMachineData(
     }
   }
 
+  private void getReadStateMachineFuture(
+      CompletableFuture<ByteString> readStateMachineFuture,
+      ContainerCommandRequestProto requestProto, LogEntryProto entry) {
+    try {
+      readStateMachineFuture.complete(
+          readStateMachineData(requestProto, entry.getTerm(),
+              entry.getIndex()));
+    } catch (IOException e) {
+      metrics.incNumReadStateMachineFails();
+      readStateMachineFuture.completeExceptionally(e);

Review comment:
       The stateMachine will be marked unhealthy right at the readStateMachineData() call. Similarly, the container will be marked unhealthy in HddsDispatcher if readChunk fails.




[GitHub] [ozone] ChenSammi edited a comment on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi edited a comment on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-903594803


   > The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable and leads to block file sparseness on disk.
   
   @bshashikant, for the ContainerStateMachine#read function, I'm not clear about one thing: when stateMachineDataCache.get(entry.getIndex()) returns null, why can't we assume that the data has already been written to disk?
   
   Thanks @ChenSammi. Usually, the entry will stay in the stateMachine cache until the write chunk is applied, which means that if a readStateMachineData call comes in before the write chunk has finished, the entry should be in the cache. But the entry is added to the cache only on the leader. Now assume a case like this: 3 servers s1, s2 and s3, with S1 the leader, S2 a fast follower and S3 a slow follower. S1 sends entries 1 - 100 to S2 and S3 and then S1 dies. S2 has received all the log entries (but has yet to complete the write-to-disk part) and, being a follower, doesn't update the data cache. Now S2 will be sending appendEntries to S3. S3 might ask S2 for entries for which the write to disk has not completed, and these entries are not in the cache either, as this node was a follower when it received them.
   
   BTW, if a read chunk gets executed on a thread different from the one that did the write chunk, it doesn't get the same view of the underlying file, as the file handle is not closed on write. So even if the read chunk executes after the write chunk completes, it may not necessarily see the same data unless it's the same thread. That's why it's important to remove the concurrency across multiple chunk writes/reads for the same block.


[GitHub] [ozone] bshashikant commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r690042092



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
##########
@@ -595,6 +603,12 @@ ContainerCommandResponseProto handleReadChunk(
 
       data = chunkManager.readChunk(kvContainer, blockID, chunkInfo,
           dispatcherContext);
+      // Validate data only if the read chunk issued as a result of
+      // ratis call internally. For client reads, validation happens
+      // on the client
+      if (dispatcherContext.isReadFromTmpFile()) {
+        validateChunkData(data, chunkInfo);

Review comment:
       This verification happens only while reading a chunk because of the Ratis readStateMachineData call.




[GitHub] [ozone] bshashikant edited a comment on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

bshashikant edited a comment on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-903509096


   The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable and leads to block file sparseness on disk.
   


[GitHub] [ozone] ChenSammi commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r693689237



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
##########
@@ -25,14 +25,8 @@
 import java.io.OutputStream;
 import java.util.List;
 import java.util.Map;
-import java.util.Objects;
-import java.util.concurrent.CompletableFuture;
-import java.util.concurrent.ConcurrentMap;
-import java.util.concurrent.ConcurrentHashMap;
-import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Executors;
-import java.util.concurrent.Semaphore;
-import java.util.concurrent.ThreadPoolExecutor;
+import java.util.Optional;
+import java.util.concurrent.*;

Review comment:
       I remember that wildcard imports ("import *") are not recommended by the hadoop/ozone checkstyle.




[GitHub] [ozone] mukul1987 commented on a change in pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

mukul1987 commented on a change in pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#discussion_r689589409



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
##########
@@ -678,6 +694,19 @@ private ByteString readStateMachineData(
     }
   }
 
+  private void getReadStateMachineFuture(
+      CompletableFuture<ByteString> readStateMachineFuture,
+      ContainerCommandRequestProto requestProto, LogEntryProto entry) {
+    try {
+      readStateMachineFuture.complete(
+          readStateMachineData(requestProto, entry.getTerm(),
+              entry.getIndex()));
+    } catch (IOException e) {
+      metrics.incNumReadStateMachineFails();
+      readStateMachineFuture.completeExceptionally(e);

Review comment:
       Should we mark the statemachine unhealthy and the container unhealthy if we fail here?

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/freon/TestRandomKeyGenerator.java
##########
@@ -36,7 +36,7 @@
 /**
  * Tests Freon, with MiniOzoneCluster.
  */
-@Ignore("HDDS-3290")
+//@Ignore("HDDS-3290")

Review comment:
       Let's remove the @Ignore here.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
##########
@@ -52,6 +52,8 @@
   public static final String DISK_CHECK_TIMEOUT_KEY =
       "hdds.datanode.disk.check.timeout";
 
+  static final boolean CHUNK_DATA_VALIDATION_CHECK_DEFAULT = true;

Review comment:
       The value is set to true by default; let's make it false by default, but enable it in MiniOzoneCluster.

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/freon/TestRandomKeyGenerator.java
##########
@@ -62,6 +62,7 @@ public static void init() throws Exception {
     raftClientConfig.setRpcRequestTimeout(Duration.ofSeconds(3));
     raftClientConfig.setRpcWatchRequestTimeout(Duration.ofSeconds(3));
     conf.setFromObject(raftClientConfig);
+    conf.setBoolean("hdds.datanode.chunk.data.validation.check", true);

Review comment:
       I'd suggest enabling this flag by default on all the unit tests via MiniOzoneCluster, and also on the docker-based tests.

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/freon/TestRandomKeyGenerator.java
##########
@@ -127,7 +128,7 @@ public void ratisTest3() throws Exception {
   }
 
   @Test
-  @Ignore("HDDS-2011")
+  //@Ignore("HDDS-2011")

Review comment:
       Same here




[GitHub] [ozone] ChenSammi edited a comment on pull request #2538: HDDS-5619. Ozone data corruption issue on follower node.

ChenSammi edited a comment on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-903594803


   > The issue eventually turned out to be a race among writeChunk/readChunk threads all using the same file channel. With concurrent writers and readers on the same file, the behaviour seems unpredictable and leads to block file sparseness on disk.
   
   @bshashikant, for the ContainerStateMachine#read function, I'm not clear about one thing: when stateMachineDataCache.get(entry.getIndex()) returns null, why can't we assume that the data has already been written to disk?
   
   Thanks @ChenSammi. Usually, the entry will stay in the stateMachine cache until the write chunk is applied, which means that if a readStateMachineData call comes in before the write chunk has finished, the entry should be in the cache. But the entry is added to the cache only on the leader. Now assume a case like this: 3 servers s1, s2 and s3, with S1 the leader, S2 a fast follower and S3 a slow follower. S1 sends entries 1 - 100 to S2 and S3 and then S1 dies. S2 has received all the log entries (but has yet to complete the write-to-disk part) and, being a follower, doesn't update the data cache. Now S2 will be sending appendEntries to S3. S3 might ask S2 for entries for which the write to disk has not completed, and these entries are not in the cache either, as this node was a follower when it received them.

