You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/10/25 18:06:21 UTC

[GitHub] [ozone] umamaheswararao opened a new pull request #2767: Hdds 5491: EC: Write should handle node failures.

umamaheswararao opened a new pull request #2767:
URL: https://github.com/apache/ozone/pull/2767


   ## What changes were proposed in this pull request?
   
   Handled node failures. When nodes failed in write, we will get new block group and proceed to write there.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-5491
   
   ## How was this patch tested?
   
   Added tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741499089



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
##########
@@ -700,6 +714,7 @@ void writeChunkToContainer(ChunkBuffer chunk) throws IOException {
       handleInterruptedException(ex, false);
     }
     containerBlockData.addChunks(chunkInfo);
+    return null;

Review comment:
       If we are changing the signature of this method, shouldn't we also return the future inside the try block for future users of this method whom might expect a non-null value on success?
   
   I noticed that we are changing the method just to store the future object in ECBlockOutputStream, as we override the method there, why not just store the variable in the overridden method into a private field of ECBlockOutputStream?
   
   As an alternative if this seems to be useful to store and return, why not store here in the BlockOutputStream, and provide an accessor method for the future from here?

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       This variable is set from two places, from the write method, and from executePutBlock, executePutBlock happens on flush, and close, which is a possible race.
   With the current synchronous grpc client it is not a problem, but if we switch to really async writes in the grpc client, then this might cause a race, and we might miss to check some of the futures. Is there a guarantee that I miss here?

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -63,7 +69,65 @@ public ECBlockOutputStream(
 
   @Override
   public void write(byte[] b, int off, int len) throws IOException {
-    writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+    this.currentChunkRspFuture =
+        writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+  }
+
+  /**
+   * Writes buffered data as a new chunk to the container and saves chunk
+   * information to be used later in putKey call.
+   *
+   * @throws IOException if there is an I/O error while performing the call
+   * @throws OzoneChecksumException if there is an error while computing
+   * checksum
+   * @return ContainerCommandResponseProto
+   */
+  CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      writeChunkToContainer(ChunkBuffer chunk) throws IOException {

Review comment:
       We are duplicating the bulk of this method from BlockOutputStream, wouldn't it be better to skip adding the accessor methods for private variables of BlockOutputStream, and extract the part that is different to a method that we can override here?
   
   As I see the major change is the point where we are adding the chunkInfo objects to the containerBlockData builder used in putBlock. I am not sure, but I guess we can change the place of addChunkInfo call in the original method, at least I do not see a reason why a failed write should have that call, and with that and with returning the future in the original method we do not need this override, and the getters in the super class at all. What do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r742898972



##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Currently failure tests we added only from this JIRA. Yes, when failed last stripe will be written to new block group. This was the decision we made in one of the conversations. So, I don;t think we have any tests to modify now, but if when we add more and wanted to verify number of blockgroups with multiple time failures, then yes we should consider updating this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#issuecomment-958668106


   I think I submitted replies via Review button, so they are just duplicated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao merged pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao merged pull request #2767:
URL: https://github.com/apache/ozone/pull/2767


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#issuecomment-958668106


   I think I submitted replies via Review button, so they are just duplicated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741621874



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -63,7 +69,65 @@ public ECBlockOutputStream(
 
   @Override
   public void write(byte[] b, int off, int len) throws IOException {
-    writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+    this.currentChunkRspFuture =
+        writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+  }
+
+  /**
+   * Writes buffered data as a new chunk to the container and saves chunk
+   * information to be used later in putKey call.
+   *
+   * @throws IOException if there is an I/O error while performing the call
+   * @throws OzoneChecksumException if there is an error while computing
+   * checksum
+   * @return ContainerCommandResponseProto
+   */
+  CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      writeChunkToContainer(ChunkBuffer chunk) throws IOException {

Review comment:
       I agree, this is a good point. I started with copying that code part and proceeded with that. In fact we can remove this code and move addChunkInfo up into try block there in BOS.

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Data supposed to store in single block group. Since we introduced the failures after first stripe, the second stripe data should have been written into new blockgroup. So, we should have 2 block groups. That means two keyLocations.
   
   I added this comment to explain.

##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       The problem is we cannot allow the writer asynchronously out of one full stripe. Otherwise we have to cache all the stripe data which could be costly.  Let's say we allow multiple stripes asynchronously and later after writing multiple stripes data, first stripe received failure. If we don't cache all of the stripes we have written so far, we can;t copy back that data to new blockgroup. I think separately we can explore to cache elastically some stripes data and allow asynchronously. Currently the async nature applies to stripe level. I checked HDFS and we are inline with that behavior. Right after parity cells writing, we check for failures.

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Currently in write flow and close, we are already making sure to check the failures just before invoking executePutBlock. For flush, I don't think we are officially tested yet. Need tests for it separately and check. However that future is actually used to validate the response for failures. I quite surprise if write replies exception and subsequent executePutblock reply success. When we test flush and handle, we need to makesure we check for failures before writing parities. But it is very hard to support flush and IIRC, we are not supporting flush for EC.
   From HDFS:
     ```
   @Override
     public void hflush() {
       // not supported yet
       LOG.debug("DFSStripedOutputStream does not support hflush. "
           + "Caller should check StreamCapabilities before calling.");
     }
   ```

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
##########
@@ -700,6 +714,7 @@ void writeChunkToContainer(ChunkBuffer chunk) throws IOException {
       handleInterruptedException(ex, false);
     }
     containerBlockData.addChunks(chunkInfo);
+    return null;

Review comment:
       This is a good point. I have added that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741621874



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -63,7 +69,65 @@ public ECBlockOutputStream(
 
   @Override
   public void write(byte[] b, int off, int len) throws IOException {
-    writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+    this.currentChunkRspFuture =
+        writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+  }
+
+  /**
+   * Writes buffered data as a new chunk to the container and saves chunk
+   * information to be used later in putKey call.
+   *
+   * @throws IOException if there is an I/O error while performing the call
+   * @throws OzoneChecksumException if there is an error while computing
+   * checksum
+   * @return ContainerCommandResponseProto
+   */
+  CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      writeChunkToContainer(ChunkBuffer chunk) throws IOException {

Review comment:
       I agree, this is a good point. I started with copying that code part and proceeded with that. In fact we can remove this code and move addChunkInfo up into try block there in BOS.

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Data supposed to store in single block group. Since we introduced the failures after first stripe, the second stripe data should have been written into new blockgroup. So, we should have 2 block groups. That means two keyLocations.
   
   I added this comment to explain.

##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       The problem is we cannot allow the writer asynchronously out of one full stripe. Otherwise we have to cache all the stripe data which could be costly.  Let's say we allow multiple stripes asynchronously and later after writing multiple stripes data, first stripe received failure. If we don't cache all of the stripes we have written so far, we can;t copy back that data to new blockgroup. I think separately we can explore to cache elastically some stripes data and allow asynchronously. Currently the async nature applies to stripe level. I checked HDFS and we are inline with that behavior. Right after parity cells writing, we check for failures.

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Currently in write flow and close, we are already making sure to check the failures just before invoking executePutBlock. For flush, I don't think we are officially tested yet. Need tests for it separately and check. However that future is actually used to validate the response for failures. I quite surprise if write replies exception and subsequent executePutblock reply success. When we test flush and handle, we need to makesure we check for failures before writing parities. But it is very hard to support flush and IIRC, we are not supporting flush for EC.
   From HDFS:
     ```
   @Override
     public void hflush() {
       // not supported yet
       LOG.debug("DFSStripedOutputStream does not support hflush. "
           + "Caller should check StreamCapabilities before calling.");
     }
   ```

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
##########
@@ -700,6 +714,7 @@ void writeChunkToContainer(ChunkBuffer chunk) throws IOException {
       handleInterruptedException(ex, false);
     }
     containerBlockData.addChunks(chunkInfo);
+    return null;

Review comment:
       This is a good point. I have added that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r742966478



##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Sounds good, understood.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741499267



##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       With this call, we are synchronising on parity write, and we loose the async nature of the write on this front as well. We discussed today that the grpc client does not do writes async, if we fix that, then we still loose async writes as parity writes become synced. I believe this we should record as a future improvement here, as handling the rewrite of the stripe in a more async manner would probably require a significant change in how we buffer data.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r742898972



##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Currently failure tests we added only from this JIRA. Yes, when failed last stripe will be written to new block group. This was the decision we made in one of the conversations. So, I don;t think we have any tests to modify now, but if we add more and wanted to verify number of blockgroups, then yes we should consider this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741499354



##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Why we asserting 2 as the number of keyLocations? I am unsure what should be the size of this, but I assume it is a function of the block size the number of datanodes and the number of failures, isn't it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741621874



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -63,7 +69,65 @@ public ECBlockOutputStream(
 
   @Override
   public void write(byte[] b, int off, int len) throws IOException {
-    writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+    this.currentChunkRspFuture =
+        writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+  }
+
+  /**
+   * Writes buffered data as a new chunk to the container and saves chunk
+   * information to be used later in putKey call.
+   *
+   * @throws IOException if there is an I/O error while performing the call
+   * @throws OzoneChecksumException if there is an error while computing
+   * checksum
+   * @return ContainerCommandResponseProto
+   */
+  CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      writeChunkToContainer(ChunkBuffer chunk) throws IOException {

Review comment:
       I agree, this is a good point. I started with copying that code part and proceeded with that. In fact we can remove this code and move addChunkInfo up into try block there in BOS.

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Data supposed to store in single block group. Since we introduced the failures after first stripe, the second stripe data should have been written into new blockgroup. So, we should have 2 block groups. That means two keyLocations.
   
   I added this comment to explain.

##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       The problem is we cannot allow the writer asynchronously out of one full stripe. Otherwise we have to cache all the stripe data which could be costly.  Let's say we allow multiple stripes asynchronously and later after writing multiple stripes data, first stripe received failure. If we don't cache all of the stripes we have written so far, we can;t copy back that data to new blockgroup. I think separately we can explore to cache elastically some stripes data and allow asynchronously. Currently the async nature applies to stripe level. I checked HDFS and we are inline with that behavior. Right after parity cells writing, we check for failures.

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Currently in write flow and close, we are already making sure to check the failures just before invoking executePutBlock. For flush, I don't think we are officially tested yet. Need tests for it separately and check. However that future is actually used to validate the response for failures. I quite surprise if write replies exception and subsequent executePutblock reply success. When we test flush and handle, we need to makesure we check for failures before writing parities. But it is very hard to support flush and IIRC, we are not supporting flush for EC.
   From HDFS:
     ```
   @Override
     public void hflush() {
       // not supported yet
       LOG.debug("DFSStripedOutputStream does not support hflush. "
           + "Caller should check StreamCapabilities before calling.");
     }
   ```

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
##########
@@ -700,6 +714,7 @@ void writeChunkToContainer(ChunkBuffer chunk) throws IOException {
       handleInterruptedException(ex, false);
     }
     containerBlockData.addChunks(chunkInfo);
+    return null;

Review comment:
       This is a good point. I have added that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741971589



##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       Yes, that's what I meant as well, but I realized I forgot my point here :D I wanted to suggest to note this behaviour in the comment above the if or in checkStreamFailures API doc ;)

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Understood. Won't this be a problem if we write more chunks via this general purpose method, and the amount of data written spans into multiple blockggroups already, and the failures injected are also adding more block groups?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#issuecomment-958668106






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r742898972



##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Currently failure tests we added only from this JIRA. Yes, when failed last stripe will be written to new block group. This was the decision we made in one of the conversations. So, I don;t think we have any tests to modify now, but if when we add more and wanted to verify number of blockgroups with multiple time failures, then yes we should consider updating this. I will take multiple failure cases separately and add more tests.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741971338



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Ah, if flushes won't be supported, then probably this is not a problem, as executePutBlock than will happen after evaluating the future in writes (due to the sync nature of how we test for failures). I can live with this, though I still wondering if it would be safer to use two references one to store futures from wirtes, and one from putBlock, and check for both in the failure check, though I do not have any case in mind atm without flush where this would be a real problem in the current system, so I am ok with leaving this as it is.

##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       Yes, that's what I meant as well, but I realized I forgot my point here :D I wanted to suggest to note this behaviour in the comment above the if or in checkStreamFailures API doc ;)

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Understood. Won't this be a problem if we write more chunks via this general purpose method, and the amount of data written spans into multiple blockggroups already, and the failures injected are also adding more block groups?

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Ah, if flushes won't be supported, then probably this is not a problem, as executePutBlock than will happen after evaluating the future in writes (due to the sync nature of how we test for failures). I can live with this, though I still wondering if it would be safer to use two references one to store futures from wirtes, and one from putBlock, and check for both in the failure check, though I do not have any case in mind atm without flush where this would be a real problem in the current system, so I am ok with leaving this as it is.

##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       Yes, that's what I meant as well, but I realized I forgot my point here :D I wanted to suggest to note this behaviour in the comment above the if or in checkStreamFailures API doc ;)

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Understood. Won't this be a problem if we write more chunks via this general purpose method, and the amount of data written spans into multiple blockggroups already, and the failures injected are also adding more block groups?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741971338



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Ah, if flushes won't be supported, then probably this is not a problem, as executePutBlock than will happen after evaluating the future in writes (due to the sync nature of how we test for failures). I can live with this, though I still wondering if it would be safer to use two references one to store futures from wirtes, and one from putBlock, and check for both in the failure check, though I do not have any case in mind atm without flush where this would be a real problem in the current system, so I am ok with leaving this as it is.

##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       Yes, that's what I meant as well, but I realized I forgot my point here :D I wanted to suggest to note this behaviour in the comment above the if or in checkStreamFailures API doc ;)

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Understood. Won't this be a problem if we write more chunks via this general purpose method, and the amount of data written spans into multiple blockggroups already, and the failures injected are also adding more block groups?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] fapifta commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

fapifta commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741971338



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Ah, if flushes won't be supported, then probably this is not a problem, as executePutBlock than will happen after evaluating the future in writes (due to the sync nature of how we test for failures). I can live with this, though I still wondering if it would be safer to use two references one to store futures from wirtes, and one from putBlock, and check for both in the failure check, though I do not have any case in mind atm without flush where this would be a real problem in the current system, so I am ok with leaving this as it is.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r742898972



##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Currently failure tests we added only from this JIRA. Yes, when failed last stripe will be written to new block group. This was the decision we made in one of the conversations. So, I don;t think we have any tests to modify now, but if we add more and wanted to verify number of blockgroups with multiple time failures, then yes we should consider updating this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a change in pull request #2767: Hdds 5491: EC: Write should handle node failures.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on a change in pull request #2767:
URL: https://github.com/apache/ozone/pull/2767#discussion_r741621874



##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -63,7 +69,65 @@ public ECBlockOutputStream(
 
   @Override
   public void write(byte[] b, int off, int len) throws IOException {
-    writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+    this.currentChunkRspFuture =
+        writeChunkToContainer(ChunkBuffer.wrap(ByteBuffer.wrap(b, off, len)));
+  }
+
+  /**
+   * Writes buffered data as a new chunk to the container and saves chunk
+   * information to be used later in putKey call.
+   *
+   * @throws IOException if there is an I/O error while performing the call
+   * @throws OzoneChecksumException if there is an error while computing
+   * checksum
+   * @return ContainerCommandResponseProto
+   */
+  CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      writeChunkToContainer(ChunkBuffer chunk) throws IOException {

Review comment:
       I agree, this is a good point. I started with copying that code part and proceeded with that. In fact we can remove this code and move addChunkInfo up into try block there in BOS.

##########
File path: hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestOzoneECClient.java
##########
@@ -420,6 +420,80 @@ public void testPartialStripeWithPartialLastChunk()
     }
   }
 
+  @Test
+  public void testWriteShouldFailIfMoreThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(3, 3);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfLessThanParityNodesFail()
+      throws IOException {
+    testNodeFailuresWhileWriting(1, 2);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIf4NodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  @Test
+  public void testWriteShouldSuccessIfAllNodesFailed()
+      throws IOException {
+    testNodeFailuresWhileWriting(4, 1);
+  }
+
+  public void testNodeFailuresWhileWriting(int numFailureToInject,
+      int numChunksToWriteAfterFailure) throws IOException {
+    store.createVolume(volumeName);
+    OzoneVolume volume = store.getVolume(volumeName);
+    volume.createBucket(bucketName);
+    OzoneBucket bucket = volume.getBucket(bucketName);
+
+    try (OzoneOutputStream out = bucket.createKey(keyName, 1024 * 3,
+        new ECReplicationConfig(3, 2, ECReplicationConfig.EcCodec.RS,
+            chunkSize), new HashMap<>())) {
+      for (int i = 0; i < dataBlocks; i++) {
+        out.write(inputChunks[i]);
+      }
+
+      List<DatanodeDetails> failedDNs = new ArrayList<>();
+      Map<DatanodeDetails, MockDatanodeStorage> storages =
+          ((MockXceiverClientFactory) factoryStub).getStorages();
+      DatanodeDetails[] dnDetails =
+          storages.keySet().toArray(new DatanodeDetails[storages.size()]);
+      for (int i = 0; i < numFailureToInject; i++) {
+        failedDNs.add(dnDetails[i]);
+      }
+
+      // First let's set storage as bad
+      ((MockXceiverClientFactory) factoryStub).setFailedStorages(failedDNs);
+
+      for (int i = 0; i < numChunksToWriteAfterFailure; i++) {
+        out.write(inputChunks[i]);
+      }
+    }
+    final OzoneKeyDetails key = bucket.getKey(keyName);
+    Assert.assertEquals(2, key.getOzoneKeyLocations().size());

Review comment:
       Data supposed to store in single block group. Since we introduced the failures after first stripe, the second stripe data should have been written into new blockgroup. So, we should have 2 block groups. That means two keyLocations.
   
   I added this comment to explain.

##########
File path: hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/ECKeyOutputStream.java
##########
@@ -239,7 +293,15 @@ private void handleParityWrites(int parityCellSize,
     // TODO: we should alter the put block calls to share CRC to each stream.
     ECBlockOutputStreamEntry streamEntry =
         blockOutputStreamEntryPool.getCurrentStreamEntry();
+    // Since writes are async, let's check the failures once.
+    if(streamEntry.checkStreamFailures()){

Review comment:
       The problem is we cannot allow the writer asynchronously out of one full stripe. Otherwise we have to cache all the stripe data which could be costly.  Let's say we allow multiple stripes asynchronously and later after writing multiple stripes data, first stripe received failure. If we don't cache all of the stripes we have written so far, we can;t copy back that data to new blockgroup. I think separately we can explore to cache elastically some stripes data and allow asynchronously. Currently the async nature applies to stripe level. I checked HDFS and we are inline with that behavior. Right after parity cells writing, we check for failures.

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ECBlockOutputStream.java
##########
@@ -35,12 +38,15 @@
 import java.util.concurrent.ExecutionException;
 
 import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.putBlockAsync;
+import static org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.writeChunkAsync;
 
 /**
  * Handles the chunk EC writes for an EC internal block.
  */
 public class ECBlockOutputStream extends BlockOutputStream{
 
+  private CompletableFuture<ContainerProtos.ContainerCommandResponseProto>
+      currentChunkRspFuture = null;

Review comment:
       Currently in write flow and close, we are already making sure to check the failures just before invoking executePutBlock. For flush, I don't think we are officially tested yet. Need tests for it separately and check. However that future is actually used to validate the response for failures. I quite surprise if write replies exception and subsequent executePutblock reply success. When we test flush and handle, we need to makesure we check for failures before writing parities. But it is very hard to support flush and IIRC, we are not supporting flush for EC.
   From HDFS:
     ```
   @Override
     public void hflush() {
       // not supported yet
       LOG.debug("DFSStripedOutputStream does not support hflush. "
           + "Caller should check StreamCapabilities before calling.");
     }
   ```

##########
File path: hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
##########
@@ -700,6 +714,7 @@ void writeChunkToContainer(ChunkBuffer chunk) throws IOException {
       handleInterruptedException(ex, false);
     }
     containerBlockData.addChunks(chunkInfo);
+    return null;

Review comment:
       This is a good point. I have added that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org