You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "ibrusentsev (via GitHub)" <gi...@apache.org> on 2023/12/04 16:49:59 UTC

[PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

ibrusentsev opened a new pull request, #5727:
URL: https://github.com/apache/ozone/pull/5727

   ### Motivation
   When a key is uploading by XcieverClientRatis, and some datanode becomes unavailable, it is expected that client should request new pipeline to retry upload.
   In fact, before that client tries to repeat commit check with MAJORITY_COMMITTED replication level, which cannot be successful as at that moment pipeline is already closed.
   XceiverClientRatis has method watchForCommit(long index), which contains exception check
   
   ```
   if (t instanceof GroupMismatchException) {
     throw e;
   }
   ```
   GroupMismatchException throws by Ratis client exactly when some datanode is not available and further key upload is not available for current pipeline.
   
   But this check does not work as `Throwable t = HddsClientUtils.checkForException(e);` does not unwrap exception completely. The idea is fix lookup of nested exceptions to find proper one.
   
   ### Changes
   Replace `checkForException` method with new `containsException` method, which checks all nested exceptions if one of them matched with expected exception class.
   
   ### Result
   This improves failover latency by 15 seconds approximately.
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-9826
   
   ## How was this patch tested?
   
   Extended integration test testWatchForCommitForRetryfailure()
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "ibrusentsev (via GitHub)" <gi...@apache.org>.
ibrusentsev commented on code in PR #5727:
URL: https://github.com/apache/ozone/pull/5727#discussion_r1417035614


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestWatchForCommit.java:
##########
@@ -369,6 +369,9 @@ public void testWatchForCommitForGroupMismatchException() throws Exception {
       } catch (Exception e) {
         Assert.assertTrue(HddsClientUtils
             .checkForException(e) instanceof GroupMismatchException);
+        // client should not attempt to watch with
+        // MAJORITY_COMMITTED replication level
+        Assert.assertFalse(e.getMessage().contains("Watch-MAJORITY_COMMITTED"));

Review Comment:
   Added some workaround to make the test stable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "ibrusentsev (via GitHub)" <gi...@apache.org>.
ibrusentsev commented on code in PR #5727:
URL: https://github.com/apache/ozone/pull/5727#discussion_r1415712607


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestWatchForCommit.java:
##########
@@ -282,6 +288,9 @@ public void testWatchForCommitForRetryfailure() throws Exception {
         // RuntimeException
         Assert.assertFalse(HddsClientUtils
             .checkForException(e) instanceof TimeoutException);
+        // client should not attempt to watch with
+        // MAJORITY_COMMITTED replication level
+        Assert.assertFalse(e.getMessage().contains("Watch-MAJORITY_COMMITTED"));

Review Comment:
   Unfortunately, no. Exceptions are equal, the difference is only in message. But I moved this check to another test testWatchForCommitForGroupMismatchException() to avoid flaky behaviour - when shutdown DN, the "java.net.ConnectException: Connection refused" is thrown sometimes, instead of GroupMismatchedException.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on code in PR #5727:
URL: https://github.com/apache/ozone/pull/5727#discussion_r1416396623


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestWatchForCommit.java:
##########
@@ -369,6 +369,9 @@ public void testWatchForCommitForGroupMismatchException() throws Exception {
       } catch (Exception e) {
         Assert.assertTrue(HddsClientUtils
             .checkForException(e) instanceof GroupMismatchException);
+        // client should not attempt to watch with
+        // MAJORITY_COMMITTED replication level
+        Assert.assertFalse(e.getMessage().contains("Watch-MAJORITY_COMMITTED"));

Review Comment:
   Can you please explain what this is testing?  This assertion passes without any other code change on `master`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix handling of GroupMismatchException in XceiverClientRatis.watchForCommit [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #5727:
URL: https://github.com/apache/ozone/pull/5727#issuecomment-1842854148

   Thanks @ibrusentsev for the patch, @jojochuang for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "ibrusentsev (via GitHub)" <gi...@apache.org>.
ibrusentsev commented on PR #5727:
URL: https://github.com/apache/ozone/pull/5727#issuecomment-1842255866

   @adoroszlai thank you for the review.
   > I think the real problem is that `checkForException` requires `t` to have a cause. This requirement was added in [29cd37d](https://github.com/apache/ozone/commit/29cd37db9f712844403f1d0bb5af94e2d72712a5), a few months after handling of `GroupMismatchException` in `XceiverClientRatis` in [d6ccaf7](https://github.com/apache/ozone/commit/d6ccaf7ecf7e26bb302075463f972b34e3f87219).
   
   I'm not sure about that. The problem is that checkForException() accepts exception with following cause structure: java.util.concurrent.ExecutionException -> org.apache.ratis.protocol.exceptions.RaftRetryFailureException -> org.apache.ratis.protocol.exceptions.GroupMismatchException. We need to extract GroupMismatchException, but getting RaftRetryFailureException, as both of these exceptions are in the getExceptionList().


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "ibrusentsev (via GitHub)" <gi...@apache.org>.
ibrusentsev commented on code in PR #5727:
URL: https://github.com/apache/ozone/pull/5727#discussion_r1416858311


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestWatchForCommit.java:
##########
@@ -369,6 +369,9 @@ public void testWatchForCommitForGroupMismatchException() throws Exception {
       } catch (Exception e) {
         Assert.assertTrue(HddsClientUtils
             .checkForException(e) instanceof GroupMismatchException);
+        // client should not attempt to watch with
+        // MAJORITY_COMMITTED replication level
+        Assert.assertFalse(e.getMessage().contains("Watch-MAJORITY_COMMITTED"));

Review Comment:
   Indeed, it actually tests nothing now. I moved it from testWatchForCommitForRetryfailure(), because I was flaky - sometimes instead expected exception` (java.util.concurrent.ExecutionException -> org.apache.ratis.protocol.exceptions.RaftRetryFailureException -> org.apache.ratis.protocol.exceptions.GroupMismatchException)`, got a `(java.util.concurrent.ExecutionException -> org.apache.ratis.protocol.exceptions.RaftRetryFailureException -> java.io.IOException -> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException -> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused)`, when RaftServer (leader DN) tries to reach node that was shutdown. Can't get how to make it stable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "ibrusentsev (via GitHub)" <gi...@apache.org>.
ibrusentsev commented on code in PR #5727:
URL: https://github.com/apache/ozone/pull/5727#discussion_r1417035000


##########
hadoop-hdds/client/src/test/java/org/apache/hadoop/hdds/scm/client/HddsClientUtilsTest.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ */
+package org.apache.hadoop.hdds.scm.client;
+
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.net.ConnectException;
+
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.assertSame;
+
+class HddsClientUtilsTest {

Review Comment:
   Thanks, done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "jojochuang (via GitHub)" <gi...@apache.org>.
jojochuang commented on code in PR #5727:
URL: https://github.com/apache/ozone/pull/5727#discussion_r1414272724


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestWatchForCommit.java:
##########
@@ -282,6 +288,9 @@ public void testWatchForCommitForRetryfailure() throws Exception {
         // RuntimeException
         Assert.assertFalse(HddsClientUtils
             .checkForException(e) instanceof TimeoutException);
+        // client should not attempt to watch with
+        // MAJORITY_COMMITTED replication level
+        Assert.assertFalse(e.getMessage().contains("Watch-MAJORITY_COMMITTED"));

Review Comment:
   Nit: Is it possible to match exception class rather than exception message? Makes future maintenance easier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix handling of GroupMismatchException in XceiverClientRatis.watchForCommit [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai merged PR #5727:
URL: https://github.com/apache/ozone/pull/5727


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on code in PR #5727:
URL: https://github.com/apache/ozone/pull/5727#discussion_r1416384212


##########
hadoop-hdds/client/src/test/java/org/apache/hadoop/hdds/scm/client/HddsClientUtilsTest.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ */
+package org.apache.hadoop.hdds.scm.client;
+
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.net.ConnectException;
+
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.assertSame;
+
+class HddsClientUtilsTest {

Review Comment:
   Ozone tests should be name `Test*`, not `*Test`.  The latter is not run by Maven:
   
   https://github.com/apache/ozone/blob/3c9df5cf3292b2feb0ffa1aa9c254b1c51c8cb9a/pom.xml#L2022-L2024
   
   Test class for `HddsClientUtils` exists, unfortunately it's in the wrong directory:
   
   `hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/TestHddsClientUtils.java`
   
   Created HDDS-9845 for relocating this test class to the right place.
   
   For now, please add the new test case to the existing `TestHddsClientUtils`,



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9826. Fix exception handling if one Datanode is not available (Ratis) [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #5727:
URL: https://github.com/apache/ozone/pull/5727#issuecomment-1842269631

   > > I think the real problem is that `checkForException` requires `t` to have a cause.
   > 
   > I'm not sure about that. The problem is that checkForException() accepts exception with following cause structure: `java.util.concurrent.ExecutionException -> org.apache.ratis.protocol.exceptions.RaftRetryFailureException -> org.apache.ratis.protocol.exceptions.GroupMismatchException`. We need to extract GroupMismatchException, but getting RaftRetryFailureException, as both of these exceptions are in the getExceptionList().
   
   Thanks for the details.  `RaftRetryFailureException` being in the list is definitely another problem here.  We could fix that by not stopping at the first exception found if there is a root cause, but that might introduce problems elsewhere in the code.
   
   So I'm OK with adding the new method to fix this case.
   
   (The test code still needs to be moved, though.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org