You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by "sumitagrawl (via GitHub)" <gi...@apache.org> on 2023/05/25 15:35:53 UTC

[GitHub] [ratis] sumitagrawl opened a new pull request, #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

sumitagrawl opened a new pull request, #883:
URL: https://github.com/apache/ratis/pull/883

   ## What changes were proposed in this pull request?
   
   To avoid aggresive retry when peer is unrechable failure or timeout, error wait strategy introduced
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/RATIS-1836
   
   ## How was this patch tested?
   
   - Integration test done with ozone environment
   - Unit test (pending)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] sumitagrawl commented on pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on PR #883:
URL: https://github.com/apache/ratis/pull/883#issuecomment-1569404661

   > @sumitagrawl , thanks for the update. Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13058648/883_review.patch
   
   @szetszwo Its fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] sumitagrawl commented on pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on PR #883:
URL: https://github.com/apache/ratis/pull/883#issuecomment-1563167616

   > RetryPolicies
   
   @adoroszlai 
   - Retry policy is fixed wait retry strategy, ErrWaitStrategy can be embedded as another retry policy
   - ErrWaitStrategy focus over incremental wait with failure count increasing
   - GrpcLogAppender -- this is not a retry operation, but trigger events at regular interval. We can integrate behavior of RetryPolicy wait time (with ErrWaitStrategy), but may need a quite refactoring. If this is required to refactor GrpcAppender, can take as another JIRA for improvement.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] szetszwo commented on a diff in pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "szetszwo (via GitHub)" <gi...@apache.org>.
szetszwo commented on code in PR #883:
URL: https://github.com/apache/ratis/pull/883#discussion_r1210369002


##########
ratis-common/src/main/java/org/apache/ratis/util/ErrorWaitStrategy.java:
##########
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.ratis.util;
+
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicLong;
+import org.apache.ratis.retry.MultipleLinearRandomRetry;
+import org.apache.ratis.retry.RetryPolicy;
+
+/*
+ * Wait strategy for incremental wait based on error count.
+ */
+public class ErrorWaitStrategy implements RetryPolicy.Event {

Review Comment:
   After we use `MultipleLinearRandomRetry`, this class may not be useful.  Let's remove it.



##########
ratis-server-api/src/main/java/org/apache/ratis/server/DivisionProperties.java:
##########
@@ -52,4 +52,7 @@ default int maxRpcTimeoutMs() {
 
   /** @return the rpc slowness timeout. */
   TimeDuration rpcSlownessTimeout();
+
+  /** @return the additional error wait delay value. */
+  String errorWaitDelayValue();

Review Comment:
   Since this is an internal conf (not something we want other application such as Ozone to use it).  Please don't add it  to server-api.



##########
ratis-server-api/src/main/java/org/apache/ratis/server/RaftServerConfigKeys.java:
##########
@@ -743,6 +743,21 @@ static TimeDuration slownessTimeout(RaftProperties properties) {
     static void setSlownessTimeout(RaftProperties properties, TimeDuration expiryTime) {
       setTimeDuration(properties::setTimeDuration, SLOWNESS_TIMEOUT_KEY, expiryTime);
     }
+    
+    String ERROR_WAIT_DELAY_VALUE = PREFIX + ".error.wait.delay";

Review Comment:
   Move the new conf to `RaftServerConfigKeys.Log.Appender` and rename it to
   ```java
         String RETRY_POLICY_KEY = PREFIX + ".retry.policy";
   ```



##########
ratis-server/src/main/java/org/apache/ratis/server/impl/DivisionPropertiesImpl.java:
##########
@@ -28,6 +28,7 @@ class DivisionPropertiesImpl implements DivisionProperties {
   private final TimeDuration rpcTimeoutMax;
   private final TimeDuration rpcSleepTime;
   private final TimeDuration rpcSlownessTimeout;
+  private final String errorWaitDelayValue;

Review Comment:
   Remove this field.  Read the property directly in `GrpcLogAppender`.



##########
ratis-server-api/src/main/java/org/apache/ratis/server/RaftServerConfigKeys.java:
##########
@@ -743,6 +743,21 @@ static TimeDuration slownessTimeout(RaftProperties properties) {
     static void setSlownessTimeout(RaftProperties properties, TimeDuration expiryTime) {
       setTimeDuration(properties::setTimeDuration, SLOWNESS_TIMEOUT_KEY, expiryTime);
     }
+    
+    String ERROR_WAIT_DELAY_VALUE = PREFIX + ".error.wait.delay";
+    /*
+    * min wait time as 1ms (0 is not allowed) for first 10 
+    * (5 iteration with 2 times grpc client retry)
+    * next wait 1sec for next 20 retry (10 iteration with 2 times grpc client)
+    * further wait for 5sec for max times ((5sec*980)/2 times ~= 40min)
+     */

Review Comment:
   Use javadoc comment and align the `*`'s, i.e.
   ```java
   /**
    *
    *
    */
   ```
   



##########
ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java:
##########
@@ -81,6 +81,7 @@ public class GrpcLogAppender extends LogAppenderBase {
 
   private final AutoCloseableReadWriteLock lock;
   private final StackTraceElement caller;
+  private final ErrorWaitStrategy errorWaitStrategy;

Review Comment:
   Let's directly put
   ```java
     private final RetryPolicy retryPolicy;
     private final AtomicInteger errorCount = new AtomicInteger();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] szetszwo merged pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "szetszwo (via GitHub)" <gi...@apache.org>.
szetszwo merged PR #883:
URL: https://github.com/apache/ratis/pull/883


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] szetszwo commented on a diff in pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "szetszwo (via GitHub)" <gi...@apache.org>.
szetszwo commented on code in PR #883:
URL: https://github.com/apache/ratis/pull/883#discussion_r1212741602


##########
ratis-server/src/test/java/org/apache/ratis/RaftBasicTests.java:
##########
@@ -170,7 +170,7 @@ static void runTestBasicAppendEntries(
     final List<RaftServer.Division> divisions = cluster.getServerAliveStream().collect(Collectors.toList());
     for(RaftServer.Division impl: divisions) {
         JavaUtils.attempt(() -> RaftTestUtil.assertLogEntries(impl, term, messages),
-            5, TimeDuration.valueOf(1, TimeUnit.SECONDS), impl.getId() + " assertLogEntries", LOG);
+            5, TimeDuration.valueOf(10, TimeUnit.SECONDS), impl.getId() + " assertLogEntries", LOG);

Review Comment:
   Let's keep using 1s and change numAttempts instead, i.e.
   ```java
               50, TimeDuration.ONE_SECOND, impl.getId() + " assertLogEntries", LOG);
   ```
   Otherwise, it has to wait for 10s even if it can finish in 1s.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] sumitagrawl commented on a diff in pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on code in PR #883:
URL: https://github.com/apache/ratis/pull/883#discussion_r1212500175


##########
ratis-server/src/test/java/org/apache/ratis/InstallSnapshotFromLeaderTests.java:
##########
@@ -68,6 +68,7 @@
         prop, SNAPSHOT_TRIGGER_THRESHOLD);
     RaftServerConfigKeys.Snapshot.setAutoTriggerEnabled(prop, true);
     RaftServerConfigKeys.Log.Appender.setSnapshotChunkSizeMax(prop, SizeInBytes.ONE_KB);
+    RaftServerConfigKeys.Log.Appender.setRetryPolicy(getProperties(), "1ms,1000");

Review Comment:
   Updated, added only in one testcase where it is having wait time less for cluster to come up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] sumitagrawl commented on a diff in pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on code in PR #883:
URL: https://github.com/apache/ratis/pull/883#discussion_r1209766477


##########
ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java:
##########
@@ -101,6 +102,7 @@ public GrpcLogAppender(RaftServer.Division server, LeaderState leaderState, Foll
 
     lock = new AutoCloseableReadWriteLock(this);
     caller = LOG.isTraceEnabled()? JavaUtils.getCallerStackTraceElement(): null;
+    errorWaitStrategy = new ErrorWaitStrategy("5,1000,10,5000");

Review Comment:
   Updated...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] adoroszlai commented on pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #883:
URL: https://github.com/apache/ratis/pull/883#issuecomment-1563215579

   @sumitagrawl maybe I'm missing something, but `ErrWaitStrategy` very much resembles `MultipleLinearRandomRetry`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] sumitagrawl commented on a diff in pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on code in PR #883:
URL: https://github.com/apache/ratis/pull/883#discussion_r1209766672


##########
ratis-common/src/main/java/org/apache/ratis/util/ErrorWaitStrategy.java:
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.ratis.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.atomic.AtomicLong;
+
+/*
+ * Wait strategy for incremental wait based on error count.
+ *
+ */
+public class ErrorWaitStrategy {

Review Comment:
   Updated...using this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] szetszwo commented on a diff in pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "szetszwo (via GitHub)" <gi...@apache.org>.
szetszwo commented on code in PR #883:
URL: https://github.com/apache/ratis/pull/883#discussion_r1206537027


##########
ratis-common/src/main/java/org/apache/ratis/util/ErrorWaitStrategy.java:
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.ratis.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.atomic.AtomicLong;
+
+/*
+ * Wait strategy for incremental wait based on error count.
+ *
+ */
+public class ErrorWaitStrategy {

Review Comment:
   As mentioned by @adoroszlai , please use `MultipleLinearRandomRetry`.



##########
ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java:
##########
@@ -101,6 +102,7 @@ public GrpcLogAppender(RaftServer.Division server, LeaderState leaderState, Foll
 
     lock = new AutoCloseableReadWriteLock(this);
     caller = LOG.isTraceEnabled()? JavaUtils.getCallerStackTraceElement(): null;
+    errorWaitStrategy = new ErrorWaitStrategy("5,1000,10,5000");

Review Comment:
   The wait times should depend on `getServer().properties().minRpcTimeoutMs()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] sumitagrawl commented on pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on PR #883:
URL: https://github.com/apache/ratis/pull/883#issuecomment-1567847786

   > Thanks @sumitagrawl for working on this. Would it be possible to use some `RetryPolicy` (see also `RetryPolicies`) for this, instead of adding a new class with similar purpose?
   
   @adoroszlai Thanks for the MultipleLinearRandomRetry class info, I have reused this for defining wait strategy reusing this.
   ErrorWaitStrategy is using this to define wait duration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ratis] szetszwo commented on a diff in pull request #883: RATIS-1836. Frequent retry by leader on failure path for appendEntry

Posted by "szetszwo (via GitHub)" <gi...@apache.org>.
szetszwo commented on code in PR #883:
URL: https://github.com/apache/ratis/pull/883#discussion_r1211020872


##########
ratis-server/src/test/java/org/apache/ratis/InstallSnapshotFromLeaderTests.java:
##########
@@ -68,6 +68,7 @@
         prop, SNAPSHOT_TRIGGER_THRESHOLD);
     RaftServerConfigKeys.Snapshot.setAutoTriggerEnabled(prop, true);
     RaftServerConfigKeys.Log.Appender.setSnapshotChunkSizeMax(prop, SizeInBytes.ONE_KB);
+    RaftServerConfigKeys.Log.Appender.setRetryPolicy(getProperties(), "1ms,1000");

Review Comment:
   @sumitagrawl  Why setting 1ms in tests?  Setting it so short essentially disable the feature.  Why not using the default?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ratis.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org