You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/05/05 07:27:12 UTC

[GitHub] [helix] kaisun2000 commented on a change in pull request #970: Add async call retry to resolve the transient ZK connection issue.

kaisun2000 commented on a change in pull request #970:
URL: https://github.com/apache/helix/pull/970#discussion_r419899421



##########
File path: zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java
##########
@@ -172,22 +193,52 @@ public int getRc() {
       return _rc;
     }
 
+    @Override
+    public void notifyCallers() {
+      LOG.warn("The callback {} has been cancelled.", this);
+      markOperationDone();
+    }
+
+    /**
+     * Additional callback handling.
+     */
     abstract public void handle();
-  }
 
-  public static class ZkAsyncCallContext {
-    private long _startTimeMilliSec;
-    private int _bytes;
-    private ZkClientMonitor _monitor;
-    private boolean _isRead;
+    private void markOperationDone() {
+      synchronized (_isOperationDone) {
+        _isOperationDone.set(true);
+        _isOperationDone.notifyAll();
+      }
+    }
 
-    public ZkAsyncCallContext(final ZkClientMonitor monitor, long startTimeMilliSec, int bytes,
-        boolean isRead) {
-      _monitor = monitor;
-      _startTimeMilliSec = startTimeMilliSec;
-      _bytes = bytes;
-      _isRead = isRead;
+    /**
+     * @param rc the return code
+     * @return true if the error is transient and the operation may succeed when being retried.
+     */
+    private boolean needRetry(int rc) {
+      try {
+        switch (Code.get(rc)) {
+        /** Connection to the server has been lost */
+        case CONNECTIONLOSS:
+          /** The session has been expired by the server */
+        case SESSIONEXPIRED:
+          /** Session moved to another server, so operation is ignored */

Review comment:
       These Aync call is normally for batch access from ZkBaseDataAccessor I believe.  Here, the idea is to not create ephemeral nodes because SESSIONEXPIRED can be retry. Then we should probably fail ephemeral code creating asyncly too, right?

##########
File path: zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -183,6 +190,9 @@ protected ZkClient(IZkConnection zkConnection, int connectionTimeout, long opera
     _operationRetryTimeoutInMillis = operationRetryTimeout;
     _isNewSessionEventFired = false;
 
+    _asyncCallRetryThread = new ZkAsyncRetryThread(zkConnection.getServers());

Review comment:
       We should give name of this thread that can be tied to the ZkEvent thread name. This way, when we debug it, we know the relation. Otherwise it would be very hard to correlate and reason.
   

##########
File path: zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java
##########
@@ -122,44 +121,66 @@ public void handle() {
   }
 
   /**
-   * Default callback for zookeeper async api
+   * Default callback for zookeeper async api.
    */
-  public static abstract class DefaultCallback {
-    AtomicBoolean _lock = new AtomicBoolean(false);
-    int _rc = -1;
+  public static abstract class DefaultCallback implements CancellableZkAsyncCallback {
+    AtomicBoolean _isOperationDone = new AtomicBoolean(false);
+    int _rc = UNKNOWN_RET_CODE;

Review comment:
       why change this value from -1 to 255?

##########
File path: zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/callback/ZkAsyncCallbacks.java
##########
@@ -122,44 +121,66 @@ public void handle() {
   }
 
   /**
-   * Default callback for zookeeper async api
+   * Default callback for zookeeper async api.
    */
-  public static abstract class DefaultCallback {
-    AtomicBoolean _lock = new AtomicBoolean(false);
-    int _rc = -1;
+  public static abstract class DefaultCallback implements CancellableZkAsyncCallback {
+    AtomicBoolean _isOperationDone = new AtomicBoolean(false);
+    int _rc = UNKNOWN_RET_CODE;
 
     public void callback(int rc, String path, Object ctx) {
       if (rc != 0 && LOG.isDebugEnabled()) {
         LOG.debug(this + ", rc:" + Code.get(rc) + ", path: " + path);
       }
 
-      if (ctx != null && ctx instanceof ZkAsyncCallContext) {
-        ZkAsyncCallContext zkCtx = (ZkAsyncCallContext) ctx;
-        if (zkCtx._monitor != null) {
-          if (zkCtx._isRead) {
-            zkCtx._monitor.record(path, zkCtx._bytes, zkCtx._startTimeMilliSec,
-                ZkClientMonitor.AccessType.READ);
-          } else {
-            zkCtx._monitor.record(path, zkCtx._bytes, zkCtx._startTimeMilliSec,
-                ZkClientMonitor.AccessType.WRITE);
-          }
-        }
+      if (ctx != null && ctx instanceof ZkAsyncCallMonitorContext) {
+        ((ZkAsyncCallMonitorContext) ctx).recordAccess(path);
       }
 
       _rc = rc;
-      handle();
 
-      synchronized (_lock) {
-        _lock.set(true);
-        _lock.notify();
+      // If retry is requested by passing the retry callback context, do retry if necessary.
+      if (needRetry(rc)) {
+        if (ctx != null && ctx instanceof ZkAsyncRetryCallContext) {
+          try {
+            if (((ZkAsyncRetryCallContext) ctx).requestRetry()) {
+              // The retry operation will be done asynchronously. Once it is done, the same callback
+              // handler object shall be triggered to ensure the result is notified to the right
+              // caller(s).
+              return;
+            } else {
+              LOG.warn(
+                  "Cannot request to retry the operation. The retry request thread may have been stopped.");
+            }
+          } catch (Throwable t) {
+            LOG.error("Failed to request to retry the operation.", t);

Review comment:
       Need retry, retri-able context, but retry operation failed? What to do here? Mark done, return some retriable RC value like CONNECTIONLOSS is not what the customer expect to handle right? 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org