You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/07/03 06:03:59 UTC

[GitHub] [helix] pkuwm commented on a change in pull request #1109: ZkClient should not keep retrying getChildren() due to large number of children

pkuwm commented on a change in pull request #1109:
URL: https://github.com/apache/helix/pull/1109#discussion_r449386378



##########
File path: zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -1503,12 +1555,14 @@ private void acquireEventLock() {
         } catch (Exception e) {
           throw ExceptionUtil.convertToRuntimeException(e);
         }
+
         // before attempting a retry, check whether retry timeout has elapsed
         if (System.currentTimeMillis() - operationStartTime > _operationRetryTimeoutInMillis) {
           throw new ZkTimeoutException("Operation cannot be retried because of retry timeout ("
               + _operationRetryTimeoutInMillis + " milli seconds). Retry was caused by "
               + retryCauseCode);
         }
+        LOG.warn("Retrying operation, caused by {}", retryCauseCode);

Review comment:
       From the debugging experience, it seems this is helpful log indicating retryable error like connection loss or session expired. Otherwise, it keeps retrying silently. Considering flooded logs, maybe we could have a backoff strategy or very N retry times to log it? Just a thought.
   I will not add it for now.

##########
File path: zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,50 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+
     try {
       List<String> children = retryUntilConnected(new Callable<List<String>>() {
+        private int connectionLossRetryCount = 0;
+
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount;
+            // Allow retrying 3 times before checking stat checking number of children,
+            // because there is a higher possibility that connection loss is caused by other
+            // factors such as network connectivity, connected ZK node could not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of children.
+              // Infinitely retrying connecting may cause high GC in ZK server and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to exit retry loop.
+              // TODO: remove this check once we have a better way to exit infinite retry
+              Stat stat = getStat(path);
+              if (stat != null) {
+                if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+                  LOG.error("Failed to get children for path {} because number of children {} "
+                          + "exceeds limit {}, aborting retry.", path, stat.getNumChildren(),
+                      NUM_CHILDREN_LIMIT);
+                  // MarshallingErrorException could represent transport error: exceeding the
+                  // Jute buffer size. So use it to exit retry loop and tell that zk is not able to
+                  // transport the data because packet length is too large.
+                  throw new KeeperException.MarshallingErrorException();
+                } else {
+                  // No need to do stat again for next connection loss.
+                  connectionLossRetryCount = 0;

Review comment:
       @jiajunwang I would like to check numChildren every 3 connection loss, because: 
   allow trying 3 times before checking numChildren, because there is higher possibility connection loss is caused by session expired, etc. I don't want it to check numChildren for each connection loss which will downgrade the performance.
   
   Your logic here is only allow retrying 3 times, after 3 retries, exit retry loop. This is not what I want. I would still keep the original logic to retry on retry-able connection loss.

##########
File path: zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,50 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+
     try {
       List<String> children = retryUntilConnected(new Callable<List<String>>() {
+        private int connectionLossRetryCount = 0;
+
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount;
+            // Allow retrying 3 times before checking stat checking number of children,
+            // because there is a higher possibility that connection loss is caused by other
+            // factors such as network connectivity, connected ZK node could not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of children.
+              // Infinitely retrying connecting may cause high GC in ZK server and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to exit retry loop.
+              // TODO: remove this check once we have a better way to exit infinite retry
+              Stat stat = getStat(path);
+              if (stat != null) {
+                if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+                  LOG.error("Failed to get children for path {} because number of children {} "

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org