You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@celeborn.apache.org by GitBox <gi...@apache.org> on 2022/11/17 11:08:10 UTC

[GitHub] [incubator-celeborn] waitinfuture opened a new pull request, #979: [CELEBORN-12] Retry on CommitFile request

waitinfuture opened a new pull request, #979:
URL: https://github.com/apache/incubator-celeborn/pull/979

   # [BUG]/[FEATURE] title
   
   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   ### What are the items that need reviewer attention?
   
   
   ### Related issues.
   
   
   ### Related pull requests.
   
   
   ### How was this patch tested?
   
   
   /cc @related-reviewer
   
   /assign @main-reviewer
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
waitinfuture commented on PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#issuecomment-1319444125

   /regression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
waitinfuture commented on PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#issuecomment-1328040038

   Duplicate with https://github.com/apache/incubator-celeborn/pull/1011


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] zhongqiangczq commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
zhongqiangczq commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1026125186


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -294,11 +297,7 @@ private[deploy] class Controller(
           },
           commitThreadPool)
 
-        if (future == null) {
-          future = task
-        } else {
-          future = CompletableFuture.allOf(future, task)
-        }
+        future = task

Review Comment:
   why removed this CompletableFuture.allOf(future, task)?only one  task?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] RexXiong commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
RexXiong commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1025116606


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -324,6 +324,42 @@ private[deploy] class Controller(
       return
     }
 
+    val shuffleCommitTimeout = conf.workerShuffleCommitTimeout
+
+    shuffleCommitInfos.putIfAbsent(shuffleKey, new CommitInfo(null, CommitInfo.COMMIT_NOTSTARTED))
+    val status = shuffleCommitInfos.get(shuffleKey)

Review Comment:
   rename status to shuffleCommitInfo would be better



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
waitinfuture commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1025153073


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -324,6 +324,42 @@ private[deploy] class Controller(
       return
     }
 
+    val shuffleCommitTimeout = conf.workerShuffleCommitTimeout
+
+    shuffleCommitInfos.putIfAbsent(shuffleKey, new CommitInfo(null, CommitInfo.COMMIT_NOTSTARTED))
+    val status = shuffleCommitInfos.get(shuffleKey)
+
+    def waitForCommitFinish(): Unit = {
+      val delta = 100
+      var times = 0
+      while (delta * times < shuffleCommitTimeout) {
+        status.synchronized {
+          if (status.status == CommitInfo.COMMIT_FINISHED) {
+            context.reply(status.response)
+            return
+          }
+        }
+        Thread.sleep(delta)
+        times += 1
+      }

Review Comment:
   > 
   
   I just added retry logic in client. The design is that worker should always process handleCommitFiles for a particular shuffleKey ONCE. In case (which I think is rare) one handleCommitFiles request comes while another is in process, then the request should wait for timeout. If that happens, the client will trigger requestCommitFiles again if not exceeds maxretries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
waitinfuture commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1025153073


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -324,6 +324,42 @@ private[deploy] class Controller(
       return
     }
 
+    val shuffleCommitTimeout = conf.workerShuffleCommitTimeout
+
+    shuffleCommitInfos.putIfAbsent(shuffleKey, new CommitInfo(null, CommitInfo.COMMIT_NOTSTARTED))
+    val status = shuffleCommitInfos.get(shuffleKey)
+
+    def waitForCommitFinish(): Unit = {
+      val delta = 100
+      var times = 0
+      while (delta * times < shuffleCommitTimeout) {
+        status.synchronized {
+          if (status.status == CommitInfo.COMMIT_FINISHED) {
+            context.reply(status.response)
+            return
+          }
+        }
+        Thread.sleep(delta)
+        times += 1
+      }

Review Comment:
   > 
   
   I just added retry logic in client. The design is that worker should always process handleCommitFiles for a particular shuffleKey ONCE. In case (which I think is rare) one handleCommitFiles request comes while another is in process, then the request should wait for complete or timeout. If timeout happens, the client will trigger requestCommitFiles again if not exceeds maxretries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] zhongqiangczq commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
zhongqiangczq commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1025970138


##########
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:
##########
@@ -1605,21 +1605,29 @@ class LifecycleManager(appId: String, val conf: CelebornConf) extends RpcEndpoin
     }
   }
 
-  private def requestCommitFiles(
+  private def requestCommitFilesWithRetry(
       endpoint: RpcEndpointRef,
       message: CommitFiles): CommitFilesResponse = {
-    try {
-      endpoint.askSync[CommitFilesResponse](message)
-    } catch {
-      case e: Exception =>
-        logError(s"AskSync CommitFiles for ${message.shuffleId} failed.", e)
-        CommitFilesResponse(
-          StatusCode.FAILED,
-          List.empty.asJava,
-          List.empty.asJava,
-          message.masterIds,
-          message.slaveIds)
+    val maxRetries = conf.requestCommitFilesMaxRetries
+    var retryTimes = 0
+    while (retryTimes < maxRetries) {
+      try {
+        return endpoint.askSync[CommitFilesResponse](message)
+      } catch {
+        case e: Exception =>
+          retryTimes += 1

Review Comment:
   1. whether should need judge exception type like timeoutException? 
   2. whether  should wait somestimes before next retry?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
waitinfuture commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1029373501


##########
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:
##########
@@ -1605,21 +1605,29 @@ class LifecycleManager(appId: String, val conf: CelebornConf) extends RpcEndpoin
     }
   }
 
-  private def requestCommitFiles(
+  private def requestCommitFilesWithRetry(
       endpoint: RpcEndpointRef,
       message: CommitFiles): CommitFilesResponse = {
-    try {
-      endpoint.askSync[CommitFilesResponse](message)
-    } catch {
-      case e: Exception =>
-        logError(s"AskSync CommitFiles for ${message.shuffleId} failed.", e)
-        CommitFilesResponse(
-          StatusCode.FAILED,
-          List.empty.asJava,
-          List.empty.asJava,
-          message.masterIds,
-          message.slaveIds)
+    val maxRetries = conf.requestCommitFilesMaxRetries
+    var retryTimes = 0
+    while (retryTimes < maxRetries) {
+      try {
+        return endpoint.askSync[CommitFilesResponse](message)
+      } catch {
+        case e: Exception =>
+          retryTimes += 1

Review Comment:
   1. I think we should retry regardless the Exception type.
   2. For immediate Exceptions it's useful, but for TimeoutException it's unnecessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture closed pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
waitinfuture closed pull request #979: [CELEBORN-12] Retry on CommitFile request
URL: https://github.com/apache/incubator-celeborn/pull/979


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] zhongqiangczq commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
zhongqiangczq commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1026125186


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -294,11 +297,7 @@ private[deploy] class Controller(
           },
           commitThreadPool)
 
-        if (future == null) {
-          future = task
-        } else {
-          future = CompletableFuture.allOf(future, task)
-        }
+        future = task

Review Comment:
   why remove CompletableFuture.allOf(future, task)?only one  task?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
waitinfuture commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1029374497


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -294,11 +297,7 @@ private[deploy] class Controller(
           },
           commitThreadPool)
 
-        if (future == null) {
-          future = task
-        } else {
-          future = CompletableFuture.allOf(future, task)
-        }
+        future = task

Review Comment:
   Because future is guaranteed to be null according to the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] RexXiong commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

Posted by GitBox <gi...@apache.org>.
RexXiong commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1025126032


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -324,6 +324,42 @@ private[deploy] class Controller(
       return
     }
 
+    val shuffleCommitTimeout = conf.workerShuffleCommitTimeout
+
+    shuffleCommitInfos.putIfAbsent(shuffleKey, new CommitInfo(null, CommitInfo.COMMIT_NOTSTARTED))
+    val status = shuffleCommitInfos.get(shuffleKey)
+
+    def waitForCommitFinish(): Unit = {
+      val delta = 100
+      var times = 0
+      while (delta * times < shuffleCommitTimeout) {
+        status.synchronized {
+          if (status.status == CommitInfo.COMMIT_FINISHED) {
+            context.reply(status.response)
+            return
+          }
+        }
+        Thread.sleep(delta)
+        times += 1
+      }

Review Comment:
   why not use direct response such as commit_in_process to client instead of let client wait network timeout,(otherwise the client may continue to retry due to network exceptions. is it true?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org