Posted to dev@celeborn.apache.org by GitBox <gi...@apache.org> on 2022/11/17 12:55:37 UTC

[GitHub] [incubator-celeborn] waitinfuture commented on a diff in pull request #979: [CELEBORN-12] Retry on CommitFile request

waitinfuture commented on code in PR #979:
URL: https://github.com/apache/incubator-celeborn/pull/979#discussion_r1025153073


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala:
##########
@@ -324,6 +324,42 @@ private[deploy] class Controller(
       return
     }
 
+    val shuffleCommitTimeout = conf.workerShuffleCommitTimeout
+
+    shuffleCommitInfos.putIfAbsent(shuffleKey, new CommitInfo(null, CommitInfo.COMMIT_NOTSTARTED))
+    val status = shuffleCommitInfos.get(shuffleKey)
+
+    def waitForCommitFinish(): Unit = {
+      val delta = 100
+      var times = 0
+      while (delta * times < shuffleCommitTimeout) {
+        status.synchronized {
+          if (status.status == CommitInfo.COMMIT_FINISHED) {
+            context.reply(status.response)
+            return
+          }
+        }
+        Thread.sleep(delta)
+        times += 1
+      }

Review Comment:
   I just added retry logic in the client. The design is that a worker should process handleCommitFiles for a particular shuffleKey exactly ONCE. In the (I think rare) case that one handleCommitFiles request arrives while another is still in progress, the later request should wait for the first to finish, up to the timeout. If the wait times out, the client will trigger requestCommitFiles again, as long as it has not exceeded the maximum number of retries.
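
The flow described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in the PR: the object `CommitOnceSketch`, the `String` response type, the three status constants, and the helper `requestCommitFilesWithRetry` are all hypothetical stand-ins, and the retry count is assumed to mean "one initial attempt plus up to `maxRetries` retries". Failure of the first commit is not handled here.

```scala
import java.util.concurrent.ConcurrentHashMap

object CommitOnceSketch {
  // Hypothetical status values, loosely modelled on the CommitInfo constants in the diff.
  val NotStarted = 0
  val InProcess = 1
  val Finished = 2

  final class CommitInfo(var response: String, var status: Int)

  private val shuffleCommitInfos = new ConcurrentHashMap[String, CommitInfo]()

  // Worker side: the first request for a shuffleKey runs the commit;
  // any later request only waits for that result, up to timeoutMs.
  // Returns None if the wait timed out.
  def handleCommitFiles(shuffleKey: String, timeoutMs: Long)(commit: () => String): Option[String] = {
    shuffleCommitInfos.putIfAbsent(shuffleKey, new CommitInfo(null, NotStarted))
    val info = shuffleCommitInfos.get(shuffleKey)

    // Atomically claim the commit if nobody has started it yet.
    val isFirst = info.synchronized {
      if (info.status == NotStarted) { info.status = InProcess; true } else false
    }

    if (isFirst) {
      val response = commit()
      info.synchronized {
        info.response = response
        info.status = Finished
      }
      Some(response)
    } else {
      // Poll for the first request's result, as waitForCommitFinish does in the diff.
      val deltaMs = 100L
      var waited = 0L
      var result: Option[String] = None
      while (result.isEmpty && waited < timeoutMs) {
        result = info.synchronized {
          if (info.status == Finished) Some(info.response) else None
        }
        if (result.isEmpty) {
          Thread.sleep(deltaMs)
          waited += deltaMs
        }
      }
      result
    }
  }

  // Client side: re-send the request while it times out (None),
  // making at most 1 + maxRetries attempts in total.
  def requestCommitFilesWithRetry(maxRetries: Int)(request: () => Option[String]): Option[String] = {
    var attempt = 0
    var result: Option[String] = None
    while (result.isEmpty && attempt <= maxRetries) {
      result = request()
      attempt += 1
    }
    result
  }
}
```

With this shape, a duplicate request never re-runs the commit: it either returns the first request's response or times out and leaves the decision to retry with the client.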


