You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@celeborn.apache.org by "AngersZhuuuu (via GitHub)" <gi...@apache.org> on 2023/04/03 09:10:47 UTC

[GitHub] [incubator-celeborn] AngersZhuuuu opened a new pull request, #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

AngersZhuuuu opened a new pull request, #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1586139255

   Hi @AngersZhuuuu , this PR seems very necessary, it'll be very nice if you make it done : )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] codecov[bot] commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "codecov[bot] (via GitHub)" <gi...@apache.org>.
codecov[bot] commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1493987628

   ## [Codecov](https://codecov.io/gh/apache/incubator-celeborn/pull/1406?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1406](https://codecov.io/gh/apache/incubator-celeborn/pull/1406?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9c7d6a6) into [main](https://codecov.io/gh/apache/incubator-celeborn/commit/015788dd283cd0e820f4347e6c9b78d6bc7b4fd8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (015788d) will **decrease** coverage by `0.08%`.
   > The diff coverage is `77.78%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##             main    #1406      +/-   ##
   ==========================================
   - Coverage   45.01%   44.92%   -0.08%     
   ==========================================
     Files         164      164              
     Lines       10404    10413       +9     
     Branches     1057     1057              
   ==========================================
   - Hits         4682     4677       -5     
   - Misses       5382     5394      +12     
   - Partials      340      342       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-celeborn/pull/1406?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...he/celeborn/common/protocol/PartitionLocation.java](https://codecov.io/gh/apache/incubator-celeborn/pull/1406?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jZWxlYm9ybi9jb21tb24vcHJvdG9jb2wvUGFydGl0aW9uTG9jYXRpb24uamF2YQ==) | `70.41% <50.00%> (-0.42%)` | :arrow_down: |
   | [...cala/org/apache/celeborn/common/CelebornConf.scala](https://codecov.io/gh/apache/incubator-celeborn/pull/1406?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y29tbW9uL3NyYy9tYWluL3NjYWxhL29yZy9hcGFjaGUvY2VsZWJvcm4vY29tbW9uL0NlbGVib3JuQ29uZi5zY2FsYQ==) | `86.29% <85.72%> (-<0.01%)` | :arrow_down: |
   
   ... and [4 files with indirect coverage changes](https://codecov.io/gh/apache/incubator-celeborn/pull/1406/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#discussion_r1227452982


##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -226,11 +237,57 @@ private void moveToNextReader() throws IOException {
       currentChunk = getNextChunk();
     }
 
+    private void blacklistFailedLocation(PartitionLocation location, Exception e) {
+      if (conf.clientPushReplicateEnabled() && fetchBlacklistEnabled) {
+        if (criticalCause(e)) {
+          fetchBlacklist.put(location.hostAndFetchPort(), System.currentTimeMillis());
+          // Avoid one data's two replicate both can't be fetched.
+          if (location.getPeer() != null) {
+            fetchBlacklist.remove(location.getPeer().hostAndFetchPort());
+          }
+        }
+      }
+    }
+
+    // In RssInputStream, all operation is sync.
+    private boolean isBlacklisted(PartitionLocation location) {
+      Long timestamp = fetchBlacklist.get(location.hostAndFetchPort());
+      if (timestamp == null) {
+        return false;
+      } else if (System.currentTimeMillis() - timestamp > fetchExcludedWorkerExpireTimeout) {
+        fetchBlacklist.remove(location.hostAndFetchPort());
+        return false;
+      } else {
+        return true;
+      }
+    }
+
+    private boolean criticalCause(Exception e) {
+      boolean isConnectTimeout =
+          e instanceof IOException
+              && e.getMessage() != null
+              && e.getMessage().startsWith("Connecting to")
+              && e.getMessage().contains("timed out");
+      boolean rpcTimeout =
+          e instanceof IOException
+              && e.getCause() != null
+              && e.getCause() instanceof TimeoutException;
+      boolean fetchChunkTimeout =
+          e instanceof CelebornIOException
+              && e.getCause() != null
+              && e.getCause() instanceof IOException;
+      return isConnectTimeout || rpcTimeout || fetchChunkTimeout;
+    }
+
     private PartitionReader createReaderWithRetry(PartitionLocation location) throws IOException {
       while (fetchChunkRetryCnt < fetchChunkMaxRetry) {
         try {
+          if (isBlacklisted(location)) {

Review Comment:
   > There may have three partitionLocation pair, such as A/B,B/C, A/C. If worker A And C exclude by first two, then if we try A/C, there is no partitionLocations left.
   
   Change the behavior, how about current?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1498618209

   @waitinfuture How about current?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#discussion_r1226021820


##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -226,11 +237,56 @@ private void moveToNextReader() throws IOException {
       currentChunk = getNextChunk();
     }
 
+    private void blacklistFailedLocation(PartitionLocation location, Exception e) {
+      if (conf.clientPushReplicateEnabled() && fetchBlacklistEnabled) {
+        if (criticalCause(e)) {
+          fetchBlacklist.put(location.hostAndFetchPort(), System.currentTimeMillis());
+          // Avoid one data's two replicate both can't be fetched.
+          if (location.getPeer() != null) {
+            fetchBlacklist.remove(location.getPeer().hostAndFetchPort());
+          }
+        }
+      }
+    }
+
+    // In RssInputStream, all operation is sync.
+    private boolean isBlacklisted(PartitionLocation location) {
+      Long timestamp = fetchBlacklist.get(location.hostAndFetchPort());
+      if (timestamp == null) {
+        return false;
+      } else if (System.currentTimeMillis() - timestamp > fetchExcludedWorkerExpireTimeout) {
+        fetchBlacklist.remove(location.hostAndFetchPort());
+        return false;
+      } else {
+        return true;
+      }
+    }
+
+    private boolean criticalCause(Exception e) {
+      boolean isConnectTimeout =
+          e instanceof IOException
+              && e.getMessage().startsWith("Connecting to")

Review Comment:
   > please guard `e.getMessage() != null`
   
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] github-actions[bot] closed pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch
URL: https://github.com/apache/incubator-celeborn/pull/1406


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] RexXiong commented on a diff in pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "RexXiong (via GitHub)" <gi...@apache.org>.
RexXiong commented on code in PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#discussion_r1227421206


##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -226,11 +237,57 @@ private void moveToNextReader() throws IOException {
       currentChunk = getNextChunk();
     }
 
+    private void blacklistFailedLocation(PartitionLocation location, Exception e) {
+      if (conf.clientPushReplicateEnabled() && fetchBlacklistEnabled) {
+        if (criticalCause(e)) {
+          fetchBlacklist.put(location.hostAndFetchPort(), System.currentTimeMillis());
+          // Avoid one data's two replicate both can't be fetched.
+          if (location.getPeer() != null) {
+            fetchBlacklist.remove(location.getPeer().hostAndFetchPort());
+          }
+        }
+      }
+    }
+
+    // In RssInputStream, all operation is sync.
+    private boolean isBlacklisted(PartitionLocation location) {
+      Long timestamp = fetchBlacklist.get(location.hostAndFetchPort());
+      if (timestamp == null) {
+        return false;
+      } else if (System.currentTimeMillis() - timestamp > fetchExcludedWorkerExpireTimeout) {
+        fetchBlacklist.remove(location.hostAndFetchPort());
+        return false;
+      } else {
+        return true;
+      }
+    }
+
+    private boolean criticalCause(Exception e) {
+      boolean isConnectTimeout =
+          e instanceof IOException
+              && e.getMessage() != null
+              && e.getMessage().startsWith("Connecting to")
+              && e.getMessage().contains("timed out");
+      boolean rpcTimeout =
+          e instanceof IOException
+              && e.getCause() != null
+              && e.getCause() instanceof TimeoutException;
+      boolean fetchChunkTimeout =
+          e instanceof CelebornIOException
+              && e.getCause() != null
+              && e.getCause() instanceof IOException;
+      return isConnectTimeout || rpcTimeout || fetchChunkTimeout;
+    }
+
     private PartitionReader createReaderWithRetry(PartitionLocation location) throws IOException {
       while (fetchChunkRetryCnt < fetchChunkMaxRetry) {
         try {
+          if (isBlacklisted(location)) {

Review Comment:
   There may have three partitionLocation pair, such as A/B,B/C, A/C. If worker A And C exclude by first two, then if we try A/C, there is no partitionLocations left.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1494018349

   > Test cases seems very positive in worker-unavailable scenarios. I'm thinking whether we should add mechanism to remove workers from blacklist, since we can't guarantee the worker is broken or caused by fluctuation.
   
   For the same PartitionLocation, if the peer has failed, how about allowing the blacklisted machine to retry?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1494013519

   Thanks for the PR @AngersZhuuuu ! Test cases seems very positive in worker-unavailable scenarios. I'm thinking whether we should add mechanism to remove workers from blacklist, since we can't guarantee the worker is broken or caused by fluctuation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1588513970

   Any more suggestion? @pan3793 @RexXiong @waitinfuture 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1586511581

   After this pr I will make a new pr to unify the blacklist and execluded related configuration name.
   Any more suggestion? @pan3793 @waitinfuture 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1589274531

   LGTM, thanks! I'm thinking whether we should unify push blacklist and fetch blacklist. cc @AngersZhuuuu @pan3793 @RexXiong 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] pan3793 commented on a diff in pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#discussion_r1225814118


##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -226,11 +237,56 @@ private void moveToNextReader() throws IOException {
       currentChunk = getNextChunk();
     }
 
+    private void blacklistFailedLocation(PartitionLocation location, Exception e) {
+      if (conf.clientPushReplicateEnabled() && fetchBlacklistEnabled) {
+        if (criticalCause(e)) {
+          fetchBlacklist.put(location.hostAndFetchPort(), System.currentTimeMillis());
+          // Avoid one data's two replicate both can't be fetched.
+          if (location.getPeer() != null) {
+            fetchBlacklist.remove(location.getPeer().hostAndFetchPort());
+          }
+        }
+      }
+    }
+
+    // In RssInputStream, all operation is sync.
+    private boolean isBlacklisted(PartitionLocation location) {
+      Long timestamp = fetchBlacklist.get(location.hostAndFetchPort());
+      if (timestamp == null) {
+        return false;
+      } else if (System.currentTimeMillis() - timestamp > fetchExcludedWorkerExpireTimeout) {
+        fetchBlacklist.remove(location.hostAndFetchPort());
+        return false;
+      } else {
+        return true;
+      }
+    }
+
+    private boolean criticalCause(Exception e) {
+      boolean isConnectTimeout =
+          e instanceof IOException
+              && e.getMessage().startsWith("Connecting to")

Review Comment:
   please guard `e.getMessage() != null`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1589080459

   Before
   <img width="1791" alt="截屏2023-06-13 下午7 05 53" src="https://github.com/apache/incubator-celeborn/assets/46485123/307d17a4-5706-4326-834f-358297c04741">
   
   After
   <img width="1054" alt="截屏2023-06-13 下午7 04 50" src="https://github.com/apache/incubator-celeborn/assets/46485123/0460bfa5-12d2-42cb-933f-e5abda0d8ac7">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1586170562

   > 
   
   +1, shall we do code refactor before further related PRs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1495223920

   > 
   
   Shall we add a timeout for blacklisted worker?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] github-actions[bot] commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1537357832

   This issue was closed because it has been staled for 10 days with no activity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] RexXiong closed pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "RexXiong (via GitHub)" <gi...@apache.org>.
RexXiong closed pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch
URL: https://github.com/apache/incubator-celeborn/pull/1406


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] pan3793 commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1586165686

   It seems like we are mixing up the terms "excludedWorker" and "blacklist". 
   
   First of all, I don't object "blacklist", but we should unify them.
   
   There are some options I think:
   
   1. use "blacklist" anywhere
   2. use "exclusion" anywhere, even it does not mention excluding what, but it should be clear we are excluding bad workers
   3. use "excludedWorker" or "workerExclusion" based on the context.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] github-actions[bot] commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1523001428

   This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] RexXiong commented on a diff in pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "RexXiong (via GitHub)" <gi...@apache.org>.
RexXiong commented on code in PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#discussion_r1226274691


##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -399,6 +461,7 @@ public void close() {
         currentReader.close();
         currentReader = null;
       }
+      fetchBlacklist.clear();

Review Comment:
    For fetchBlacklist shared by all RssInputStreams, why one single rssInputStream can clear this?



##########
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala:
##########
@@ -2825,6 +2828,22 @@ object CelebornConf extends Logging {
       .intConf
       .createWithDefault(3)
 
+  val CLIENT_FETCH_BLACKLIST_ENABLED: ConfigEntry[Boolean] =
+    buildConf("celeborn.client.fetch.blacklist.enabled")
+      .categories("client")
+      .doc("Whether to enable shuffle client-side fetch blacklist of workers.")
+      .version("0.3.0")
+      .booleanConf
+      .createWithDefault(false)
+
+  val CLIENT_FETCH_EXCLUDED_WORKER_EXPIRE_TIMEOUT: ConfigEntry[Long] =
+    buildConf("celeborn.client.fetch.exclude.expireTimeout")
+      .categories("client")
+      .doc("ShuffleClint is a static object, it will be used in the whole lifecycle of Executor," +

Review Comment:
   ShuffleClint -> ShuffleClient



##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -226,11 +237,57 @@ private void moveToNextReader() throws IOException {
       currentChunk = getNextChunk();
     }
 
+    private void blacklistFailedLocation(PartitionLocation location, Exception e) {
+      if (conf.clientPushReplicateEnabled() && fetchBlacklistEnabled) {
+        if (criticalCause(e)) {
+          fetchBlacklist.put(location.hostAndFetchPort(), System.currentTimeMillis());
+          // Avoid one data's two replicate both can't be fetched.
+          if (location.getPeer() != null) {
+            fetchBlacklist.remove(location.getPeer().hostAndFetchPort());
+          }
+        }
+      }
+    }
+
+    // In RssInputStream, all operation is sync.
+    private boolean isBlacklisted(PartitionLocation location) {
+      Long timestamp = fetchBlacklist.get(location.hostAndFetchPort());
+      if (timestamp == null) {
+        return false;
+      } else if (System.currentTimeMillis() - timestamp > fetchExcludedWorkerExpireTimeout) {
+        fetchBlacklist.remove(location.hostAndFetchPort());
+        return false;
+      } else {
+        return true;
+      }
+    }
+
+    private boolean criticalCause(Exception e) {
+      boolean isConnectTimeout =
+          e instanceof IOException
+              && e.getMessage() != null
+              && e.getMessage().startsWith("Connecting to")
+              && e.getMessage().contains("timed out");
+      boolean rpcTimeout =
+          e instanceof IOException
+              && e.getCause() != null
+              && e.getCause() instanceof TimeoutException;
+      boolean fetchChunkTimeout =
+          e instanceof CelebornIOException
+              && e.getCause() != null
+              && e.getCause() instanceof IOException;
+      return isConnectTimeout || rpcTimeout || fetchChunkTimeout;
+    }
+
     private PartitionReader createReaderWithRetry(PartitionLocation location) throws IOException {
       while (fetchChunkRetryCnt < fetchChunkMaxRetry) {
         try {
+          if (isBlacklisted(location)) {

Review Comment:
   If the two replica already in blacklist(by others). There may be no partition location for this input stream can retry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#discussion_r1226366291


##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -399,6 +461,7 @@ public void close() {
         currentReader.close();
         currentReader = null;
       }
+      fetchBlacklist.clear();

Review Comment:
   > For fetchBlacklist shared by all RssInputStreams, why one single rssInputStream can clear this?
   
   Only work for one RssInputStreamImpl



##########
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala:
##########
@@ -2825,6 +2828,22 @@ object CelebornConf extends Logging {
       .intConf
       .createWithDefault(3)
 
+  val CLIENT_FETCH_BLACKLIST_ENABLED: ConfigEntry[Boolean] =
+    buildConf("celeborn.client.fetch.blacklist.enabled")
+      .categories("client")
+      .doc("Whether to enable shuffle client-side fetch blacklist of workers.")
+      .version("0.3.0")
+      .booleanConf
+      .createWithDefault(false)
+
+  val CLIENT_FETCH_EXCLUDED_WORKER_EXPIRE_TIMEOUT: ConfigEntry[Long] =
+    buildConf("celeborn.client.fetch.exclude.expireTimeout")
+      .categories("client")
+      .doc("ShuffleClint is a static object, it will be used in the whole lifecycle of Executor," +

Review Comment:
   DOne



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on a diff in pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on code in PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#discussion_r1226369050


##########
client/src/main/java/org/apache/celeborn/client/read/RssInputStream.java:
##########
@@ -226,11 +237,57 @@ private void moveToNextReader() throws IOException {
       currentChunk = getNextChunk();
     }
 
+    private void blacklistFailedLocation(PartitionLocation location, Exception e) {
+      if (conf.clientPushReplicateEnabled() && fetchBlacklistEnabled) {
+        if (criticalCause(e)) {
+          fetchBlacklist.put(location.hostAndFetchPort(), System.currentTimeMillis());
+          // Avoid one data's two replicate both can't be fetched.
+          if (location.getPeer() != null) {
+            fetchBlacklist.remove(location.getPeer().hostAndFetchPort());
+          }
+        }
+      }
+    }
+
+    // In RssInputStream, all operation is sync.
+    private boolean isBlacklisted(PartitionLocation location) {
+      Long timestamp = fetchBlacklist.get(location.hostAndFetchPort());
+      if (timestamp == null) {
+        return false;
+      } else if (System.currentTimeMillis() - timestamp > fetchExcludedWorkerExpireTimeout) {
+        fetchBlacklist.remove(location.hostAndFetchPort());
+        return false;
+      } else {
+        return true;
+      }
+    }
+
+    private boolean criticalCause(Exception e) {
+      boolean isConnectTimeout =
+          e instanceof IOException
+              && e.getMessage() != null
+              && e.getMessage().startsWith("Connecting to")
+              && e.getMessage().contains("timed out");
+      boolean rpcTimeout =
+          e instanceof IOException
+              && e.getCause() != null
+              && e.getCause() instanceof TimeoutException;
+      boolean fetchChunkTimeout =
+          e instanceof CelebornIOException
+              && e.getCause() != null
+              && e.getCause() instanceof IOException;
+      return isConnectTimeout || rpcTimeout || fetchChunkTimeout;
+    }
+
     private PartitionReader createReaderWithRetry(PartitionLocation location) throws IOException {
       while (fetchChunkRetryCnt < fetchChunkMaxRetry) {
         try {
+          if (isBlacklisted(location)) {

Review Comment:
   Avoid this in `blacklistFailedLocation`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] waitinfuture commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "waitinfuture (via GitHub)" <gi...@apache.org>.
waitinfuture commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1495236185

   > > Test cases seems very positive in worker-unavailable scenarios. I'm thinking whether we should add mechanism to remove workers from blacklist, since we can't guarantee the worker is broken or caused by fluctuation.
   > 
   > For the same PartitionLocation, if the peer has failed, how about allowing the blacklisted machine to retry?
   
   In this PR, if both workers are blacklist, then the fetch will fast fail, maybe it's not what we want. We can add timeout for each blacklisted worker, and only consider a worker unavailable if at least two(configurable) exceptions happens.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-celeborn] AngersZhuuuu commented on pull request #1406: [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch

Posted by "AngersZhuuuu (via GitHub)" <gi...@apache.org>.
AngersZhuuuu commented on PR #1406:
URL: https://github.com/apache/incubator-celeborn/pull/1406#issuecomment-1493964379

   ping @waitinfuture @RexXiong @FMX 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org