You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/10/24 08:57:41 UTC

[GitHub] [incubator-uniffle] xianjingfeng opened a new pull request, #276: Add fallback mechanism for blocks read inconsistent

xianjingfeng opened a new pull request, #276:
URL: https://github.com/apache/incubator-uniffle/pull/276

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://github.com/Tencent/Firestorm/blob/master/CONTRIBUTING.md
     2. Ensure you have added or run the appropriate tests for your PR
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]XXXX Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Add fallback mechanism for blocks read inconsistent
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   When the data in this first server is damaged, application will fail. #124 #129 
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Already added


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1293047722

   > > MaxFailureTime will tolerate the logic of replicas, this is my biggest concern.
   > 
   > I don't understand
   
   For replica logic, if we use 7 replicas, we should read 4 replica successfully, but if maxFailure is 3 . Although we have 4 correct replicas, the application will fail because the app  may read 3 wrong replicas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1324464359

   Could we use the order 
   server A memory -> server A LOCALFILE -> server A HDFS 
   server B memory -> server B LOCALFILE -> server B HDFS
   server C memory -> server C LOCALFILE -> server C HDFS
   or the order
   server A memory -> server B memory -> server C memory
   server A LOCALFILE -> server B LOCALFILE -> server C LOCALFILE
   HDFS
   
   Other order may produce problems because memory data may be flushed after we mark the handler finished state.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032493600


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MultiReplicaClientReadHandler.java:
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.storage.handler.impl;
+
+import java.util.List;
+
+import com.google.common.collect.Lists;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.common.BufferSegment;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.common.exception.RssException;
+import org.apache.uniffle.common.util.RssUtils;
+import org.apache.uniffle.storage.factory.ShuffleHandlerFactory;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.request.CreateShuffleReadHandlerRequest;
+
+public class MultiReplicaClientReadHandler extends AbstractClientReadHandler {
+
+  private static final Logger LOG = LoggerFactory.getLogger(MultiReplicaClientReadHandler.class);
+
+  private List<ClientReadHandler> handlers = Lists.newLinkedList();
+
+  private long readBlockNum = 0L;
+  private long readLength = 0L;
+  private long readUncompressLength = 0L;
+  private Roaring64NavigableMap blockIdBitmap;
+  private Roaring64NavigableMap processedBlockIds;
+
+  private int readHandlerIndex;
+
+  public MultiReplicaClientReadHandler(

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291819840

   > * Should we use method `next()` instead of `fallback()`?
   > * We shouldn't have the concept of `maxFallbackTimes`.
   
   How about fallback() -> nextRound() and maxFallbackTimes -> maxRounds?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012577150


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   `HdfsClientReadHandler` catches this exception and moves forward the `readHandlerIndex ` according to the diff in `HdfsClientReadHandler`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012638529


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   Yes,you can search `nextRound` in this pr



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020852808


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > we won't retry this handler because we throw the exception here
   
   Why not? It just will not retry in current round, but it will retry in next round.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020885432


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > But in this case the fail times of LocalFileQuorumClientReadHandler reach the maxHandlerFailTimes, LocalFileQuorumClientReadHandler have three LocalFileClientReadHandlers. Every LocalFileClientReadHandler don't reach maxHandlerFailTimes
   
   `LocalFileQuorumClientReadHandler`,  `MemoryQuorumClientReadHandler` and `HdfsClientReadHandler` have no `maxHandlerFailTimes`. If all low-level handlers finished, they are finished.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] codecov-commenter commented on pull request #276: Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1288869689

   # [Codecov](https://codecov.io/gh/apache/incubator-uniffle/pull/276?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#276](https://codecov.io/gh/apache/incubator-uniffle/pull/276?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0e369de) into [master](https://codecov.io/gh/apache/incubator-uniffle/commit/47effb25044780013ff51cc516261464435b5829?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (47effb2) will **decrease** coverage by `1.60%`.
   > The diff coverage is `19.04%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #276      +/-   ##
   ============================================
   - Coverage     59.71%   58.10%   -1.61%     
   + Complexity     1377     1303      -74     
   ============================================
     Files           166      157       -9     
     Lines          8918     8492     -426     
     Branches        853      840      -13     
   ============================================
   - Hits           5325     4934     -391     
   + Misses         3318     3287      -31     
   + Partials        275      271       -4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-uniffle/pull/276?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...torage/handler/impl/AbstractClientReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9BYnN0cmFjdENsaWVudFJlYWRIYW5kbGVyLmphdmE=) | `16.66% <0.00%> (-3.34%)` | :arrow_down: |
   | [...torage/handler/impl/ComposedClientReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9Db21wb3NlZENsaWVudFJlYWRIYW5kbGVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...e/storage/handler/impl/HdfsShuffleReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9IZGZzU2h1ZmZsZVJlYWRIYW5kbGVyLmphdmE=) | `57.14% <0.00%> (ø)` | |
   | [...handler/impl/LocalFileQuorumClientReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9Mb2NhbEZpbGVRdW9ydW1DbGllbnRSZWFkSGFuZGxlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../storage/handler/impl/MemoryClientReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9NZW1vcnlDbGllbnRSZWFkSGFuZGxlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ge/handler/impl/MemoryQuorumClientReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9NZW1vcnlRdW9ydW1DbGllbnRSZWFkSGFuZGxlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...storage/handler/impl/DataSkippableReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9EYXRhU2tpcHBhYmxlUmVhZEhhbmRsZXIuamF2YQ==) | `75.00% <25.00%> (-9.38%)` | :arrow_down: |
   | [...le/storage/handler/impl/HdfsClientReadHandler.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c3RvcmFnZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvdW5pZmZsZS9zdG9yYWdlL2hhbmRsZXIvaW1wbC9IZGZzQ2xpZW50UmVhZEhhbmRsZXIuamF2YQ==) | `63.21% <25.00%> (-12.13%)` | :arrow_down: |
   | [...che/uniffle/client/impl/ShuffleReadClientImpl.java](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y2xpZW50L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL2NsaWVudC9pbXBsL1NodWZmbGVSZWFkQ2xpZW50SW1wbC5qYXZh) | `87.96% <83.33%> (-1.73%)` | :arrow_down: |
   | ... and [11 more](https://codecov.io/gh/apache/incubator-uniffle/pull/276/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1003949297


##########
client/src/main/java/org/apache/uniffle/client/impl/ShuffleReadClientImpl.java:
##########
@@ -60,6 +60,8 @@ public class ShuffleReadClientImpl implements ShuffleReadClient {
   private AtomicLong crcCheckTime = new AtomicLong(0);
   private ClientReadHandler clientReadHandler;
   private final IdHelper idHelper;
+  private int fallbackTimes;
+  private int maxFallbackTimes = 3;

Review Comment:
   Most jobs can run successfully without fallback mechanism, so it think 3 is enough. And it's difficult to pass parameters in the client. Should we add an new configration class?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012583536


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   Yes, if the first `hdfsShuffleFileReader` fail, it will read the next one. But how about all `hdfsShuffleFileReader` fail for network temporarily unavailable. Do i get your point?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1324474542

   > Could we use the order server A memory -> server A LOCALFILE -> server A HDFS server B memory -> server B LOCALFILE -> server B HDFS server C memory -> server C LOCALFILE -> server C HDFS or the order server A memory -> server B memory -> server C memory server A LOCALFILE -> server B LOCALFILE -> server C LOCALFILE HDFS
   > 
   > Other order may produce problems because memory data may be flushed after we mark the handler finished state.
   
   I think the first solution is better, i will try to do it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291774843

   > One question: Should we use the concept of `FALLBACK`. In my opinion, we use fallback mechanism to process error or exception, but it's normal for us to read memory, localfile and hdfs at the same time.
   
   The major reasion is the client just read from one shuffle server, it is a bug. #124


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1005168462


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/HdfsClientReadHandler.java:
##########
@@ -127,21 +127,32 @@ public ShuffleDataResult readShuffleData() {
       init(fullShufflePath);
     }
 
-    if (readHandlerIndex >= readHandlers.size()) {
+    if (finished()) {
       return new ShuffleDataResult();
     }
 
-    HdfsShuffleReadHandler hdfsShuffleFileReader = readHandlers.get(readHandlerIndex);
-    ShuffleDataResult shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
-
-    while (shuffleDataResult == null) {
-      ++readHandlerIndex;
+    HdfsShuffleReadHandler hdfsShuffleFileReader;

Review Comment:
   I think no nessential, for unify?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1288894417

   > What's the relation between this pr and #129 ?
   
   As what we discuss in #129, #129 is not the best and it will be split into multiple prs and this pr is one of them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032448365


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MultiReplicaClientReadHandler.java:
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.storage.handler.impl;
+
+import java.util.List;
+
+import com.google.common.collect.Lists;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.common.BufferSegment;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.common.exception.RssException;
+import org.apache.uniffle.common.util.RssUtils;
+import org.apache.uniffle.storage.factory.ShuffleHandlerFactory;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.request.CreateShuffleReadHandlerRequest;
+
+public class MultiReplicaClientReadHandler extends AbstractClientReadHandler {
+
+  private static final Logger LOG = LoggerFactory.getLogger(MultiReplicaClientReadHandler.class);
+
+  private List<ClientReadHandler> handlers = Lists.newLinkedList();
+
+  private long readBlockNum = 0L;
+  private long readLength = 0L;
+  private long readUncompressLength = 0L;
+  private Roaring64NavigableMap blockIdBitmap;
+  private Roaring64NavigableMap processedBlockIds;
+
+  private int readHandlerIndex;
+
+  public MultiReplicaClientReadHandler(

Review Comment:
   Could we put the creation logic into ShuffleHandlerFactory?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032497732


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/HdfsClientReadHandler.java:
##########
@@ -154,23 +158,26 @@ public ShuffleDataResult readShuffleData() {
       init(fullShufflePath);
     }
 
-    if (readHandlerIndex >= readHandlers.size()) {
-      return new ShuffleDataResult();
-    }
-
-    HdfsShuffleReadHandler hdfsShuffleFileReader = readHandlers.get(readHandlerIndex);
-    ShuffleDataResult shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
-
-    while (shuffleDataResult == null) {
-      ++readHandlerIndex;
+    HdfsShuffleReadHandler hdfsShuffleFileReader;
+    ShuffleDataResult shuffleDataResult;

Review Comment:
   Why do we modify HDFS's logic? Is it necessary?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032747983


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/HdfsShuffleReadHandler.java:
##########
@@ -94,8 +94,8 @@ protected ShuffleIndexResult readShuffleIndex() {
       return new ShuffleIndexResult(indexData, dateFileLen);
     } catch (Exception e) {
       LOG.info("Fail to read index files {}.index", filePrefix, e);
+      throw e;

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1324489740

   > > Could we use the order server A memory -> server A LOCALFILE -> server A HDFS server B memory -> server B LOCALFILE -> server B HDFS server C memory -> server C LOCALFILE -> server C HDFS or the order server A memory -> server B memory -> server C memory server A LOCALFILE -> server B LOCALFILE -> server C LOCALFILE HDFS
   > > Other order may produce problems because memory data may be flushed after we mark the handler finished state.
   > 
   > I think the first solution is better, i will try to do it.
   
   We'd better have a layer abstraction to encapsulate the order logic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020850428


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   For 2nd point:
   If the network is unavailable temporarily, we won't retry this handler because we throw the exception here. Is it necessary that we add the failTimes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1329349295

   Merged. thanks all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012719982


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   So it may loop the handlers a second time, skip the finished ones and some handlers may be still unfinished for the maxHandlerFailTimes might be greater than 2? 
   I have two suggestions, could you help to review?
   - Do the retry in the upper layer and only keep state data such as the finish flag and `segmentIndex` in the low-level handlers, and derive the finish flag from the `segmentIndex`, which can be also used as a checkpoint for retry.
   - **Do Not** expose the retry logic to the `ShuffleClient`, make it cohesive in the `ClientReadHandler` and can be configured.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012793429


##########
storage/src/main/java/org/apache/uniffle/storage/request/CreateShuffleReadHandlerRequest.java:
##########
@@ -41,6 +41,7 @@ public class CreateShuffleReadHandlerRequest {
   private List<ShuffleServerInfo> shuffleServerInfoList;
   private Roaring64NavigableMap expectBlockIds;
   private Roaring64NavigableMap processBlockIds;
+  private int maxHanderFailTimes;

Review Comment:
   ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [WIP][ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1031685902


##########
storage/src/main/java/org/apache/uniffle/storage/factory/ShuffleHandlerFactory.java:
##########
@@ -53,80 +54,84 @@ public static synchronized ShuffleHandlerFactory getInstance() {
     return INSTANCE;
   }
 
+
   public ClientReadHandler createShuffleReadHandler(CreateShuffleReadHandlerRequest request) {
+    if (CollectionUtils.isEmpty(request.getShuffleServerInfoList())) {
+      throw new RuntimeException("Shuffle servers should not be empty!");
+    }
+    if (request.getShuffleServerInfoList().size() > 1) {

Review Comment:
   I think @xianjingfeng current implement is ok. We need a replicaHandler concept as a upper layer of composite handler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1010375603


##########
integration-test/common/src/test/java/org/apache/uniffle/test/ShuffleServerFaultToleranceTest.java:
##########
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.test;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.io.Files;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+
+import org.apache.uniffle.client.TestUtils;
+import org.apache.uniffle.client.api.ShuffleServerClient;
+import org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient;
+import org.apache.uniffle.client.request.RssRegisterShuffleRequest;
+import org.apache.uniffle.client.request.RssSendShuffleDataRequest;
+import org.apache.uniffle.common.PartitionRange;
+import org.apache.uniffle.common.ShuffleBlockInfo;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.coordinator.CoordinatorConf;
+import org.apache.uniffle.coordinator.CoordinatorServer;
+import org.apache.uniffle.server.MockedShuffleServer;
+import org.apache.uniffle.server.ShuffleServer;
+import org.apache.uniffle.server.ShuffleServerConf;
+import org.apache.uniffle.server.buffer.ShuffleBuffer;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.ComposedClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.LocalFileQuorumClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.MemoryQuorumClientReadHandler;
+import org.apache.uniffle.storage.util.StorageType;
+
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.fail;
+
+public class ShuffleServerFaultToleranceTest extends ShuffleReadWriteBase {
+
+  private List<ShuffleServerClient> shuffleServerClients;
+
+  @BeforeAll
+  public static void setupServers() throws Exception {
+    CoordinatorConf coordinatorConf = getCoordinatorConf();
+    createCoordinatorServer(coordinatorConf);
+    shuffleServers.add(createServer(0));
+    shuffleServers.add(createServer(1));
+    shuffleServers.add(createServer(2));
+    startServers();
+  }
+
+  @BeforeEach
+  public void createClient() {
+    shuffleServerClients = new ArrayList<>();
+    for (ShuffleServer shuffleServer : shuffleServers) {
+      shuffleServerClients.add(new ShuffleServerGrpcClient(shuffleServer.getIp(), shuffleServer.getPort()));
+    }
+  }
+
+  @AfterEach
+  public void cleanEnv() throws Exception {
+    shuffleServerClients.forEach((client) -> {
+      client.close();
+    });
+    cleanCluster();
+    setupServers();
+  }
+
+  @Test
+  public void testReadFaultTolerance() throws Exception {
+    String testAppId = "ShuffleServerFaultToleranceTest.testReadFaultTolerance";

Review Comment:
   I can't get your point. Can you explain what fault is involved in this case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1293053016

   > > For replica logic, if we use 7 replicas, we should read 4 replica successfully, but if maxFailure is 3 . Although we have 4 correct replicas, the application will fail because the app may read 3 wrong replicas.
   > 
   > I mean `maxFailTimes` is for each replica.
   
   It seems ok.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020883455


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   I see your case. Terrific! But in this case the fail times of LocalFileQuorumClientReadHandler reach the  maxHandlerFailTimes, LocalFileQuorumClientReadHandler have three LocalFileClientReadHandlers. Every  LocalFileClientReadHandler don't reach maxHandlerFailTimes. I mean that the fail times of LocalFileClientReadHandler  isn't meaningful. So could we change the code like my sayings? or the fail times of DataSkippableReadHandler is only used for HDFSFileClientReadHandler? 
   My concern is that the fail times of LocalFileClientReadHandler seems meaningless.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1030332793


##########
storage/src/main/java/org/apache/uniffle/storage/factory/ShuffleHandlerFactory.java:
##########
@@ -54,79 +53,84 @@ public static synchronized ShuffleHandlerFactory getInstance() {
   }
 
   public ClientReadHandler createShuffleReadHandler(CreateShuffleReadHandlerRequest request) {
+    return createShuffleReadHandler(request, null);
+  }
+
+  public ClientReadHandler createShuffleReadHandler(CreateShuffleReadHandlerRequest request,
+                                                    ShuffleServerInfo serverInfo) {
+    if (serverInfo == null && request.getShuffleServerInfoList().size() > 1) {

Review Comment:
   The logic is a little tricky.  Maybe they can extract the common logic to a depend method.  We can split this method to two methods. It's weird that we judge serverInfo whether is empty and decide which handler to create.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] frankliee commented on a diff in pull request #276: [WIP][ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
frankliee commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1031627360


##########
common/src/main/java/org/apache/uniffle/common/ShuffleServerInfo.java:
##########
@@ -27,6 +27,13 @@ public class ShuffleServerInfo implements Serializable {
 
   private int port;
 
+  // Only for test
+  public ShuffleServerInfo(String host, int port) {
+    this.id = host + "-" + port;
+    this.host = host;

Review Comment:
   It is a little strange to add a constructor just for test,  we can just use 
   `new ShuffleServerInfo(host + "_" + String.valueOf(port), host, port)` 



##########
storage/src/main/java/org/apache/uniffle/storage/factory/ShuffleHandlerFactory.java:
##########
@@ -53,80 +54,84 @@ public static synchronized ShuffleHandlerFactory getInstance() {
     return INSTANCE;
   }
 
+
   public ClientReadHandler createShuffleReadHandler(CreateShuffleReadHandlerRequest request) {
+    if (CollectionUtils.isEmpty(request.getShuffleServerInfoList())) {
+      throw new RuntimeException("Shuffle servers should not be empty!");
+    }
+    if (request.getShuffleServerInfoList().size() > 1) {

Review Comment:
   I agree with @jerqi, the current logic is too complicated.
   It is better to use an unified code path (by the way, one server is a special case of multiple servers) 
   
   I prefer to add a global data structure in composed handler, may be called "progress".
   It stores the information of consumed replicas and servers. 
   We could add the fallback in composed handler, and the each layer of handler can restart from the the position of last retry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [WIP][ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1031961832


##########
storage/src/main/java/org/apache/uniffle/storage/factory/ShuffleHandlerFactory.java:
##########
@@ -53,80 +54,84 @@ public static synchronized ShuffleHandlerFactory getInstance() {
     return INSTANCE;
   }
 
+
   public ClientReadHandler createShuffleReadHandler(CreateShuffleReadHandlerRequest request) {
+    if (CollectionUtils.isEmpty(request.getShuffleServerInfoList())) {
+      throw new RuntimeException("Shuffle servers should not be empty!");
+    }
+    if (request.getShuffleServerInfoList().size() > 1) {

Review Comment:
   > I prefer to add a global data structure in composed handler, may be called "progress".
   > It stores the information of consumed replicas and servers.
   > We could add the fallback in composed handler, and the each layer of handler can restart from the the last by reading the progress.
   
   This logic has the same problem as the previous version of this pr. If we read fail from the memory handler and read successful from the localfile handler. And then, the memory data flush to localfile and we read from memory again, some data maybe lost.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [WIP][ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1031962875


##########
common/src/main/java/org/apache/uniffle/common/ShuffleServerInfo.java:
##########
@@ -27,6 +27,13 @@ public class ShuffleServerInfo implements Serializable {
 
   private int port;
 
+  // Only for test
+  public ShuffleServerInfo(String host, int port) {
+    this.id = host + "-" + port;
+    this.host = host;

Review Comment:
   For unification and convenience. If we don't do this, we need modify many uts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291962984

   > So I feel that `maxRounds` isn't reasonable for this situation.
   
   I have another solution:
   1. Set maxFailTimes and failTime in every handler
   2. When `readShuffleData` fail, ++failTime 
   3. When `readShuffleData`  success, failTime = 0
   4. If failTime >= maxFailTimes , finished=true
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1292875866

   > > So I feel that `maxRounds` isn't reasonable for this situation.
   > 
   > I have another solution:
   > 
   > 1. Set maxFailTimes and failTime in every handler
   > 2. When `readShuffleData` fail, ++failTime
   > 3. When `readShuffleData`  success, failTime = 0
   > 4. If failTime >= maxFailTimes , finished=true
   >    What do you think?
   
   MaxFailureTime will tolerate the logic of replicas, this is my biggest concern.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012815252


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   I think retry in the rpc client is better than retry in handler



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1301570009

   Is there any design doc mentioned in #124?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] frankliee commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
frankliee commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1005152296


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/HdfsClientReadHandler.java:
##########
@@ -127,21 +127,32 @@ public ShuffleDataResult readShuffleData() {
       init(fullShufflePath);
     }
 
-    if (readHandlerIndex >= readHandlers.size()) {
+    if (finished()) {
       return new ShuffleDataResult();
     }
 
-    HdfsShuffleReadHandler hdfsShuffleFileReader = readHandlers.get(readHandlerIndex);
-    ShuffleDataResult shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
-
-    while (shuffleDataResult == null) {
-      ++readHandlerIndex;
+    HdfsShuffleReadHandler hdfsShuffleFileReader;

Review Comment:
   HdfsShuffleReadHandler hdfsShuffleFileReader = null; ?
    



##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/HdfsClientReadHandler.java:
##########
@@ -127,21 +127,32 @@ public ShuffleDataResult readShuffleData() {
       init(fullShufflePath);
     }
 
-    if (readHandlerIndex >= readHandlers.size()) {
+    if (finished()) {
       return new ShuffleDataResult();
     }
 
-    HdfsShuffleReadHandler hdfsShuffleFileReader = readHandlers.get(readHandlerIndex);
-    ShuffleDataResult shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
-
-    while (shuffleDataResult == null) {
-      ++readHandlerIndex;
+    HdfsShuffleReadHandler hdfsShuffleFileReader;
+    ShuffleDataResult shuffleDataResult = null;
+    do {
       if (readHandlerIndex >= readHandlers.size()) {
         return new ShuffleDataResult();
       }
       hdfsShuffleFileReader = readHandlers.get(readHandlerIndex);
-      shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
-    }
+      if (hdfsShuffleFileReader.finished()) {
+        ++readHandlerIndex;
+        continue;
+      }
+      try {
+        shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
+      } catch (Exception e) {
+        ++readHandlerIndex;
+        continue;
+      }
+
+      if (shuffleDataResult == null || shuffleDataResult.isEmpty()) {

Review Comment:
   It looks strange to use the same condition `(shuffleDataResult == null || shuffleDataResult.isEmpty())` twice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291778154

   > One question: Should we use the concept of `FALLBACK`. In my opinion, we use fallback mechanism to process error or exception, but it's normal for us to read memory, localfile and hdfs at the same time.
   
   So, what is your opinion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1295675952

   > @jerqi What wrong with CI now?
   
   Maybe your test code have endless loop. The test code can't stop.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1010110571


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MemoryQuorumClientReadHandler.java:
##########
@@ -33,6 +33,7 @@ public class MemoryQuorumClientReadHandler extends AbstractClientReadHandler {
   private static final Logger LOG = LoggerFactory.getLogger(MemoryQuorumClientReadHandler.class);
   private long lastBlockId = Constants.INVALID_BLOCK_ID;
   private List<MemoryClientReadHandler> handlers = Lists.newLinkedList();
+  private boolean readAfterFirstRound;

Review Comment:
   Why do we need this variable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1003954639


##########
client/src/main/java/org/apache/uniffle/client/impl/ShuffleReadClientImpl.java:
##########
@@ -60,6 +60,8 @@ public class ShuffleReadClientImpl implements ShuffleReadClient {
   private AtomicLong crcCheckTime = new AtomicLong(0);
   private ClientReadHandler clientReadHandler;
   private final IdHelper idHelper;
+  private int fallbackTimes;
+  private int maxFallbackTimes = 3;

Review Comment:
   We usually use constructor to pass the config option which we need for `ShuffleReadClieneImpl`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1003954639


##########
client/src/main/java/org/apache/uniffle/client/impl/ShuffleReadClientImpl.java:
##########
@@ -60,6 +60,8 @@ public class ShuffleReadClientImpl implements ShuffleReadClient {
   private AtomicLong crcCheckTime = new AtomicLong(0);
   private ClientReadHandler clientReadHandler;
   private final IdHelper idHelper;
+  private int fallbackTimes;
+  private int maxFallbackTimes = 3;

Review Comment:
   We usually use constructor to pass the config option which we need.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1003956021


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MemoryQuorumClientReadHandler.java:
##########
@@ -65,11 +72,26 @@ public ShuffleDataResult readShuffleData() {
       }
     }
 
-    if (!readSuccessful) {
+    if (!readSuccessful && !readAfterFallback) {

Review Comment:
   Why do we throw an exception here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291785949

   > > One question: Should we use the concept of `FALLBACK`. In my opinion, we use fallback mechanism to process error or exception, but it's normal for us to read memory, localfile and hdfs at the same time.
   > 
   > So, what is your opinion?
   
   1. Should we use method `next()` instead of `fallback()`?
   2. We shouldn't have the concept of `maxFallbackTimes`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020885432


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > But in this case the fail times of LocalFileQuorumClientReadHandler reach the maxHandlerFailTimes, LocalFileQuorumClientReadHandler have three LocalFileClientReadHandlers. Every LocalFileClientReadHandler don't reach maxHandlerFailTimes
   
   `LocalFileQuorumClientReadHandler`,  `MemoryQuorumClientReadHandler` and `HdfsClientReadHandler` have no `maxHandlerFailTimes`. If all LocalFileClientReadHandlers finished, it is finished.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi merged pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi merged PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1023841304


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > > Should we only modify the quorum handler's logic?
   > 
   > No, i think we need retry in ShuffleClient, we don't know whether we have read complete data until read from all quorum handlers in each round, and we need to check blocks consistent each round.
   
   If we pass the expectBlockId to the quorum handler, could we check blocks in the quorum handler?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020883455


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   I see your case. Terrific! But in my understanding the fail times of LocalFileQuorumClientReadHandler reach the  maxHandlerFailTimes, LocalFileQuorumClientReadHandler have three LocalFileClientReadHandlers. Every  LocalFileClientReadHandler don't reach maxHandlerFailTimes. I mean that the fail times of LocalFileClientReadHandler  isn't meaningful. So could we change the code like my sayings? or the fail times of DataSkippableReadHandler is only used for HDFSFileClientReadHandler? 
   My concern is that the fail times of LocalFileClientReadHandler seems meaningless.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020879699


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > Could you add a test case to explain when the fail time will reach the maxHandlerFailTimes?
   
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032470679


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MultiReplicaClientReadHandler.java:
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.storage.handler.impl;
+
+import java.util.List;
+
+import com.google.common.collect.Lists;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.common.BufferSegment;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.common.exception.RssException;
+import org.apache.uniffle.common.util.RssUtils;
+import org.apache.uniffle.storage.factory.ShuffleHandlerFactory;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.request.CreateShuffleReadHandlerRequest;
+
+public class MultiReplicaClientReadHandler extends AbstractClientReadHandler {
+
+  private static final Logger LOG = LoggerFactory.getLogger(MultiReplicaClientReadHandler.class);
+
+  private List<ClientReadHandler> handlers = Lists.newLinkedList();
+
+  private long readBlockNum = 0L;
+  private long readLength = 0L;
+  private long readUncompressLength = 0L;
+  private Roaring64NavigableMap blockIdBitmap;
+  private Roaring64NavigableMap processedBlockIds;
+
+  private int readHandlerIndex;
+
+  public MultiReplicaClientReadHandler(

Review Comment:
   Do you mean use `handers` as a param of this constructor?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020911727


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   This means a little complex. It's better that we can unify the behavior of handlers. We hope that we have a universe framework for handlers.Is it possible to accomplish this aim?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent
URL: https://github.com/apache/incubator-uniffle/pull/276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012615716


##########
storage/src/main/java/org/apache/uniffle/storage/request/CreateShuffleReadHandlerRequest.java:
##########
@@ -41,6 +41,7 @@ public class CreateShuffleReadHandlerRequest {
   private List<ShuffleServerInfo> shuffleServerInfoList;
   private Roaring64NavigableMap expectBlockIds;
   private Roaring64NavigableMap processBlockIds;
+  private int maxHanderFailTimes;

Review Comment:
   typo `maxHanderFailTimes` -> `maxHandlerFailTimes`, or rename it to maxHandlerRetries?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012596819


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   I mean according to your design the handle will be used again before reaching the `maxHanderFailTimes`, but the readHandlerIndex increased in the exception handling block, how can it be used again? Are there more rounds from the upper layer handler to traverse the handlers from index 0?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012804034


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   If the first server stop, should we retry in the handler in the first round? In most cases, handler fail is for shuffle server is unavailable, we should try anther handler right now in this case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291909035

   > If you have three replicas, every replica have memory, disk and hdfs. Whether maxRounds 3 is enough to read all the data?
   
   No guarantee. For example, the blocks is incomplete after first round, and than can't read from any shuffle server which store the missing blocks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1294990410

   @jerqi What wrong with CI now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1301633604

   > > I have another solution:
   > > 
   > > 1. Set maxFailTimes and failTime in every handler
   > > 2. When `readShuffleData` fail, ++failTime
   > > 3. When `readShuffleData`  success, failTime = 0
   > > 4. If failTime >= maxFailTimes , finished=true
   > 
   > If you can't understand, i will try to write a design doc. @duanmeng
   
   No hurry, let's focus on this pr first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1010127612


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MemoryQuorumClientReadHandler.java:
##########
@@ -33,6 +33,7 @@ public class MemoryQuorumClientReadHandler extends AbstractClientReadHandler {
   private static final Logger LOG = LoggerFactory.getLogger(MemoryQuorumClientReadHandler.class);
   private long lastBlockId = Constants.INVALID_BLOCK_ID;
   private List<MemoryClientReadHandler> handlers = Lists.newLinkedList();
+  private boolean readAfterFirstRound;

Review Comment:
   The purpose is make sure at least one server can be read successfully in the first round. And i think we can remove it now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1011088882


##########
integration-test/common/src/test/java/org/apache/uniffle/test/ShuffleServerFaultToleranceTest.java:
##########
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.test;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.io.Files;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+
+import org.apache.uniffle.client.TestUtils;
+import org.apache.uniffle.client.api.ShuffleServerClient;
+import org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient;
+import org.apache.uniffle.client.request.RssRegisterShuffleRequest;
+import org.apache.uniffle.client.request.RssSendShuffleDataRequest;
+import org.apache.uniffle.common.PartitionRange;
+import org.apache.uniffle.common.ShuffleBlockInfo;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.coordinator.CoordinatorConf;
+import org.apache.uniffle.coordinator.CoordinatorServer;
+import org.apache.uniffle.server.MockedShuffleServer;
+import org.apache.uniffle.server.ShuffleServer;
+import org.apache.uniffle.server.ShuffleServerConf;
+import org.apache.uniffle.server.buffer.ShuffleBuffer;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.ComposedClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.LocalFileQuorumClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.MemoryQuorumClientReadHandler;
+import org.apache.uniffle.storage.util.StorageType;
+
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.fail;
+
+public class ShuffleServerFaultToleranceTest extends ShuffleReadWriteBase {
+
+  private List<ShuffleServerClient> shuffleServerClients;
+
+  @BeforeAll
+  public static void setupServers() throws Exception {
+    CoordinatorConf coordinatorConf = getCoordinatorConf();
+    createCoordinatorServer(coordinatorConf);
+    shuffleServers.add(createServer(0));
+    shuffleServers.add(createServer(1));
+    shuffleServers.add(createServer(2));
+    startServers();
+  }
+
+  @BeforeEach
+  public void createClient() {
+    shuffleServerClients = new ArrayList<>();
+    for (ShuffleServer shuffleServer : shuffleServers) {
+      shuffleServerClients.add(new ShuffleServerGrpcClient(shuffleServer.getIp(), shuffleServer.getPort()));
+    }
+  }
+
+  @AfterEach
+  public void cleanEnv() throws Exception {
+    shuffleServerClients.forEach((client) -> {
+      client.close();
+    });
+    cleanCluster();
+    setupServers();
+  }
+
+  @Test
+  public void testReadFaultTolerance() throws Exception {
+    String testAppId = "ShuffleServerFaultToleranceTest.testReadFaultTolerance";

Review Comment:
   Blocks read inconsistent, as we discuss in #124. Because if it read empty result or incomplete blocks from the first server, it will not try other server.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012472184


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {

Review Comment:
   Could you explain why try multiple times here? 



##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   This exception would be handled by`HdfsClientReadHandler`, which increases the `handlerIndex`  and moves to the next handler, so what is the point of the `finished` flag here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012792038


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   >Yes, but what's the problem and what's the impact?I don't understand.
   
   Not a problem, we do not need to record the retries in the low-level handlers
   
   >Upper layer means ShuffleClient? I think we must retry in ShuffleClient
   
   Yes, agree. Check consistency and call the reshuffleData in `ShuffleClient`, and do the actual handlers retries (low-level handler loops) in the `clientReadHandler`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012804034


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   If the first server stop, should we retry in the handler in the first round? In most cases, handler fail is for shuffle server is unavailable, we should try another handler right now in this case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020858782


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > > we won't retry this handler because we throw the exception here
   > 
   > Why not? It just will not retry in current round, but it will retry in next round.
   
   If upper layer is LocalFileQuorumClientReadHanlder, we will retry another handler when we throw an exception.
   https://github.com/apache/incubator-uniffle/blob/4a3d2be36795df2eaf2edbe75ecc5816bf7eb87a/storage/src/main/java/org/apache/uniffle/storage/handler/impl/LocalFileQuorumClientReadHandler.java#L105
   
   It seems that fail handler times won't reach the maxHandlerFailTimes  in any time.   Is  it necessary that we increase the failTimes? Could we change the code?
   ```
           if (++failTimes > maxHandlerFailTimes) {
             isFinished = true;
           }
           throw e
   ```
   to 
   ```
   isFinished = true;
   throw e
   ```
   Could you add a test case to explain when the fail time will reach the maxHandlerFailTimes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1023954845


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/ComposedClientReadHandler.java:
##########
@@ -228,6 +228,40 @@ public void logConsumedBlockInfo() {
     LOG.info(getReadUncompressLengthInfo());
   }
 
+  @Override

Review Comment:
   Add some comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1023776201


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   The problem which we want to solve is that we want to read correct data when one replica don't have complete data. Should we only modify the quorum handler's logic? We don't expose the logic to upper layer. We process the fault tolerance in the quorum handler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent
URL: https://github.com/apache/incubator-uniffle/pull/276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent
URL: https://github.com/apache/incubator-uniffle/pull/276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1327970749

   LGTM except for minor issues , cc @Gustfh Do you have another suggestion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032491683


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MultiReplicaClientReadHandler.java:
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.storage.handler.impl;
+
+import java.util.List;
+
+import com.google.common.collect.Lists;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.common.BufferSegment;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.common.exception.RssException;
+import org.apache.uniffle.common.util.RssUtils;
+import org.apache.uniffle.storage.factory.ShuffleHandlerFactory;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.request.CreateShuffleReadHandlerRequest;
+
+public class MultiReplicaClientReadHandler extends AbstractClientReadHandler {
+
+  private static final Logger LOG = LoggerFactory.getLogger(MultiReplicaClientReadHandler.class);
+
+  private List<ClientReadHandler> handlers = Lists.newLinkedList();
+
+  private long readBlockNum = 0L;
+  private long readLength = 0L;
+  private long readUncompressLength = 0L;
+  private Roaring64NavigableMap blockIdBitmap;
+  private Roaring64NavigableMap processedBlockIds;
+
+  private int readHandlerIndex;
+
+  public MultiReplicaClientReadHandler(

Review Comment:
   Yes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1324917478

   The basic logic is ok. I think you can go ahead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] frankliee commented on a diff in pull request #276: [WIP][ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
frankliee commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1031644602


##########
storage/src/main/java/org/apache/uniffle/storage/factory/ShuffleHandlerFactory.java:
##########
@@ -53,80 +54,84 @@ public static synchronized ShuffleHandlerFactory getInstance() {
     return INSTANCE;
   }
 
+
   public ClientReadHandler createShuffleReadHandler(CreateShuffleReadHandlerRequest request) {
+    if (CollectionUtils.isEmpty(request.getShuffleServerInfoList())) {
+      throw new RuntimeException("Shuffle servers should not be empty!");
+    }
+    if (request.getShuffleServerInfoList().size() > 1) {

Review Comment:
   I agree with @jerqi, the current logic is too complicated.
   It is better to use an unified code path (by the way, one server is a special case of multiple servers) 
   
   I prefer to add a global data structure in composed handler, may be called "progress".
   It stores the information of consumed replicas and servers. 
   We could add the fallback in composed handler, and the each layer of handler can restart from the the last by reading the progress.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032499473


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MultiReplicaClientReadHandler.java:
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.storage.handler.impl;
+
+import java.util.List;
+
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.common.BufferSegment;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.common.exception.RssException;
+import org.apache.uniffle.common.util.RssUtils;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+
+public class MultiReplicaClientReadHandler extends AbstractClientReadHandler {
+
+  private static final Logger LOG = LoggerFactory.getLogger(MultiReplicaClientReadHandler.class);
+
+  private final List<ClientReadHandler> handlers;
+
+  private long readBlockNum = 0L;
+  private long readLength = 0L;
+  private long readUncompressLength = 0L;
+  private final Roaring64NavigableMap blockIdBitmap;
+  private final Roaring64NavigableMap processedBlockIds;
+
+  private int readHandlerIndex;
+
+  public MultiReplicaClientReadHandler(
+      List<ClientReadHandler> handlers,
+      Roaring64NavigableMap blockIdBitmap,
+      Roaring64NavigableMap processedBlockIds) {
+    this.handlers = handlers;
+    this.blockIdBitmap = blockIdBitmap;
+    this.processedBlockIds = processedBlockIds;
+  }
+
+  @Override
+  public ShuffleDataResult readShuffleData() {
+    ClientReadHandler handler;
+    ShuffleDataResult result = null;
+    do {
+      if (readHandlerIndex >= handlers.size()) {
+        return result;
+      }
+      handler = handlers.get(readHandlerIndex);
+      try {
+        result = handler.readShuffleData();
+      } catch (Exception e) {
+        LOG.warn("Failed to read a replica due to ", e);
+      }
+      if (result != null && !result.isEmpty()) {
+        return result;
+      } else {
+        readHandlerIndex++;
+        try {
+          RssUtils.checkProcessedBlockIds(blockIdBitmap, processedBlockIds);
+          return result;
+        } catch (RssException e) {
+          LOG.warn(e.getMessage());

Review Comment:
   Could we log more information to help us to debug?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1327263345

   PTAL @jerqi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1324900411

   > Could we use the order
   > server A memory -> server A LOCALFILE -> server A HDFS
   > server B memory -> server B LOCALFILE -> server B HDFS
   > server C memory -> server C LOCALFILE -> server C HDFS
   
   I have already followed this plan. If the basic logic is OK, i will remove some deprecated class and add more uts. @jerqi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012591367


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {

Review Comment:
   We are not sure how many cases will cause exception. I think most of the reasons are network problems or the server load is high. And i think we only need to consider most cases here, because stability is more important when there is not much difference in performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1010398531


##########
integration-test/common/src/test/java/org/apache/uniffle/test/ShuffleServerFaultToleranceTest.java:
##########
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.test;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.io.Files;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+
+import org.apache.uniffle.client.TestUtils;
+import org.apache.uniffle.client.api.ShuffleServerClient;
+import org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient;
+import org.apache.uniffle.client.request.RssRegisterShuffleRequest;
+import org.apache.uniffle.client.request.RssSendShuffleDataRequest;
+import org.apache.uniffle.common.PartitionRange;
+import org.apache.uniffle.common.ShuffleBlockInfo;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.coordinator.CoordinatorConf;
+import org.apache.uniffle.coordinator.CoordinatorServer;
+import org.apache.uniffle.server.MockedShuffleServer;
+import org.apache.uniffle.server.ShuffleServer;
+import org.apache.uniffle.server.ShuffleServerConf;
+import org.apache.uniffle.server.buffer.ShuffleBuffer;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.ComposedClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.LocalFileQuorumClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.MemoryQuorumClientReadHandler;
+import org.apache.uniffle.storage.util.StorageType;
+
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.fail;
+
+public class ShuffleServerFaultToleranceTest extends ShuffleReadWriteBase {
+
+  private List<ShuffleServerClient> shuffleServerClients;
+
+  @BeforeAll
+  public static void setupServers() throws Exception {
+    CoordinatorConf coordinatorConf = getCoordinatorConf();
+    createCoordinatorServer(coordinatorConf);
+    shuffleServers.add(createServer(0));
+    shuffleServers.add(createServer(1));
+    shuffleServers.add(createServer(2));
+    startServers();
+  }
+
+  @BeforeEach
+  public void createClient() {
+    shuffleServerClients = new ArrayList<>();
+    for (ShuffleServer shuffleServer : shuffleServers) {
+      shuffleServerClients.add(new ShuffleServerGrpcClient(shuffleServer.getIp(), shuffleServer.getPort()));
+    }
+  }
+
+  @AfterEach
+  public void cleanEnv() throws Exception {
+    shuffleServerClients.forEach((client) -> {
+      client.close();
+    });
+    cleanCluster();
+    setupServers();
+  }
+
+  @Test
+  public void testReadFaultTolerance() throws Exception {
+    String testAppId = "ShuffleServerFaultToleranceTest.testReadFaultTolerance";

Review Comment:
   1. This case mainly tests whether we can read the complete data from `MemoryQuorumClientReadHandler` and `LocalFileQuorumClientReadHandler`  when one shuffle server stop
   2. Three shuffle servers will be started here and blocks will only be send the second server and third server every times
   3. Make sure read data from memory in the first time, and read data from localfile in the second time



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1010398531


##########
integration-test/common/src/test/java/org/apache/uniffle/test/ShuffleServerFaultToleranceTest.java:
##########
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.test;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.io.Files;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+
+import org.apache.uniffle.client.TestUtils;
+import org.apache.uniffle.client.api.ShuffleServerClient;
+import org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient;
+import org.apache.uniffle.client.request.RssRegisterShuffleRequest;
+import org.apache.uniffle.client.request.RssSendShuffleDataRequest;
+import org.apache.uniffle.common.PartitionRange;
+import org.apache.uniffle.common.ShuffleBlockInfo;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.coordinator.CoordinatorConf;
+import org.apache.uniffle.coordinator.CoordinatorServer;
+import org.apache.uniffle.server.MockedShuffleServer;
+import org.apache.uniffle.server.ShuffleServer;
+import org.apache.uniffle.server.ShuffleServerConf;
+import org.apache.uniffle.server.buffer.ShuffleBuffer;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.ComposedClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.LocalFileQuorumClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.MemoryQuorumClientReadHandler;
+import org.apache.uniffle.storage.util.StorageType;
+
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.fail;
+
+public class ShuffleServerFaultToleranceTest extends ShuffleReadWriteBase {
+
+  private List<ShuffleServerClient> shuffleServerClients;
+
+  @BeforeAll
+  public static void setupServers() throws Exception {
+    CoordinatorConf coordinatorConf = getCoordinatorConf();
+    createCoordinatorServer(coordinatorConf);
+    shuffleServers.add(createServer(0));
+    shuffleServers.add(createServer(1));
+    shuffleServers.add(createServer(2));
+    startServers();
+  }
+
+  @BeforeEach
+  public void createClient() {
+    shuffleServerClients = new ArrayList<>();
+    for (ShuffleServer shuffleServer : shuffleServers) {
+      shuffleServerClients.add(new ShuffleServerGrpcClient(shuffleServer.getIp(), shuffleServer.getPort()));
+    }
+  }
+
+  @AfterEach
+  public void cleanEnv() throws Exception {
+    shuffleServerClients.forEach((client) -> {
+      client.close();
+    });
+    cleanCluster();
+    setupServers();
+  }
+
+  @Test
+  public void testReadFaultTolerance() throws Exception {
+    String testAppId = "ShuffleServerFaultToleranceTest.testReadFaultTolerance";

Review Comment:
   1. This case mainly tests whether we can read the complete data from `MemoryQuorumClientReadHandler` and `LocalFileQuorumClientReadHandler`  when the data of the first server is incomplete
   2. Three shuffle servers will be started here and blocks will only be send to the second server and third server every times
   3. Make sure read data from memory in the first time, and read data from localfile in the second time



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012776363


##########
storage/src/main/java/org/apache/uniffle/storage/request/CreateShuffleReadHandlerRequest.java:
##########
@@ -41,6 +41,7 @@ public class CreateShuffleReadHandlerRequest {
   private List<ShuffleServerInfo> shuffleServerInfoList;
   private Roaring64NavigableMap expectBlockIds;
   private Roaring64NavigableMap processBlockIds;
+  private int maxHanderFailTimes;

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291715947

   One question: Should we use the concept of `FALLBACK`. In my opinion, we use fallback mechanism to process error or exception, but it's normal for us to read memory, localfile and hdfs at the same time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent
URL: https://github.com/apache/incubator-uniffle/pull/276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1293050481

   > For replica logic, if we use 7 replicas, we should read 4 replica successfully, but if maxFailure is 3 . Although we have 4 correct replicas, the application will fail because the app may read 3 wrong replicas.
   
   I mean `maxFailTimes` is for each replica.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1023747411


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   I don't have a good idea. How about add an abstract class  for each of  quorum handler and single handler? Single handler will retry until reach maxHandlerFailTimes and quorum handler will retry until all child handlers finished. We can also put other common logic into it. @jerqi 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020845927


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   1.If this exception thrown here, the upper handler will catch it and move to next lower handler.
   2.if the network temporarily unavailable, read shuffle index may fail here,so i think we should retry in next round, until reach `maxHandlerFailTimes`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1020839148


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   I don't understand here
   ```
           if (++failTimes > maxHandlerFailTimes) {
             isFinished = true;
           }
           throw e;
   ```
   If we throw a exception to upper layer,  I feel that it means that the handler is finished. Why do we need to reach the max failure time and then set the `isFinished` value to the `true`?
   Could we add a test case for this situation to help us understand?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1023806989


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -67,8 +68,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHandlerFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > Should we only modify the quorum handler's logic?
   
   No, i think we need retry in ShuffleClient, we don't know whether we have read complete data until read from all quorum handlers in each round, and we need to check blocks consistent each round.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1293042041

   > MaxFailureTime will tolerate the logic of replicas, this is my biggest concern.
   
   I don't understand


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291866763

   > > * Should we use method `next()` instead of `fallback()`?
   > > * We shouldn't have the concept of `maxFallbackTimes`.
   > 
   > How about fallback() -> nextRound() and maxFallbackTimes -> maxRounds?
   If you have three replicas, every replica have memory, disk and hdfs. Whether maxRounds 3 is enough to read all the data?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1291910507

   > > If you have three replicas, every replica have memory, disk and hdfs. Whether maxRounds 3 is enough to read all the data?
   > 
   > No guarantee. For example, the blocks is incomplete after first round, and than can't read from any shuffle server which store the missing blocks
   
   So I feel that `maxRounds` isn't reasonable for this situation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012547549


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {

Review Comment:
   I can't get your point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012593385


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {

Review Comment:
   Or we can optimize it in another pr. Because this change is also complex.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012550589


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   `isFinished` is for mark don't try read from this handler.  In original logic, if `HdfsClientReadHandler` catch this exception, it will not retry read from this handler, but it will retry in this pr.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012747903


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {
+          isFinished = true;
+        }
+        throw e;

Review Comment:
   > So it may loop the handlers a second time, skip the finished ones and some handlers may be still unfinished for the maxHandlerFailTimes might be greater than 2?
   
   Yes, but what's the problem and what's the impact?I don't understand.
   
   > * Do the retry in the upper layer
   
   Upper layer means `ShuffleClient`? I think we must retry in `ShuffleClient`, because we don't need read from all handler in every round, and we need to check blocks consistent every round



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012748453


##########
storage/src/main/java/org/apache/uniffle/storage/request/CreateShuffleReadHandlerRequest.java:
##########
@@ -41,6 +41,7 @@ public class CreateShuffleReadHandlerRequest {
   private List<ShuffleServerInfo> shuffleServerInfoList;
   private Roaring64NavigableMap expectBlockIds;
   private Roaring64NavigableMap processBlockIds;
+  private int maxHanderFailTimes;

Review Comment:
   I think `maxHandlerFailTimes` is better



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] duanmeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
duanmeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1012580446


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java:
##########
@@ -59,8 +60,17 @@ public DataSkippableReadHandler(
 
   public ShuffleDataResult readShuffleData() {
     if (shuffleDataSegments.isEmpty()) {
-      ShuffleIndexResult shuffleIndexResult = readShuffleIndex();
+      ShuffleIndexResult shuffleIndexResult;
+      try {
+        shuffleIndexResult = readShuffleIndex();
+      } catch (Exception e) {
+        if (++failTimes > maxHanderFailTimes) {

Review Comment:
   The file (index, data) associated with the handle should be corrupted somewhere, or not even exist if the handle fails to read. Could you help to explain the benefits to try multiple times?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1011082247


##########
integration-test/common/src/test/java/org/apache/uniffle/test/ShuffleServerFaultToleranceTest.java:
##########
@@ -0,0 +1,233 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.test;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.io.Files;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.roaringbitmap.longlong.Roaring64NavigableMap;
+
+import org.apache.uniffle.client.TestUtils;
+import org.apache.uniffle.client.api.ShuffleServerClient;
+import org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient;
+import org.apache.uniffle.client.request.RssRegisterShuffleRequest;
+import org.apache.uniffle.client.request.RssSendShuffleDataRequest;
+import org.apache.uniffle.common.PartitionRange;
+import org.apache.uniffle.common.ShuffleBlockInfo;
+import org.apache.uniffle.common.ShuffleDataResult;
+import org.apache.uniffle.coordinator.CoordinatorConf;
+import org.apache.uniffle.coordinator.CoordinatorServer;
+import org.apache.uniffle.server.MockedShuffleServer;
+import org.apache.uniffle.server.ShuffleServer;
+import org.apache.uniffle.server.ShuffleServerConf;
+import org.apache.uniffle.server.buffer.ShuffleBuffer;
+import org.apache.uniffle.storage.handler.api.ClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.ComposedClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.LocalFileQuorumClientReadHandler;
+import org.apache.uniffle.storage.handler.impl.MemoryQuorumClientReadHandler;
+import org.apache.uniffle.storage.util.StorageType;
+
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.fail;
+
+public class ShuffleServerFaultToleranceTest extends ShuffleReadWriteBase {
+
+  private List<ShuffleServerClient> shuffleServerClients;
+
+  @BeforeAll
+  public static void setupServers() throws Exception {
+    CoordinatorConf coordinatorConf = getCoordinatorConf();
+    createCoordinatorServer(coordinatorConf);
+    shuffleServers.add(createServer(0));
+    shuffleServers.add(createServer(1));
+    shuffleServers.add(createServer(2));
+    startServers();
+  }
+
+  @BeforeEach
+  public void createClient() {
+    shuffleServerClients = new ArrayList<>();
+    for (ShuffleServer shuffleServer : shuffleServers) {
+      shuffleServerClients.add(new ShuffleServerGrpcClient(shuffleServer.getIp(), shuffleServer.getPort()));
+    }
+  }
+
+  @AfterEach
+  public void cleanEnv() throws Exception {
+    shuffleServerClients.forEach((client) -> {
+      client.close();
+    });
+    cleanCluster();
+    setupServers();
+  }
+
+  @Test
+  public void testReadFaultTolerance() throws Exception {
+    String testAppId = "ShuffleServerFaultToleranceTest.testReadFaultTolerance";

Review Comment:
   If we don't have this patch, what will it happen in this case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1301617358

   > I have another solution:
   > 
   > 1. Set maxFailTimes and failTime in every handler
   > 2. When `readShuffleData` fail, ++failTime
   > 3. When `readShuffleData`  success, failTime = 0
   > 4. If failTime >= maxFailTimes , finished=true
   
   If you can't understand, i will try to write a design doc.  @duanmeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032498162


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/HdfsClientReadHandler.java:
##########
@@ -154,23 +158,26 @@ public ShuffleDataResult readShuffleData() {
       init(fullShufflePath);
     }
 
-    if (readHandlerIndex >= readHandlers.size()) {
-      return new ShuffleDataResult();
-    }
-
-    HdfsShuffleReadHandler hdfsShuffleFileReader = readHandlers.get(readHandlerIndex);
-    ShuffleDataResult shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
-
-    while (shuffleDataResult == null) {
-      ++readHandlerIndex;
+    HdfsShuffleReadHandler hdfsShuffleFileReader;
+    ShuffleDataResult shuffleDataResult;
+    do {
       if (readHandlerIndex >= readHandlers.size()) {
         return new ShuffleDataResult();
       }
       hdfsShuffleFileReader = readHandlers.get(readHandlerIndex);
-      shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
-    }
+      try {
+        shuffleDataResult = hdfsShuffleFileReader.readShuffleData();
+      } catch (Exception e) {
+        ++readHandlerIndex;
+        continue;
+      }
 
-    return shuffleDataResult;
+      if (shuffleDataResult == null || shuffleDataResult.isEmpty()) {
+        ++readHandlerIndex;
+      } else {
+        return shuffleDataResult;
+      }
+    } while (true);

Review Comment:
   `While(true)` is not a good code style.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1032729966


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/HdfsShuffleReadHandler.java:
##########
@@ -94,8 +94,8 @@ protected ShuffleIndexResult readShuffleIndex() {
       return new ShuffleIndexResult(indexData, dateFileLen);
     } catch (Exception e) {
       LOG.info("Fail to read index files {}.index", filePrefix, e);
+      throw e;

Review Comment:
   Why do we need this? If this is not necessary for this pr, I prefer less change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #276: Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#issuecomment-1288871118

   What's the relation between this pr and #129 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent
URL: https://github.com/apache/incubator-uniffle/pull/276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #276: Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1003939505


##########
client/src/main/java/org/apache/uniffle/client/impl/ShuffleReadClientImpl.java:
##########
@@ -60,6 +60,8 @@ public class ShuffleReadClientImpl implements ShuffleReadClient {
   private AtomicLong crcCheckTime = new AtomicLong(0);
   private ClientReadHandler clientReadHandler;
   private final IdHelper idHelper;
+  private int fallbackTimes;
+  private int maxFallbackTimes = 3;

Review Comment:
   Why should `maxFallbackTimes` be 3?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #276: [ISSUE-124] Add fallback mechanism for blocks read inconsistent

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #276:
URL: https://github.com/apache/incubator-uniffle/pull/276#discussion_r1003959454


##########
storage/src/main/java/org/apache/uniffle/storage/handler/impl/MemoryQuorumClientReadHandler.java:
##########
@@ -65,11 +72,26 @@ public ShuffleDataResult readShuffleData() {
       }
     }
 
-    if (!readSuccessful) {
+    if (!readSuccessful && !readAfterFallback) {

Review Comment:
   Because we don't know where the missing blocks be, even if read fail from memory after fallback, these blocks may also be complete.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org