You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/10/23 06:19:44 UTC

[GitHub] [incubator-uniffle] xianjingfeng opened a new pull request, #235: Write to hdfs when local disk can't be write

xianjingfeng opened a new pull request, #235:
URL: https://github.com/apache/incubator-uniffle/pull/235

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://github.com/Tencent/Firestorm/blob/master/CONTRIBUTING.md
     2. Ensure you have added or run the appropriate tests for your PR
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]XXXX Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Write to hdfs when local disk can't be write
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   There should be a fallback mechanism when disk can't be write. #163 
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Already added


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r979508321


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -80,24 +80,43 @@ public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFl
         if (request == null) {
           return false;
         }
-        storage = warmStorageManager.selectStorage(event);
+        storageManager = storageManager == warmStorageManager ? coldStorageManager : warmStorageManager;
+        event.setStorageManager(storageManager);
+        storage = storageManager.selectStorage(event);
         handler = storage.getOrCreateWriteHandler(request);
       } catch (IOException ioe) {
         LOG.warn("Create fallback write handler failed ", ioe);
         return false;
       }
-      return warmStorageManager.write(storage, handler, event);
+      return storageManager.write(storage, handler, event);
     } else {
       return storageManager.write(storage, handler, event);
     }
   }
 
   private StorageManager selectStorageManager(ShuffleDataFlushEvent event) {
+    if (event.getStorageManager() != null) {

Review Comment:
   Could we give a general abstraction for fallback mechanisms. I mean that we can have an fallback interface. And implement two strategies. One is that HDFS can fallback to LOCALFILE, the other is that LOCALFILE can fallback to HDFS. We can choose one with the configuration. And we can expand other strategies if we need.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleDataFlushEvent.java:
##########
@@ -90,6 +92,14 @@ public ShuffleBuffer getShuffleBuffer() {
     return shuffleBuffer;
   }
 
+  public StorageManager getStorageManager() {

Review Comment:
   It's a little weird that ShuffleDataFlushEvent have a member field `StorageManager`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1003138279


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   Is there a possibility that the disk is temporarily full and the network is temporarily unavailable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005392878


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {

Review Comment:
   Why do we add the parameter `request`? Why don't use the origin way?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005439635


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {

Review Comment:
   In origin way, we can get the request from the storage.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston merged pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston merged PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005305601


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   I think the original mechanism is for when hdfs can't be writed. And i think it is involved in this strategy. In this pr, if you want to write to hdfs as much as possible when hdfs temporarily can't be write, you can set `retryTimes` to a larger value, and it will be same as original mechanism when `retryTimes` is equal to `retryMax`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] LuciferYang commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1037044436


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   @jerqi Busy at the end of the year. I'll think this on the weekend
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1037169331


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   > I have two directions to improve this part of MultiStorageManager
   > 
   > 1. Remove the `storageManagerCache` , it looks unused and bring some problems. One possible problem I have seen is that, when one event enters into pending queue due to storage cannot write, it will not have a chance to get a new fallback strategy due to cache.
   > 2. Avoid changing the external reference object in `write` method. If we want to fallback write to other storage. We could use the invoking sequence like. (selectStorage -> write)  if failed and then (selectStorage -> write), instead of selectStorage -> write (failed) -> write(choose other storage in this method). That means we should change the fallback strategy invoked from `write` method to `selectStorage` method.
   
   Agree



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1006474951


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   If shuffle server have many disk errors, they will fallback many small io write request to HDFS, HDFS may be influenced. So not all people want to fallback from local file to HDFS.  Maybe we need an configuration option to make users to select his best strategy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#issuecomment-1294599758

   Merged. @xianjingfeng Thanks for your contribution


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005436290


##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... candidates) {
+    if (fallBackTimes > 0
+        && (event.getRetryTimes() < fallBackTimes || event.getRetryTimes() % fallBackTimes > 0)) {

Review Comment:
   `event.getRetryTimes() % fallBackTimes` 
   Why do we need this condition?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#issuecomment-1257384669

   I think this PR is a good improvement! We also need this PR to avoid the problem of full local disk, although we dont hope to enable the big block directly written to HDFS.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write
URL: https://github.com/apache/incubator-uniffle/pull/235


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1002981557


##########
server/src/main/java/org/apache/uniffle/server/ShuffleDataFlushEvent.java:
##########
@@ -90,6 +92,14 @@ public ShuffleBuffer getShuffleBuffer() {
     return shuffleBuffer;
   }
 
+  public StorageManager getStorageManager() {

Review Comment:
   It's strange that the event contains storage manager for me.  Could we record the mapping information in the storage manager?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] codecov-commenter commented on pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#issuecomment-1253530981

   # [Codecov](https://codecov.io/gh/apache/incubator-uniffle/pull/235?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#235](https://codecov.io/gh/apache/incubator-uniffle/pull/235?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (4dccc6a) into [master](https://codecov.io/gh/apache/incubator-uniffle/commit/e6b4260aa54a9ffda30ec02fa050be1c108e8a9f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e6b4260) will **increase** coverage by `0.13%`.
   > The diff coverage is `57.14%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #235      +/-   ##
   ============================================
   + Coverage     59.17%   59.30%   +0.13%     
   - Complexity     1332     1337       +5     
   ============================================
     Files           160      160              
     Lines          8732     8744      +12     
     Branches        819      821       +2     
   ============================================
   + Hits           5167     5186      +19     
   + Misses         3300     3289      -11     
   - Partials        265      269       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-uniffle/pull/235?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...he/uniffle/server/storage/MultiStorageManager.java](https://codecov.io/gh/apache/incubator-uniffle/pull/235/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9zdG9yYWdlL011bHRpU3RvcmFnZU1hbmFnZXIuamF2YQ==) | `39.13% <57.14%> (+1.63%)` | :arrow_up: |
   | [...he/uniffle/server/buffer/ShuffleBufferManager.java](https://codecov.io/gh/apache/incubator-uniffle/pull/235/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9idWZmZXIvU2h1ZmZsZUJ1ZmZlck1hbmFnZXIuamF2YQ==) | `82.15% <0.00%> (-1.04%)` | :arrow_down: |
   | [...rg/apache/uniffle/server/ShuffleServerMetrics.java](https://codecov.io/gh/apache/incubator-uniffle/pull/235/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9TaHVmZmxlU2VydmVyTWV0cmljcy5qYXZh) | `96.22% <0.00%> (+0.10%)` | :arrow_up: |
   | [...he/uniffle/server/storage/LocalStorageManager.java](https://codecov.io/gh/apache/incubator-uniffle/pull/235/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9zdG9yYWdlL0xvY2FsU3RvcmFnZU1hbmFnZXIuamF2YQ==) | `74.03% <0.00%> (+12.50%)` | :arrow_up: |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1002838002


##########
server/src/main/java/org/apache/uniffle/server/ShuffleDataFlushEvent.java:
##########
@@ -90,6 +92,14 @@ public ShuffleBuffer getShuffleBuffer() {
     return shuffleBuffer;
   }
 
+  public StorageManager getStorageManager() {

Review Comment:
   Because `selectStorageManager` will be called again In `org.apache.uniffle.server.storage.MultiStorageManager#updateWriteMetrics`, The metrics will be incorrent if we don't this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1002838002


##########
server/src/main/java/org/apache/uniffle/server/ShuffleDataFlushEvent.java:
##########
@@ -90,6 +92,14 @@ public ShuffleBuffer getShuffleBuffer() {
     return shuffleBuffer;
   }
 
+  public StorageManager getStorageManager() {

Review Comment:
   Because `selectStorageManager` will be called again In `org.apache.uniffle.server.storage.MultiStorageManager#updateWriteMetrics`, The metrics will be incorrent if we don't do this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1006744022


##########
server/src/main/java/org/apache/uniffle/server/ShuffleServerConf.java:
##########
@@ -263,6 +263,12 @@ public class ShuffleServerConf extends RssBaseConf {
       .defaultValue(64L * 1024L * 1024L)
       .withDescription("For multistorage, the event size exceed this value, flush data  to cold storage");
 
+  public static final ConfigOption<String> MULTISTORAGE_FALLBACK_STRATEGY_CLASS = ConfigOptions
+      .key("rss.server.multistorage.fallback.strategy.class")
+      .stringType()
+      .noDefaultValue()

Review Comment:
   We should choose origin behavior as default value. 
   Could we add some docs for this config option?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005454026


##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... candidates) {
+    if (fallBackTimes > 0
+        && (event.getRetryTimes() < fallBackTimes || event.getRetryTimes() % fallBackTimes > 0)) {

Review Comment:
   Until reach retryMax



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005581158


##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... candidates) {
+    if (fallBackTimes > 0
+        && (event.getRetryTimes() < fallBackTimes || event.getRetryTimes() % fallBackTimes > 0)) {

Review Comment:
   I got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1003085785


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   In the original fallback strategy, blocks will not be written to `coldStorageManager` after `retryTimes > fallBackTimes` again. How about If `warmStorageManager` can't be written too. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005446879


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {

Review Comment:
   > Do you mean we can get the request from the old storage when create a new handler?
   
   Yes. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1037029792


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   > > Currently I have no ideas. I just want to use this fallback strategy and review this part.
   > 
   > Flush process worry me a lot. I'm not satisfied about the code quality. But I don't have idea how to improve them. Maybe @LuciferYang @advancedxy can help us.
   
   @jerqi I'm just browsing the code. Let's discuss details later when I get more context.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005276140


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();

Review Comment:
   Ok, I got it.



##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);
         }
-        storage = warmStorageManager.selectStorage(event);
-        handler = storage.getOrCreateWriteHandler(request);
       } catch (IOException ioe) {
         LOG.warn("Create fallback write handler failed ", ioe);
-        return false;
       }
-      return warmStorageManager.write(storage, handler, event);
-    } else {
-      return storageManager.write(storage, handler, event);
     }
+    boolean success = storageManager.write(storage, handler, event, request);

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005285301


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   I think the original retryTimes fallback is for the big data block directly flushed to HDFS. If we preserve this mechanism, we also should preserve this fallback strategy. Is it involved in this strategy?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005305601


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   I think the original mechanism is for when hdfs can't be writed. And i think it is involved in this strategy. In this pr, if you want to write to hdfs as much as possible when hdfs temporarily can't be write, you can set `fallBackTimes` to a larger value, and it will be same as original mechanism when `fallBackTimes` is equal to `retryMax`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1003079231


##########
server/src/main/java/org/apache/uniffle/server/ShuffleDataFlushEvent.java:
##########
@@ -90,6 +92,14 @@ public ShuffleBuffer getShuffleBuffer() {
     return shuffleBuffer;
   }
 
+  public StorageManager getStorageManager() {

Review Comment:
   Agree. and i just found out `org.apache.uniffle.server.storage.MultiStorageManager#updateWriteMetrics` is not referenced, maybe we don't need do anything in the method



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005625411


##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... candidates) {
+    if (fallBackTimes > 0
+        && (event.getRetryTimes() < fallBackTimes || event.getRetryTimes() % fallBackTimes > 0)) {
+      return current;
+    }
+    int nextIdx = -1;
+    for (int i = 0; i < candidates.length; i++) {
+      if (current == candidates[i]) {
+        nextIdx = (i + 1) % candidates.length;

Review Comment:
   I think it's easy logic. One loop is enough. If you insist on it, I'm also ok for two loops.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005166663


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();

Review Comment:
   Why do we throw this exception? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005445621


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {

Review Comment:
   Do you mean we can get the request from the old storage when create a new handler?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005305601


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   I think the original mechanism is for when hdfs can't be writed. And i think is it involved in this strategy. In this pr, if you want to write to hdfs as much as possible when hdfs temporarily can't be write, you can set `retryTimes` to a larger value, and it will be same as original mechanism when `retryTimes` is equal to `retryMax`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#issuecomment-1273068493

   Do you have time to invest this PR, I hope this can be introduced in our company internal version, looking forward to be merged assp. @xianjingfeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1035595678


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   Currently I have no ideas. I just want to use this fallback strategy and review this part.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1035598713


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   > Currently I have no ideas. I just want to use this fallback strategy and review this part.
   
   Flush process worry me a lot. I'm not satisfied about the code quality. But I don't have idea how to improve them. Maybe @LuciferYang @advancedxy can help us.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r979509980


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -80,24 +80,43 @@ public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFl
         if (request == null) {
           return false;
         }
-        storage = warmStorageManager.selectStorage(event);
+        storageManager = storageManager == warmStorageManager ? coldStorageManager : warmStorageManager;
+        event.setStorageManager(storageManager);
+        storage = storageManager.selectStorage(event);
         handler = storage.getOrCreateWriteHandler(request);
       } catch (IOException ioe) {
         LOG.warn("Create fallback write handler failed ", ioe);
         return false;
       }
-      return warmStorageManager.write(storage, handler, event);
+      return storageManager.write(storage, handler, event);
     } else {
       return storageManager.write(storage, handler, event);
     }
   }
 
   private StorageManager selectStorageManager(ShuffleDataFlushEvent event) {
+    if (event.getStorageManager() != null) {

Review Comment:
   Good idea!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1002723836


##########
server/src/main/java/org/apache/uniffle/server/ShuffleDataFlushEvent.java:
##########
@@ -90,6 +92,14 @@ public ShuffleBuffer getShuffleBuffer() {
     return shuffleBuffer;
   }
 
+  public StorageManager getStorageManager() {

Review Comment:
   Why do we use storageManager as field?



##########
server/src/main/java/org/apache/uniffle/server/storage/SingleStorageManager.java:
##########
@@ -113,6 +115,22 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
     }
   }
 
+
+  @Override
+  public boolean canWrite(ShuffleDataFlushEvent event) {
+    try {
+      Storage storage = selectStorage(event);
+      // if storage is null, appId may not be registered
+      if (storage == null || !storage.canWrite()) {
+        return false;
+      }
+      return true;
+    } catch (Exception e) {
+      LOG.warn("", e);

Review Comment:
   ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005247150


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);
         }
-        storage = warmStorageManager.selectStorage(event);
-        handler = storage.getOrCreateWriteHandler(request);
       } catch (IOException ioe) {
         LOG.warn("Create fallback write handler failed ", ioe);
-        return false;
       }
-      return warmStorageManager.write(storage, handler, event);
-    } else {
-      return storageManager.write(storage, handler, event);
     }
+    boolean success = storageManager.write(storage, handler, event, request);

Review Comment:
   No, because` selectStorageManager` will be called multiple times



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005244279


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();

Review Comment:
   1.`org.apache.uniffle.server.storage.MultiStorageManager#updateWriteMetrics` is not referenced, so it is useless.
   2.Cache entry for this event will be invalidated after write success



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005452628


##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... candidates) {
+    if (fallBackTimes > 0
+        && (event.getRetryTimes() < fallBackTimes || event.getRetryTimes() % fallBackTimes > 0)) {

Review Comment:
   If one storage can't not write for `fallBackTimes` times, it can fallback to another



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#issuecomment-1254716130

   There are some flaky ut
   
   `
   java.lang.ClassCastException: org.apache.spark.shuffle.RssShuffleManager cannot be cast to org.apache.uniffle.test.GetShuffleReportForMultiPartTest$RssShuffleManagerWrapper
   	at org.apache.uniffle.test.GetShuffleReportForMultiPartTest.runTest(GetShuffleReportForMultiPartTest.java:180)
   	at org.apache.uniffle.test.SparkIntegrationTestBase.runSparkApp(SparkIntegrationTestBase.java:74)
   	at org.apache.uniffle.test.SparkIntegrationTestBase.run(SparkIntegrationTestBase.java:52)
   	at org.apache.uniffle.test.GetShuffleReportForMultiPartTest.resultCompareTest(GetShuffleReportForMultiPartTest.java:141)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   `
   
   
   
   `
   Error:  Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 12.256 s <<< FAILURE! - in org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategyTest
   Error:  selectStorageTest  Time elapsed: 6.083 s  <<< FAILURE!
   org.opentest4j.AssertionFailedError: expected: <hdfs://p2> but was: <hdfs://p1>
   	at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
   	at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
   	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
   	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
   	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
   	at org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategyTest.selectStorageTest(LowestIOSampleCostSelectStorageStrategyTest.java:133)
   `
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r979514633


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -80,24 +80,43 @@ public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFl
         if (request == null) {
           return false;
         }
-        storage = warmStorageManager.selectStorage(event);
+        storageManager = storageManager == warmStorageManager ? coldStorageManager : warmStorageManager;
+        event.setStorageManager(storageManager);
+        storage = storageManager.selectStorage(event);
         handler = storage.getOrCreateWriteHandler(request);
       } catch (IOException ioe) {
         LOG.warn("Create fallback write handler failed ", ioe);
         return false;
       }
-      return warmStorageManager.write(storage, handler, event);
+      return storageManager.write(storage, handler, event);
     } else {
       return storageManager.write(storage, handler, event);
     }
   }
 
   private StorageManager selectStorageManager(ShuffleDataFlushEvent event) {
+    if (event.getStorageManager() != null) {

Review Comment:
   good idea



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write
URL: https://github.com/apache/incubator-uniffle/pull/235


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r979509794


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -71,7 +71,7 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > fallBackTimes) {

Review Comment:
   Why not put these select condition into one abstract method? such as being involved in `selectStorageManager`



##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -80,24 +80,43 @@ public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFl
         if (request == null) {
           return false;
         }
-        storage = warmStorageManager.selectStorage(event);
+        storageManager = storageManager == warmStorageManager ? coldStorageManager : warmStorageManager;
+        event.setStorageManager(storageManager);
+        storage = storageManager.selectStorage(event);
         handler = storage.getOrCreateWriteHandler(request);
       } catch (IOException ioe) {
         LOG.warn("Create fallback write handler failed ", ioe);
         return false;
       }
-      return warmStorageManager.write(storage, handler, event);
+      return storageManager.write(storage, handler, event);
     } else {
       return storageManager.write(storage, handler, event);
     }
   }
 
   private StorageManager selectStorageManager(ShuffleDataFlushEvent event) {
+    if (event.getStorageManager() != null) {

Review Comment:
   Good idea!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write
URL: https://github.com/apache/incubator-uniffle/pull/235


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#issuecomment-1294556876

   @zuston Gently ping.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1006506344


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   How about add two more strategies? 
   1. Just allow  fallback from local file to HDFS;
   2. Just allow  fallback from HDFS to local file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#issuecomment-1293486921

   Wait for CI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005624190


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -39,13 +42,18 @@ public class MultiStorageManager implements StorageManager {
   private final StorageManager warmStorageManager;
   private final StorageManager coldStorageManager;
   private final long flushColdStorageThresholdSize;
-  private final long fallBackTimes;
+  private AbstractStorageManagerFallbackStrategy storageManagerFallbackStrategy;
+  private Cache<ShuffleDataFlushEvent, StorageManager> storageManagerCache;

Review Comment:
   Is Cache thread safe?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005610651


##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... candidates) {
+    if (fallBackTimes > 0
+        && (event.getRetryTimes() < fallBackTimes || event.getRetryTimes() % fallBackTimes > 0)) {
+      return current;
+    }
+    int nextIdx = -1;
+    for (int i = 0; i < candidates.length; i++) {
+      if (current == candidates[i]) {
+        nextIdx = (i + 1) % candidates.length;

Review Comment:
   I think two loops is better understood and not easy to make mistakes, and there is no difference in the performance between the two ways.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r980700382


##########
server/src/main/java/org/apache/uniffle/server/ShuffleDataFlushEvent.java:
##########
@@ -90,6 +92,14 @@ public ShuffleBuffer getShuffleBuffer() {
     return shuffleBuffer;
   }
 
+  public StorageManager getStorageManager() {

Review Comment:
   > It's a little weird that ShuffleDataFlushEvent have a member field `StorageManager`.
   
   `selectStorageManager` will be called again In `org.apache.uniffle.server.storage.MultiStorageManager#updateWriteMetrics`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1035589227


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   After reviewing this part, I think this is a bit strange that it changes the external objects reference of `storage` and `handler`. This will make the external invoker like `ShuffleFlushManager` confused.
   
   @jerqi @xianjingfeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write
URL: https://github.com/apache/incubator-uniffle/pull/235


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1002833743


##########
server/src/main/java/org/apache/uniffle/server/storage/SingleStorageManager.java:
##########
@@ -113,6 +115,22 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
     }
   }
 
+
+  @Override
+  public boolean canWrite(ShuffleDataFlushEvent event) {
+    try {
+      Storage storage = selectStorage(event);
+      // if storage is null, appId may not be registered
+      if (storage == null || !storage.canWrite()) {
+        return false;
+      }
+      return true;
+    } catch (Exception e) {
+      LOG.warn("", e);

Review Comment:
   What do you mean? RuntimeException will be throw if  storage is corrupted in ` LocalStorageManager`,  I think we should catch it here, for other  StorageManager maybe can be writed.  Do you mean we should use LOG.error?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1002986068


##########
server/src/main/java/org/apache/uniffle/server/storage/AbstractStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public abstract class AbstractStorageManagerFallbackStrategy {
+  protected ShuffleServerConf conf;
+
+  public AbstractStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    this.conf = conf;
+  }
+  
+  public abstract StorageManager tryFallback(
+      StorageManager current, ShuffleDataFlushEvent event, StorageManager... options);

Review Comment:
   options -> candidates ?



##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   This original fallback strategy should also be involved in this PR. I dont know why you drop this support? 



##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... options) {

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1002978607


##########
server/src/main/java/org/apache/uniffle/server/storage/SingleStorageManager.java:
##########
@@ -113,6 +115,22 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
     }
   }
 
+
+  @Override
+  public boolean canWrite(ShuffleDataFlushEvent event) {
+    try {
+      Storage storage = selectStorage(event);
+      // if storage is null, appId may not be registered
+      if (storage == null || !storage.canWrite()) {
+        return false;
+      }
+      return true;
+    } catch (Exception e) {
+      LOG.warn("", e);

Review Comment:
   We shouldn't log empty message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005375750


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   > I think the original mechanism is for when hdfs can't be writed. And i think it is involved in this strategy. In this pr, if you want to write to hdfs as much as possible when hdfs temporarily can't be write, you can set `fallBackTimes` to a larger value, and it will be same as original mechanism when `fallBackTimes` is equal to `retryMax`
   
   I was wrong. In the original mechanism, if one event fail to write to hdfs, it will not try it again, i think it is not a good mechanism. should it preserved?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1003114409


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   Directly fast fail ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005446498


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {

Review Comment:
   > Because `ShuffleWriteHandler` is create by old `Storage`, we need to create a new handler with `request` after fallback. I found out a bug here just now.
   
   I found a bug in `org.apache.uniffle.storage.common.AbstractStorage#containsWriteHandler` and i will create another pr to solved



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1035592095


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   > After reviewing this part, I think this is a bit strange that change the external objects reference of `storage` and `handler`. This will make the external invoker like `ShuffleFlushManager` confused.
   > 
   > @jerqi @xianjingfeng
   
   We encapsulate the behavior. `ShuffleFlushManger` don't need care.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1035589227


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   After reviewing this part, I think this is a bit strange that change the external objects reference of `storage` and `handler`. This will make the external invoker like `ShuffleFlushManager` confused.
   
   @jerqi @xianjingfeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1035592495


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   If you have better solution, you can propose it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #235: [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1036863645


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +95,52 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
   public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
+              event.getAppId(), event.getShuffleId(), event.getStartPartition());
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);

Review Comment:
   I have two directions to improve this part of MultiStorageManager
   
   1. Remove the `storageManagerCache` , it looks unused and bring some problems. One possible problem I have seen is that, when one event enters into pending queue due to storage cannot write, it will not have a chance to get a new fallback strategy due to cache.
   2. Avoid changing the external reference object in `write` method. If we want to fallback write to other storage. We could use the invoking sequence like. (selectStorage -> write)  if failed and then (selectStorage -> write), instead of selectStorage -> write (failed) -> write(choose other storage in this method). That means we should change the fallback strategy invoked from `write` method to `selectStorage` method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r976386898


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -96,7 +96,15 @@ private StorageManager selectStorageManager(ShuffleDataFlushEvent event) {
     if (event.getSize() > flushColdStorageThresholdSize) {
       return coldStorageManager;
     } else {
-      return warmStorageManager;
+      try {

Review Comment:
   What will happen when both the cold storage manager and warm storage manager are broken? It seems to fall back many times.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write
URL: https://github.com/apache/incubator-uniffle/pull/235


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng closed pull request #235: Write to hdfs when local disk can't be write
URL: https://github.com/apache/incubator-uniffle/pull/235


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005420205


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {

Review Comment:
   Because `ShuffleWriteHandler` is create by old `Storage`, we need to create a new handler with `request` after fallback. I found out a bug here just now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005583122


##########
server/src/main/java/org/apache/uniffle/server/storage/DefaultStorageManagerFallbackStrategy.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.uniffle.server.storage;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.uniffle.server.ShuffleDataFlushEvent;
+import org.apache.uniffle.server.ShuffleServerConf;
+
+public class DefaultStorageManagerFallbackStrategy extends AbstractStorageManagerFallbackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(DefaultStorageManagerFallbackStrategy.class);
+  private final Long fallBackTimes;
+
+  public DefaultStorageManagerFallbackStrategy(ShuffleServerConf conf) {
+    super(conf);
+    fallBackTimes = conf.get(ShuffleServerConf.FALLBACK_MAX_FAIL_TIMES);
+  }
+
+  @Override
+  public StorageManager tryFallback(StorageManager current, ShuffleDataFlushEvent event, StorageManager... candidates) {
+    if (fallBackTimes > 0
+        && (event.getRetryTimes() < fallBackTimes || event.getRetryTimes() % fallBackTimes > 0)) {
+      return current;
+    }
+    int nextIdx = -1;
+    for (int i = 0; i < candidates.length; i++) {
+      if (current == candidates[i]) {
+        nextIdx = (i + 1) % candidates.length;

Review Comment:
   Could we merge these two loops  into one loop?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] xianjingfeng commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1006440579


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -39,13 +42,18 @@ public class MultiStorageManager implements StorageManager {
   private final StorageManager warmStorageManager;
   private final StorageManager coldStorageManager;
   private final long flushColdStorageThresholdSize;
-  private final long fallBackTimes;
+  private AbstractStorageManagerFallbackStrategy storageManagerFallbackStrategy;
+  private Cache<ShuffleDataFlushEvent, StorageManager> storageManagerCache;

Review Comment:
   Yes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1005167146


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -65,39 +73,51 @@ public Storage selectStorage(ShuffleDataReadEvent event) {
 
   @Override
   public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
-    selectStorageManager(event).updateWriteMetrics(event, writeTime);
+    throw new UnsupportedOperationException();
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {
+    if (event.getRetryTimes() > 0) {
       try {
-        CreateShuffleWriteHandlerRequest request = storage.getCreateWriterHandlerRequest(
-            event.getAppId(),
-            event.getShuffleId(),
-            event.getStartPartition());
-        if (request == null) {
-          return false;
+        StorageManager newStorageManager = storageManagerFallbackStrategy.tryFallback(
+                storageManager, event, warmStorageManager, coldStorageManager);
+        if (newStorageManager != storageManager) {
+          storageManager = newStorageManager;
+          storageManagerCache.put(event, storageManager);
+          storage = storageManager.selectStorage(event);
+          handler = storage.getOrCreateWriteHandler(request);
         }
-        storage = warmStorageManager.selectStorage(event);
-        handler = storage.getOrCreateWriteHandler(request);
       } catch (IOException ioe) {
         LOG.warn("Create fallback write handler failed ", ioe);
-        return false;
       }
-      return warmStorageManager.write(storage, handler, event);
-    } else {
-      return storageManager.write(storage, handler, event);
     }
+    boolean success = storageManager.write(storage, handler, event, request);

Review Comment:
   Whether do we remove storageManagerCache if we can update metrics here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] jerqi commented on a diff in pull request #235: Write to hdfs when local disk can't be write

Posted by GitBox <gi...@apache.org>.
jerqi commented on code in PR #235:
URL: https://github.com/apache/incubator-uniffle/pull/235#discussion_r1006507657


##########
server/src/main/java/org/apache/uniffle/server/storage/MultiStorageManager.java:
##########
@@ -69,35 +71,44 @@ public void updateWriteMetrics(ShuffleDataFlushEvent event, long writeTime) {
   }
 
   @Override
-  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event) {
+  public boolean write(Storage storage, ShuffleWriteHandler handler, ShuffleDataFlushEvent event, 
+                       CreateShuffleWriteHandlerRequest request) {
     StorageManager storageManager = selectStorageManager(event);
-    if (storageManager == coldStorageManager && event.getRetryTimes() > fallBackTimes) {

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org