You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@uniffle.apache.org by GitBox <gi...@apache.org> on 2023/01/17 03:58:11 UTC

[GitHub] [incubator-uniffle] zuston opened a new pull request, #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

zuston opened a new pull request, #494:
URL: https://github.com/apache/incubator-uniffle/pull/494

   ### What changes were proposed in this pull request?
   Introduce more metrics about huge partition  
   __counter__  
   1. total_require_buffer_failed_for_huge_partition
   2. total_require_buffer_failed_for_regular_partition
   3. total_app_num
   4. total_app_with_huge_partition_num
   5. total_partition_num
   6. total_huge_partition_num
   
   __Gauge__  
   1. huge_partition_num 
   2. app_with_huge_partition_num
   
   ### Why are the changes needed?
   Having these metrics, we should observe the concrete influence from huge partition for regular partition and huge partition number in one shuffle-server
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   1. UTs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080767389


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !hugePartitionTags.isEmpty();
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080859913


##########
server/src/test/java/org/apache/uniffle/server/ShuffleTaskInfoTest.java:
##########
@@ -17,15 +17,80 @@
 
 package org.apache.uniffle.server;
 
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.CyclicBarrier;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.IntStream;
+
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class ShuffleTaskInfoTest {
 
+  @BeforeEach
+  public void setup() {
+    ShuffleServerMetrics.register();
+  }
+
+  @AfterEach
+  public void tearDown() {
+    ShuffleServerMetrics.clear();
+  }
+
+  @Test
+  public void hugePartitionConcurrentTest() throws InterruptedException {
+    ShuffleTaskInfo shuffleTaskInfo = new ShuffleTaskInfo("hugePartitionConcurrentTest_appId");
+
+    int n = 10;
+    final CyclicBarrier barrier = new CyclicBarrier(n);
+    final CountDownLatch countDownLatch = new CountDownLatch(n);
+    ExecutorService executorService = Executors.newFixedThreadPool(n);
+    IntStream.range(0, n).forEach(i -> executorService.submit(() -> {
+      try {
+        barrier.await();
+        shuffleTaskInfo.markHugePartition(i, i);

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080755756


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return existHugePartition;
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {
+      synchronized (this) {
+        if (!existHugePartition) {
+          ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();
+          ShuffleServerMetrics.counterTotalAppWithHugePartitionNum.inc();
+          existHugePartition = true;
+        }
+      }
+    }
+
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> Maps.newConcurrentMap());
+
+    hugePartitionTags.get(shuffleId).computeIfAbsent(partitionId, key -> {

Review Comment:
   > You can leverage set's  boolean add(E e); method's return boolean to update metrics conditionally
   
   Got it. ConcurrentHashSet is adapted for this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073298137


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   > Yes, this class name is a litter bit misleading. Let's rename it in another PR? Do you want to push this?
   
   Of course let's do it in another issue/pr, and we don't need to rush it. A first step to add more comment about this class should be good to go.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073331201


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   I will review it tonight. Overall it looks good to me despite the shuffleTaskInfos issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072017588


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   This is a counter metric.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072022026


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {
+      return 0;
+    }
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();

Review Comment:
   Nice catch. You are right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073244154


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   > One appId could had multiple shuffleTaskInfos, right?
   
   One appId only has one shuffleTaskInfo.
   
   > Once shuffleBufferManager detects a huge partition, it cloud increase the app_num_with_huge_partition metrics if the appId is not already added to hasSeenAppIdSet.
   
   How to maintain `hasSeenAppIdSet`'s lifecycle? Should one element of one appId in  `hasSeenAppIdSet` keep consistent with its `shuffleTaskInfo` lifecycle? If yes, it also will be removed due to heartbeat timeout. Otherwise, we have to maintain `hasSeenAppIdSet` lifecycle independently, this is too complex and unreasonable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073303230


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   Yeah.. I'm ok to accept lose of accuracy for metrics.
   
   I think you can leave it as it is, but open a new issue for this kind of case. It might be inevitable to add `appIdHasSeen` in the end. But let's defer that effot



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#issuecomment-1396518762

   LGTM, pending CI passes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080759227


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !hugePartitionTags.isEmpty();
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {

Review Comment:
   `CompareAndSet` will be better. Right? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#issuecomment-1396488098

   Generally lgtm, except two minor comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072115369


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   Did a overview of the current code. I think this's the better place to update related metrics here.
   Once shuffleBufferManager detects a huge partition, it cloud increase the app_num_with_huge_partition metrics if the appId is not already added to `hasSeenAppIdSet`.
   
   `public void markHugePartition(int shuffleId, int partitionId) {`
   could be change its signature to 
   `public boolean markHugePartition(int shuffleId, int partitionId) {`
   which returns whether the current partition has already been marked or not. The logging and metrics inc should be updated here.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !hugePartitionTags.isEmpty();
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {

Review Comment:
   To be honest, this is quite ugly..😂😇



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   Therefore, the appId in `ShuffleTaskInfo` could be removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073250262


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > My whole point is that shuffleTaskInfos cannot be fully trusted to deduplicate appId.
   
   Yes, the problem you mentioned is a corner case. But for metric calculating, I don't think we should make too much effort and introduce too much complexity to improve accuracy for 0.1 percent possibility.
   
   For this case, I could delete this metric of total_app_num, leaving someone having better solution to support



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072038093


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] codecov-commenter commented on pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#issuecomment-1384811192

   # [Codecov](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#494](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (14ff01e) into [master](https://codecov.io/gh/apache/incubator-uniffle/commit/849993c26d06c422f38f0bb242af00fb2e341290?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (849993c) will **decrease** coverage by `0.50%`.
   > The diff coverage is `90.74%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #494      +/-   ##
   ============================================
   - Coverage     58.83%   58.33%   -0.51%     
   + Complexity     1707     1502     -205     
   ============================================
     Files           206      184      -22     
     Lines         11508     9716    -1792     
     Branches       1030      875     -155     
   ============================================
   - Hits           6771     5668    -1103     
   + Misses         4323     3675     -648     
   + Partials        414      373      -41     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [.../org/apache/uniffle/server/ShuffleTaskManager.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9TaHVmZmxlVGFza01hbmFnZXIuamF2YQ==) | `76.76% <85.71%> (+0.16%)` | :arrow_up: |
   | [...ava/org/apache/uniffle/server/ShuffleTaskInfo.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9TaHVmZmxlVGFza0luZm8uamF2YQ==) | `96.29% <90.47%> (-3.71%)` | :arrow_down: |
   | [...rg/apache/uniffle/server/ShuffleServerMetrics.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9TaHVmZmxlU2VydmVyTWV0cmljcy5qYXZh) | `97.32% <100.00%> (+0.26%)` | :arrow_up: |
   | [...he/uniffle/server/buffer/ShuffleBufferManager.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS91bmlmZmxlL3NlcnZlci9idWZmZXIvU2h1ZmZsZUJ1ZmZlck1hbmFnZXIuamF2YQ==) | `83.33% <100.00%> (+0.11%)` | :arrow_up: |
   | [.../hadoop/mapreduce/task/reduce/RssBypassWriter.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y2xpZW50LW1yL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3AvbWFwcmVkdWNlL3Rhc2svcmVkdWNlL1Jzc0J5cGFzc1dyaXRlci5qYXZh) | | |
   | [...g/apache/hadoop/mapred/SortWriteBufferManager.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y2xpZW50LW1yL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3AvbWFwcmVkL1NvcnRXcml0ZUJ1ZmZlck1hbmFnZXIuamF2YQ==) | | |
   | [...pache/spark/shuffle/writer/WriteBufferManager.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y2xpZW50LXNwYXJrL2NvbW1vbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvc3Bhcmsvc2h1ZmZsZS93cml0ZXIvV3JpdGVCdWZmZXJNYW5hZ2VyLmphdmE=) | | |
   | [...ava/org/apache/spark/shuffle/RssShuffleHandle.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y2xpZW50LXNwYXJrL2NvbW1vbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvc3Bhcmsvc2h1ZmZsZS9Sc3NTaHVmZmxlSGFuZGxlLmphdmE=) | | |
   | [.../java/org/apache/hadoop/mapreduce/RssMRConfig.java](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Y2xpZW50LW1yL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9oYWRvb3AvbWFwcmVkdWNlL1Jzc01SQ29uZmlnLmphdmE=) | | |
   | ... and [17 more](https://codecov.io/gh/apache/incubator-uniffle/pull/494?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston merged pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by "zuston (via GitHub)" <gi...@apache.org>.
zuston merged PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072037812


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -52,15 +58,18 @@ public class ShuffleTaskInfo {
    * shuffleId -> partitionId -> partition shuffle data size
    */
   private Map<Integer, Map<Integer, Long>> partitionDataSizes;
+  private Map<Integer, Set<Integer>> hugePartitionTags;

Review Comment:
   Updated.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {
+      return 0;
+    }
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();

Review Comment:
   Done



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072037607


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {
+      return 0;
+    }
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();
+      ShuffleServerMetrics.counterTotalAppWithHugePartitionNum.inc();
+      return Sets.newConcurrentHashSet();
+    });
+    Set<Integer> partitions = hugePartitionTags.get(shuffleId);
+    if (partitions.contains(partitionId)) {
+      return;
+    }
+    partitions.add(partitionId);
+    ShuffleServerMetrics.counterTotalHugePartitionNum.inc();

Review Comment:
   Updated. PTAL



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072218088


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   If it is removed and then registered, that means the data will be lost, this is a critical bug.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073276149


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   > My fault. The class name is kind of misleading, I used to think this class corresponding to one shuffle process/task, mapping to one shuffleId in spark app.
   
   Yes, this class name is a litter bit misleading. Let's rename it in another PR? Do you want to push this?
   
   > But that may be a little bit of complex..
   
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#issuecomment-1397805494

   PTAL @advancedxy . CI passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072019450


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -52,15 +58,18 @@ public class ShuffleTaskInfo {
    * shuffleId -> partitionId -> partition shuffle data size
    */
   private Map<Integer, Map<Integer, Long>> partitionDataSizes;
+  private Map<Integer, Set<Integer>> hugePartitionTags;

Review Comment:
   This data structure is to record the huge partitions to avoid calculating duplicate metrics



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072016260


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);

Review Comment:
   Got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072085066


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > This might be inaccurate
   
   This problem is not addressed.
   `refreshAppId` could be called by multiple calls. For example, a spark app finishes an stage(with shuffle), then some local computing then submit new stage(with shuffle). The shuffle server will account this app multiple times.
   
   I can think of two proposals to fix this problem:
   1. Modify the `ShuffleRegisterRequest` in proto to include a `firstShuffle` boolean field to indicate this is the first shuffle register, mapping to the app.
   2. Shuffle server maintains another data structure to maintain which appIds has been seen. But it's not evicted until the configured size is reached, let's say 100_000?
   Both two proposals require the metric inc should be happened in `registerShuffle` call.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > This might be inaccurate
   
   This problem is not addressed.
   `refreshAppId` could be called by multiple calls. For example, a spark app finishes an stage(with shuffle), then some local computing then submit new stage(with shuffle). The shuffle server will account this app multiple times.
   
   I can think of two proposals to fix this problem:
   1. Modify the `ShuffleRegisterRequest` in proto to include a `firstShuffle` boolean field to indicate this is the first shuffle register, mapping to the app.
   2. Shuffle server maintains another data structure to maintain which appIds has been seen. But it's not evicted until the configured size is reached, let's say 100_000?
   3. 
   Both two proposals require the metric inc should be happened in `registerShuffle` call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080845427


##########
server/src/test/java/org/apache/uniffle/server/ShuffleTaskInfoTest.java:
##########
@@ -17,15 +17,80 @@
 
 package org.apache.uniffle.server;
 
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.CyclicBarrier;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.IntStream;
+
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class ShuffleTaskInfoTest {
 
+  @BeforeEach
+  public void setup() {
+    ShuffleServerMetrics.register();
+  }
+
+  @AfterEach
+  public void tearDown() {
+    ShuffleServerMetrics.clear();
+  }
+
+  @Test
+  public void hugePartitionConcurrentTest() throws InterruptedException {
+    ShuffleTaskInfo shuffleTaskInfo = new ShuffleTaskInfo("hugePartitionConcurrentTest_appId");
+
+    int n = 10;
+    final CyclicBarrier barrier = new CyclicBarrier(n);
+    final CountDownLatch countDownLatch = new CountDownLatch(n);
+    ExecutorService executorService = Executors.newFixedThreadPool(n);
+    IntStream.range(0, n).forEach(i -> executorService.submit(() -> {
+      try {
+        barrier.await();
+        shuffleTaskInfo.markHugePartition(i, i);

Review Comment:
   is it possible to mark the same huge partition multiple times?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072220718


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !hugePartitionTags.isEmpty();
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {

Review Comment:
   Emm… Maybe it is. 🥲



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072016037


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {

Review Comment:
   Got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072085066


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > This might be inaccurate
   
   This problem is not addressed.
   `refreshAppId` could be called by multiple calls. For example, a spark app finishes an stage(with shuffle), then some local computing then submit new stage(with shuffle). The shuffle server will account this app multiple times.
   
   I can think of two proposals to fix this problem:
   1. Modify the `ShuffleRegisterRequest` in proto to include a `firstShuffle` boolean field to indicate this is the first shuffle register, mapping to the app.
   2. Shuffle server maintains another data structure to maintain which appIds has been seen. But it's not evicted until the configured size is reached, let's say 100_000?
   
   Both two proposals require the metric inc should be happened in `registerShuffle` call.
   
   The first one should fix the problem permanently, but require bigger effort.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073273322


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   > One appId only has one shuffleTaskInfo.
   
   My fault. The class name is kind of misleading, I used to think this class corresponding to one shuffle process/task, mapping to one shuffleId in spark app.
   
   > How to maintain `hasSeenAppIdSet`'s lifecycle?
   
   If we are going to this way. `hasSeenAppIdSet` would be a LRU with maximum size(let's say 10W) set. It should be independent with `shuffleTaskInfos`.
   
   But that may be a little bit complex..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073273322


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   > One appId only has one shuffleTaskInfo.
   
   My fault. The class name is kind of misleading, I used to think this class corresponding to one shuffle process/task, mapping to one shuffleId in spark app.
   
   > How to maintain `hasSeenAppIdSet`'s lifecycle?
   
   If we are going to this way. `hasSeenAppIdSet` would be a LRU with maximum size(let's say 10W) set. It should be independent with `shuffleTaskInfos`.
   
   But that may be a little bit of complex..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1071969469


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {

Review Comment:
   hugePartitionTags couldn't be null, right?



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {
+      return 0;
+    }
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();

Review Comment:
   If I understand the name correctly, `gaugeAppWithHugePartitionNum` should be the app number with huge partitions.
   
   However, this increase this gauge once a huge partition is occurred in one shuffle. There might be multiple shuffles and each has huge partitions.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   This might be inaccurate. And we already has an org.apache.uniffle.server.ShuffleServerMetrics#gaugeAppNum 
   in metrics?



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {

Review Comment:
   According to calling method, this should be `getHugePartitionNum`.
   The size is ambiguous as it also could refer to the actual partition data size.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);

Review Comment:
   I prefer `!hugePartitionTags.empty()`



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -52,15 +58,18 @@ public class ShuffleTaskInfo {
    * shuffleId -> partitionId -> partition shuffle data size
    */
   private Map<Integer, Map<Integer, Long>> partitionDataSizes;
+  private Map<Integer, Set<Integer>> hugePartitionTags;

Review Comment:
   Some doc or comment here?
   
   Also, is it possible to reuse the partitionDataSizes to calculate the huge partitions?
   This is in shuffle server, I'm very concerned about the memory consumption of related metadata.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +133,30 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !(hugePartitionTags.size() == 0);
+  }
+
+  public int getHugePartitionSize() {
+    if (hugePartitionTags == null) {
+      return 0;
+    }
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();
+      ShuffleServerMetrics.counterTotalAppWithHugePartitionNum.inc();
+      return Sets.newConcurrentHashSet();
+    });
+    Set<Integer> partitions = hugePartitionTags.get(shuffleId);
+    if (partitions.contains(partitionId)) {
+      return;
+    }
+    partitions.add(partitionId);
+    ShuffleServerMetrics.counterTotalHugePartitionNum.inc();

Review Comment:
   is is quite possible that same partition id is marked multiple times?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072298459


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   no heartbeat could be a client problem..Such as the driver finishes one stage, then it full gced for a long time for its own compute logic..
   
   My whole point is that shuffleTaskInfos cannot be fully trusted to deduplicate appId.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072224949


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   > I think this's the better place to update related metrics here.
   
   Sounds good. Let me take a look tomorrow 
   
   
   > Once shuffleBufferManager detects a huge partition, it cloud increase the app_num_with_huge_partition metrics if the appId is not already added to hasSeenAppIdSet.
   
   Actually I don’t hope we maintain the  
   hasSeenAppIdSet data structure in shuffleBufferManager. This info bound to shuffleTaskInfo will be better ? WDYT



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080761144


##########
server/src/test/java/org/apache/uniffle/server/ShuffleTaskInfoTest.java:
##########
@@ -17,15 +17,47 @@
 
 package org.apache.uniffle.server;
 
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Test;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class ShuffleTaskInfoTest {
 
+  @BeforeAll
+  public static void setup() {
+    ShuffleServerMetrics.register();
+  }
+
+  @AfterAll
+  public static void tearDown() {
+    ShuffleServerMetrics.clear();
+  }
+
+  @Test
+  public void hugePartitionTest() {

Review Comment:
   Yes. I will
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080767249


##########
server/src/test/java/org/apache/uniffle/server/ShuffleTaskInfoTest.java:
##########
@@ -17,15 +17,47 @@
 
 package org.apache.uniffle.server;
 
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Test;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class ShuffleTaskInfoTest {
 
+  @BeforeAll
+  public static void setup() {
+    ShuffleServerMetrics.register();
+  }
+
+  @AfterAll
+  public static void tearDown() {
+    ShuffleServerMetrics.clear();
+  }
+
+  @Test
+  public void hugePartitionTest() {

Review Comment:
   Done



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return existHugePartition;
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {
+      synchronized (this) {
+        if (!existHugePartition) {
+          ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();
+          ShuffleServerMetrics.counterTotalAppWithHugePartitionNum.inc();
+          existHugePartition = true;
+        }
+      }
+    }
+
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> Maps.newConcurrentMap());
+
+    hugePartitionTags.get(shuffleId).computeIfAbsent(partitionId, key -> {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072305338


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   The current impl doesn’t handle appId correctly?
   
   One appId could had multiple shuffleTaskInfos, right? If so, you cannot rely on shuffletaskinfo to deduplicate appId?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072299512


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > If it is removed and then registered, that means the data will be lost, this is a critical bug.
   
   The case I described, there’s no data lose. And mostly case, the app shall be failed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1073315341


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -322,16 +322,22 @@ public void updateCachedBlockIds(String appId, int shuffleId, int partitionId, S
         size += spb.getSize();
       }
     }
-    shuffleTaskInfo.addPartitionDataSize(
+    long partitionSize = shuffleTaskInfo.addPartitionDataSize(
         shuffleId,
         partitionId,
         size
     );
+    if (shuffleBufferManager.isHugePartition(partitionSize)) {

Review Comment:
   Do you have any ideas about current implementation? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072020335


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -52,15 +58,18 @@ public class ShuffleTaskInfo {
    * shuffleId -> partitionId -> partition shuffle data size
    */
   private Map<Integer, Map<Integer, Long>> partitionDataSizes;
+  private Map<Integer, Set<Integer>> hugePartitionTags;

Review Comment:
   And for us, the memory occupied is acceptable for us, the max concurrent partition size is 5k.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072120999


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + (System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > This operation is wrapped in computeIfAbsent, so multiple times calling won't cause duplicate calculating. Right?
   
   Mostly yes. But please keep in mind, the appId is removed by `expiredAppCleanupExecutorService` from shuffleTaskInfos if no hearbeat for over 60s(default value). So I believe some appId would be removed then registered again...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080750076


##########
server/src/test/java/org/apache/uniffle/server/ShuffleTaskInfoTest.java:
##########
@@ -17,15 +17,47 @@
 
 package org.apache.uniffle.server;
 
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Test;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class ShuffleTaskInfoTest {
 
+  @BeforeAll
+  public static void setup() {
+    ShuffleServerMetrics.register();
+  }
+
+  @AfterAll
+  public static void tearDown() {
+    ShuffleServerMetrics.clear();
+  }
+
+  @Test
+  public void hugePartitionTest() {

Review Comment:
   If it's not too hard or too much effort, tests with concurrent write/update are preferred.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return existHugePartition;
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {
+      synchronized (this) {
+        if (!existHugePartition) {
+          ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();
+          ShuffleServerMetrics.counterTotalAppWithHugePartitionNum.inc();
+          existHugePartition = true;
+        }
+      }
+    }
+
+    hugePartitionTags.computeIfAbsent(shuffleId, key -> Maps.newConcurrentMap());
+
+    hugePartitionTags.get(shuffleId).computeIfAbsent(partitionId, key -> {

Review Comment:
   I prefer a set here. Map<Integer, Boolean> has more memory overhead compared to Set<Integer>
   
   You can leverage set's ` boolean add(E e);` method's return boolean to update metrics conditionally



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +136,32 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return !hugePartitionTags.isEmpty();
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition) {

Review Comment:
   Could we use a `AtomicBoolean` and its `getAndSet` method to update app count atomically?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080842169


##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskInfo.java:
##########
@@ -124,4 +139,28 @@ public long getPartitionDataSize(int shuffleId, int partitionId) {
     return size;
   }
 
+  public boolean hasHugePartition() {
+    return existHugePartition.get();
+  }
+
+  public int getHugePartitionSize() {
+    return hugePartitionTags.values().stream().map(x -> x.size()).reduce((x, y) -> x + y).orElse(0);
+  }
+
+  public void markHugePartition(int shuffleId, int partitionId) {
+    if (!existHugePartition.get()) {

Review Comment:
   Previously, I was thinking:
   
   ```
   if (!existHugePartition.getAndSet(true)) {
           ShuffleServerMetrics.gaugeAppWithHugePartitionNum.inc();
           ShuffleServerMetrics.counterTotalAppWithHugePartitionNum.inc();
   }
   ```
   
   But maybe get is cheaper, and we should guide by it first.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org


[GitHub] [incubator-uniffle] zuston commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Posted by GitBox <gi...@apache.org>.
zuston commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1080857003


##########
server/src/test/java/org/apache/uniffle/server/ShuffleTaskInfoTest.java:
##########
@@ -17,15 +17,80 @@
 
 package org.apache.uniffle.server;
 
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.CyclicBarrier;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.IntStream;
+
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class ShuffleTaskInfoTest {
 
+  @BeforeEach
+  public void setup() {
+    ShuffleServerMetrics.register();
+  }
+
+  @AfterEach
+  public void tearDown() {
+    ShuffleServerMetrics.clear();
+  }
+
+  @Test
+  public void hugePartitionConcurrentTest() throws InterruptedException {
+    ShuffleTaskInfo shuffleTaskInfo = new ShuffleTaskInfo("hugePartitionConcurrentTest_appId");
+
+    int n = 10;
+    final CyclicBarrier barrier = new CyclicBarrier(n);
+    final CountDownLatch countDownLatch = new CountDownLatch(n);
+    ExecutorService executorService = Executors.newFixedThreadPool(n);
+    IntStream.range(0, n).forEach(i -> executorService.submit(() -> {
+      try {
+        barrier.await();
+        shuffleTaskInfo.markHugePartition(i, i);

Review Comment:
   Yes. Let me add more test cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org
For additional commands, e-mail: issues-help@uniffle.apache.org