You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/24 04:55:42 UTC

[GitHub] [spark] gengliangwang opened a new pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

gengliangwang opened a new pull request #34092:
URL: https://github.com/apache/spark/pull/34092


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Improve the perf and memory usage of cleaning up stage UI data. The new code make copy of the essential fields(stage id, attempt id, completion time) to an array and determine which stage data and `RDDOperationGraphWrapper` needs to be clean based on it
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   Fix the memory usage issue described in https://issues.apache.org/jira/browse/SPARK-36827
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Add new unit test for the InMemoryStore.
   Also, run a simple benchmark with 
   ```
       val testConf = conf.clone()
         .set(MAX_RETAINED_STAGES, 1000)
   
       val listener = new AppStatusListener(store, testConf, true)
       val stages = (1 to 3000).map { i =>
         new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1",
           resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
       }
       listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
         stages.foreach { s =>
           time +=1
           s.submissionTime = Some(time)
           listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties()))
           s.completionTime = Some(time)
           listener.onStageCompleted(SparkListenerStageCompleted(s))
         }
   ```
   
   Before changes:
   InMemoryStore: 2.8s
   LevelDB: 68.9s
   
   After changes:
   InMemoryStore: 0.95s
   LevelDB: 60.3s


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926396839


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48098/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926385613


   @mridulm Thanks for the review. This one is ready now.
   As I observe regression after applying the changes when using LevelDB, this PR is to optimize InMemoryStore only.
   Now the PR is safe enough to go into Spark 3.2 as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r716003560



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1294,6 +1330,22 @@ private[spark] class AppStatusListener(
       cleanupCachedQuantiles(key)
       key
     }
+  }
+
+  private def cleanupStages(count: Long): Unit = {
+    val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
+    if (countToDelete <= 0L) {
+      return
+    }
+
+    // SPARK-36827: For better performance and avoiding OOM, here we use a optimized method for
+    //              cleaning the StageDataWrapper and RDDOperationGraphWrapper data if Spark is
+    //              using InMemoryStore.
+    val stageIds = if (kvstore.usingInMemoryStore) {
+      cleanupStagesWithInMemoryStore(countToDelete)

Review comment:
       @srowen +1. I think we can have an enhanced version of in-memory `ElementTrackingStore`.
   I was trying to merge this one in 3.2.0 so this is a temporary solution.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715321250



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }
 
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
       I thought about keeping the original code for LevelDB here. But after investigation, I find that:
   The default retained stages size is 1000, so as per
   ```
     private def calculateNumberToRemove(dataSize: Long, retainedSize: Long): Long = {
       if (dataSize > retainedSize) {
         math.max(retainedSize / 10L, dataSize - retainedSize)
       } else {
         0L
       }
     }
   ```
   The `stages` here normally has a length of 100. Finding stage id inside LevelDB 100 times is not efficient, comparing to the new code. 
   So I decide to make it simple and use the same code for both InMemoryStore and LevelDB.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926573230


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143597/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-927024101


   @zhouyejoe I use the benchmark code in the PR description, which is runnable under `AppStatusListenerSuite`.
   I tried 3000 and 5000 stages the result "after changes" is always slightly slow. 
   When the number of stages reaches 1001, Spark will delete the oldest 100 stages UI data. I guess after changes reading 1001 stages from LevelDB is slow. Before changes Spark only reads 100 stages from LevelDB.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926461313


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48104/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926367597


   Did not notice the change was still WIP !
   But the fix looks good to me - though I am not sure if it is sufficient to fix the issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926412080


   > One option is to batch all the deletes together, instead of doing it one by one - which should make level db updates also faster ?
   
   Then the same optimization can be done in the `cleanupStagesInKVStore` as well and the minor perf difference is still there. We can try that in another PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926566430


   **[Test build #143597 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143597/testReport)** for PR 34092 at commit [`da9fe40`](https://github.com/apache/spark/commit/da9fe401e6d9002e766a13e22eff9e511bcf3df0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926502009


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48109/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926473057


   **[Test build #143597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143597/testReport)** for PR 34092 at commit [`da9fe40`](https://github.com/apache/spark/commit/da9fe401e6d9002e766a13e22eff9e511bcf3df0).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926540470


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48109/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715613205



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1294,6 +1330,22 @@ private[spark] class AppStatusListener(
       cleanupCachedQuantiles(key)
       key
     }
+  }
+
+  private def cleanupStages(count: Long): Unit = {
+    val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
+    if (countToDelete <= 0L) {
+      return
+    }
+
+    // SPARK-36827: For better performance and avoiding OOM, here we use a optimized method for
+    //              cleaning the StageDataWrapper and RDDOperationGraphWrapper data if Spark is
+    //              using InMemoryStore.
+    val stageIds = if (kvstore.usingInMemoryStore) {
+      cleanupStagesWithInMemoryStore(countToDelete)

Review comment:
       +1 for @srowen 's comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926419740


   **[Test build #143587 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143587/testReport)** for PR 34092 at commit [`b270d2b`](https://github.com/apache/spark/commit/b270d2b6334b5373265735f3a86faf51df015ccc).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm edited a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
mridulm edited a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926406914


   I am guessing the minor perf difference for level db would be the sort being done after pulling all data vs pushing the sort to the store.
   One option is to batch all the deletes together, instead of doing it one by one - which should make level db updates also faster ?
   
   For reference, there is a `LevelDB.writeAll` that was added - something similar for delete should help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715646989



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1294,6 +1330,22 @@ private[spark] class AppStatusListener(
       cleanupCachedQuantiles(key)
       key
     }
+  }
+
+  private def cleanupStages(count: Long): Unit = {
+    val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
+    if (countToDelete <= 0L) {
+      return
+    }
+
+    // SPARK-36827: For better performance and avoiding OOM, here we use a optimized method for
+    //              cleaning the StageDataWrapper and RDDOperationGraphWrapper data if Spark is
+    //              using InMemoryStore.
+    val stageIds = if (kvstore.usingInMemoryStore) {
+      cleanupStagesWithInMemoryStore(countToDelete)

Review comment:
       maybe we need a more high-level API in the kv store to clean up stages?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926375480


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48098/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926368948


   @mridulm there is something wrong with the benchmark. I rerun the updated benchmark and find that after changes the LevelDB is slower. 
   I am considering making the code changes for InMemoryStore only. So I mark it as WIP.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] taroplus commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
taroplus commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926349872


   @gengliangwang sounds good, thanks !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926406914


   I am guessing the minor perf difference for level db would be the sort being done after pulling all data vs pushing the sort to the store.
   One option is to batch all the deletes together, instead of doing it one by one - which should make level db updates also faster ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926396888


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48098/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926397098


   **[Test build #143592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143592/testReport)** for PR 34092 at commit [`fa7a7ff`](https://github.com/apache/spark/commit/fa7a7ff32a874d5d228ab98364889635c92db16c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926469314


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48104/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926469314


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48104/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926473057


   **[Test build #143597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143597/testReport)** for PR 34092 at commit [`da9fe40`](https://github.com/apache/spark/commit/da9fe401e6d9002e766a13e22eff9e511bcf3df0).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926507138


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143592/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926422744


   @mridulm Yes. 
   At first the benchmark of job start:
   ```
       listener.onJobStart(SparkListenerJobStart(4, time, stages, null))
   ```
   
   Later on, I change the stage info inside to `Nil`
   ```
       listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
   ```
   and the difference for LevelDB is
   before changes: 2.5s
   after changes: 3s


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926431167


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48104/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926507138


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143592/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926500916


   **[Test build #143592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143592/testReport)** for PR 34092 at commit [`fa7a7ff`](https://github.com/apache/spark/commit/fa7a7ff32a874d5d228ab98364889635c92db16c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a change in pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715321250



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }
 
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
       I thought about keeping the original code for LevelDB here. But after investigation, I find that:
   The default retained stages size is 1000, so as per
   ```
     private def calculateNumberToRemove(dataSize: Long, retainedSize: Long): Long = {
       if (dataSize > retainedSize) {
         math.max(retainedSize / 10L, dataSize - retainedSize)
       } else {
         0L
       }
     }
   ```
   The `stages` here normally has a length of 100. Finding stage id inside LevelDB 100 times is not efficient, comparing to the new code. 
   So I decide to make it simple and use the same code for both InMemoryStore and LevelDB.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715571536



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1294,6 +1330,22 @@ private[spark] class AppStatusListener(
       cleanupCachedQuantiles(key)
       key
     }
+  }
+
+  private def cleanupStages(count: Long): Unit = {
+    val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
+    if (countToDelete <= 0L) {
+      return
+    }
+
+    // SPARK-36827: For better performance and avoiding OOM, here we use a optimized method for
+    //              cleaning the StageDataWrapper and RDDOperationGraphWrapper data if Spark is
+    //              using InMemoryStore.
+    val stageIds = if (kvstore.usingInMemoryStore) {
+      cleanupStagesWithInMemoryStore(countToDelete)

Review comment:
       Could this be pushed down to the in memory store? vs special casing it here. This doesn't seem very object-oriented




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926397098


   **[Test build #143592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143592/testReport)** for PR 34092 at commit [`fa7a7ff`](https://github.com/apache/spark/commit/fa7a7ff32a874d5d228ab98364889635c92db16c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on a change in pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715332397



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }
 
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
       I am trying to understand the last part - what is the difference w.r.t new code for finding stage id ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926349734


   **[Test build #143587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143587/testReport)** for PR 34092 at commit [`b270d2b`](https://github.com/apache/spark/commit/b270d2b6334b5373265735f3a86faf51df015ccc).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhouyejoe commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
zhouyejoe commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-927022418


   @gengliangwang 
   QQ regarding the regression test:
   
   > the difference for LevelDB is
   > before changes: 2.5s
   > after changes: 3s
   
   How many stages within your test log file for SHS? Or will this performance regression get amplified if there are more stages within the application?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926633247


   +1, LGTM, too. Thank you, @gengliangwang and all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926573230


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143597/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715616227



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1294,6 +1330,22 @@ private[spark] class AppStatusListener(
       cleanupCachedQuantiles(key)
       key
     }
+  }
+
+  private def cleanupStages(count: Long): Unit = {
+    val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
+    if (countToDelete <= 0L) {
+      return
+    }
+
+    // SPARK-36827: For better performance and avoiding OOM, here we use a optimized method for

Review comment:
       I believe we can remove `SPARK-36827: ` and the below indentations inside this comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926363391


   +CC @zhouyejoe, @thejdeep 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #34092:
URL: https://github.com/apache/spark/pull/34092


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926413763


   Btw, can you share the details of the regression ?
   What I had seen earlier was:
   ```
   Before changes:
   InMemoryStore: 2.8s
   LevelDB: 68.9s
   
   After changes:
   InMemoryStore: 0.95s
   LevelDB: 60.3s
   ```
   
   That looked fine to me - did we observe a degradation in timing results after ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926430904


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143587/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926527507


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48109/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715428573



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1253,12 +1254,47 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
-  private def cleanupStages(count: Long): Unit = {
-    val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
-    if (countToDelete <= 0L) {
-      return
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
+  private def cleanupStagesWithInMemoryStore(countToDelete: Long): Seq[Array[Int]] = {
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>

Review comment:
       I think `KVUtils.viewToSeq(view, countToDelete.toInt)` is better and cleaner if the in-memory store implements it with top-n algorithm. We can do it in followups.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926540470


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48109/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926349734


   **[Test build #143587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143587/testReport)** for PR 34092 at commit [`b270d2b`](https://github.com/apache/spark/commit/b270d2b6334b5373265735f3a86faf51df015ccc).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926347002


   @taroplus I was working on this yesterday. I didn't send it out because I think we can do better if we build a live priority queue in `AppStatusListener` so that Spark doesn't need to pull all the stage data out on every cleaning up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715434302



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1253,12 +1254,47 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
-  private def cleanupStages(count: Long): Unit = {
-    val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
-    if (countToDelete <= 0L) {
-      return
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
+  private def cleanupStagesWithInMemoryStore(countToDelete: Long): Seq[Array[Int]] = {
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }

Review comment:
       this's only needed for completed stages?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926396888


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48098/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926430904


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143587/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on a change in pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715332652



##########
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##########
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }
 
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
       Scratch that - did not see the attempt iteration - makes sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926481040


   @taroplus @mridulm @Ngone51 @cloud-fan Thanks for the review.
   Merging to master/3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org