Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/02 05:54:23 UTC

[GitHub] [spark] tanvn opened a new pull request, #36424: SPARK-39083 : Fix race condition between update and clean app data

tanvn opened a new pull request, #36424:
URL: https://github.com/apache/spark/pull/36424

   ### What changes were proposed in this pull request?
   Make `cleanAppData` atomic to prevent a race condition between updating and cleaning app data.
   When the race condition happens, the `ApplicationInfoWrapper` entry for an application is deleted right after it has been updated by `mergeApplicationListing`.
   As a result, there are cases where the HS Web UI displays `Application not found` for applications whose logs do exist.
   
   #### Background
   Currently, the HS runs `checkForLogs` every 10 seconds by default to build the application list based on the current contents of the log directory.
   - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L296-L299
   - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L472
   
   In each run, this method scans the specified log directory and parses the log files to update its KVStore:
   - detects newly updated/added files to process: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L574-L578
   - detects stale data to remove: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600
   These two operations are executed in different threads, as `submitLogProcessTask` uses `replayExecutor` to submit the tasks.
   https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1389-L1401
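   As a hedged illustration of the concurrency here (the class, pool, and counter names below are invented for the sketch, not Spark's actual code): submitting two independent tasks to a thread pool, as `submitLogProcessTask` does with `replayExecutor`, gives no ordering guarantee between them, so the log-update work and the stale-data cleanup can interleave.

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicInteger;

   public class ReplayExecutorSketch {
       // Submits two independent tasks, the way checkForLogs hands work to its
       // replayExecutor; the pool runs them on different threads with no mutual ordering.
       static int runBoth() throws InterruptedException {
           ExecutorService replayExecutor = Executors.newFixedThreadPool(2); // stand-in pool
           AtomicInteger finished = new AtomicInteger();

           replayExecutor.submit(() -> { finished.incrementAndGet(); }); // "process updated/added logs"
           replayExecutor.submit(() -> { finished.incrementAndGet(); }); // "remove stale data"

           replayExecutor.shutdown();
           replayExecutor.awaitTermination(5, TimeUnit.SECONDS);
           return finished.get();
       }

       public static void main(String[] args) throws InterruptedException {
           System.out.println(runBoth()); // prints 2: both tasks ran, possibly concurrently
       }
   }
   ```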
   
   ### When does the bug happen?
   `Application not found` error happens in the following scenario:
   In the first run of `checkForLogs`, it detects a newly added log `viewfs://iu/log/spark3/AAA_1.inprogress` (the log of an in-progress application named AAA), so it adds 2 entries to the KVStore:
   - one entry whose key is the log path (`viewfs://iu/log/spark3/AAA_1.inprogress`) and whose value is a `LogInfo` instance representing the log
     - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L495-L505
     - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L545-L552
   - one entry whose key is the applicationId (`AAA`) and whose value is an `ApplicationInfoWrapper` instance holding the information of the application
     - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L825
     - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172
   
   In the next run of `checkForLogs`, the AAA application has finished: the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted and a new log `viewfs://iu/log/spark3/AAA_1` has been created. So `checkForLogs` does the following 2 things in 2 different threads:
   - Thread 1: parses the new log `viewfs://iu/log/spark3/AAA_1` and updates the data in its KVStore
     - adds a new entry with key `viewfs://iu/log/spark3/AAA_1` and value: a `LogInfo` instance representing the log
     - updates the entry with key=applicationId (`AAA`) with a new `ApplicationInfoWrapper` instance (for example, the isCompleted flag now changes from false to true)
   - Thread 2: the data related to `viewfs://iu/log/spark3/AAA_1.inprogress` is now considered stale and must be deleted
     - cleans the app data for `viewfs://iu/log/spark3/AAA_1.inprogress`: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600
     - Inside `cleanAppData`, it first loads the latest `ApplicationInfoWrapper` from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L632
       Most of the time, by the point this line is executed, Thread 1 has already finished updating the entry with key=applicationId (`AAA`) with the new `ApplicationInfoWrapper` instance, so the condition at https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L637 evaluates to false and `isStale` is false.
       However, in some rare cases, Thread 1 has not finished the update yet, so the old `ApplicationInfoWrapper` data is loaded, `isStale` becomes true, and the `ApplicationInfoWrapper` entry is deleted from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L656-L662
       Worst of all, the deletion can happen right after Thread 1 has finished updating the entry with key=applicationId (`AAA`), so the `ApplicationInfoWrapper` entry for applicationId=`AAA` is removed for good. When users then access the Web UI for this application, `Application not found` is shown even though the log for the app does exist.
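   The bad interleaving above can be forced deterministically in a minimal sketch (the map, class, and method names are invented stand-ins for the KVStore, not Spark's actual API): the cleaner loads a stale view, the updater writes the fresh entry, and the cleaner then deletes it anyway.

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.CountDownLatch;

   public class StaleCleanRace {
       // KVStore stand-in: appId -> log path backing the listed attempt.
       static final Map<String, String> store = new ConcurrentHashMap<>();

       // Forces the bad interleaving with latches and reports whether the app survived.
       static boolean appSurvives() throws InterruptedException {
           store.clear();
           store.put("AAA", "AAA_1.inprogress");           // state after the first checkForLogs run

           CountDownLatch loaded  = new CountDownLatch(1); // cleaner has read the (old) entry
           CountDownLatch updated = new CountDownLatch(1); // updater has written the fresh entry

           Thread cleaner = new Thread(() -> {             // Thread 2: the cleanup
               String seen = store.get("AAA");             // load(appId): still the old view
               loaded.countDown();
               awaitQuietly(updated);                      // the update lands in between
               if (seen.endsWith(".inprogress")) {         // "isStale", judged on stale data
                   store.remove("AAA");                    // deletes the entry Thread 1 just wrote
               }
           });
           Thread updater = new Thread(() -> {             // Thread 1: the listing update
               awaitQuietly(loaded);
               store.put("AAA", "AAA_1");                  // fresh entry for the finished app
               updated.countDown();
           });

           cleaner.start(); updater.start();
           cleaner.join();  updater.join();
           return store.containsKey("AAA");
       }

       static void awaitQuietly(CountDownLatch l) {
           try { l.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
       }

       public static void main(String[] args) throws InterruptedException {
           System.out.println(appSurvives());              // prints false: "Application not found"
       }
   }
   ```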
   
   So here we make the `cleanAppData` method atomic, just like the `addListing` method https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172, so that:
   - If Thread 1 gets the lock on `listing` before Thread 2, `isStale` will be false and the app will not be removed from the KVStore.
   - If Thread 2 gets the lock on `listing` before Thread 1, `isStale` will be true and the app will be removed from the KVStore, but it will be added again by Thread 1 afterwards.
   In both cases, the entry for the application is not deleted unexpectedly from the KVStore.
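   A sketch of why the lock helps (again with invented stand-in names, mirroring `listing.synchronized { ... }` only in spirit): once both the update and the check-and-delete hold the same monitor, either ordering leaves the entry in place.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class AtomicClean {
       static final Map<String, String> store = new HashMap<>(); // KVStore stand-in

       // Runs the update and the clean concurrently; the check-and-delete is now
       // atomic with respect to the update, so the entry survives in either lock order.
       static boolean appSurvives() throws InterruptedException {
           store.clear();
           store.put("AAA", "AAA_1.inprogress");

           Thread cleaner = new Thread(() -> {
               synchronized (store) {                       // Thread 2: the cleanup
                   String seen = store.get("AAA");
                   if (seen != null && seen.endsWith(".inprogress")) {
                       store.remove("AAA");                 // only removes a genuinely stale view
                   }
               }
           });
           Thread updater = new Thread(() -> {
               synchronized (store) {                       // Thread 1: the listing update
                   store.put("AAA", "AAA_1");               // fresh entry written (or re-written)
               }
           });

           cleaner.start(); updater.start();
           cleaner.join();  updater.join();
           return store.containsKey("AAA");
       }

       public static void main(String[] args) throws InterruptedException {
           System.out.println(appSurvives());               // prints true in either interleaving
       }
   }
   ```

   If the cleaner wins the lock, it deletes the stale entry and the updater re-adds it; if the updater wins, the cleaner sees fresh data and leaves the entry alone.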
   
   
   ### Why are the changes needed?
   Fixes a bug where the HS Web UI displays `Application not found` for applications whose logs do exist.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Deployed in our Spark HS; the `java.util.NoSuchElementException` exception does not happen anymore.
   The `Application not found` error does not happen anymore.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1114972192

   @gengliangwang @srowen @dongjoon-hyun @turboFei @vanzin @HeartSaVioR
   Please take a look when you have time.




[GitHub] [spark] tanvn commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r866807615


##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -631,37 +631,38 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
 
   private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
     try {
-      val app = load(appId)
-      val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
-      assert(attempt.isEmpty || attempt.size == 1)
-      val isStale = attempt.headOption.exists { a =>
-        if (a.logPath != new Path(logPath).getName()) {
-          // If the log file name does not match, then probably the old log file was from an
-          // in progress application. Just return that the app should be left alone.
-          false
-        } else {
-          val maybeUI = synchronized {
-            activeUIs.remove(appId -> attemptId)
-          }
-
-          maybeUI.foreach { ui =>
-            ui.invalidate()
-            ui.ui.store.close()
+      var isStale = false
+      listing.synchronized {
+        val app = load(appId)
+        val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+        assert(attempt.isEmpty || attempt.size == 1)
+        isStale = attempt.headOption.exists { a =>
+          if (a.logPath != new Path(logPath).getName()) {
+            // If the log file name does not match, then probably the old log file was from an
+            // in progress application. Just return that the app should be left alone.
+            false
+          } else {
+            if (others.nonEmpty) {
+              val newAppInfo = new ApplicationInfoWrapper(app.info, others)
+              listing.write(newAppInfo)
+            } else {
+              listing.delete(classOf[ApplicationInfoWrapper], appId)
+            }
+            true
           }
-
-          diskManager.foreach(_.release(appId, attemptId, delete = true))
-          true
         }
       }
 
-      if (isStale) {
-        if (others.nonEmpty) {
-          val newAppInfo = new ApplicationInfoWrapper(app.info, others)
-          listing.write(newAppInfo)
-        } else {
-          listing.delete(classOf[ApplicationInfoWrapper], appId)
+      if(isStale) {

Review Comment:
   Thank you. Just fixed it.





[GitHub] [spark] gengliangwang commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1116261702

   ```
   If Thread 1 gets the lock on listing before Thread 2, isStale will be false, the app will not be removed from KVStore
   If Thread 2 gets the lock on listing before Thread 1, then isStale will be true, the app will be removed from KVStore but after that it will be added again by Thread 1.
   In both case, the entry for the application will not be deleted unexpectedly from KVStore.
   ```
   @tanvn It seems that the description doesn't match the current implementation. Can you update it?




[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1120217878

   @mridulm @gengliangwang @srowen @thejdeep 
   JFYI, I have deployed the newest version of this patch on my server since yesterday and there has been no exception so far.
   So I think the bug has been fixed, and `synchronized` on the `listing` is the key (since the race condition happens between the threads adding and removing data on `listing`).
   




[GitHub] [spark] srowen commented on pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1120416351

   Merged to master/3.3/3.2




[GitHub] [spark] gengliangwang commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119226177

   @tanvn Thanks for trying the new patch. Yes, the current approach doesn't fix all the race conditions.
   I took a look at the description again. The method `isProcessing` should only return false when all the updates are done. I wonder why the filtering doesn't work in https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L593




[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1116268140

   > Can you update it?
   
   @gengliangwang just updated the description. PTAL, thank you.
   
   




[GitHub] [spark] mridulm commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119277196

   Thanks for the test results @tanvn.
   @thejdeep, can you check if this is an issue in our deployment ? And if yes, what we have done to mitigate it (if any) - else, any thoughts on how to address it ?




[GitHub] [spark] srowen commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r866785032


##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -631,37 +631,38 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
 
   private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
     try {
-      val app = load(appId)
-      val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
-      assert(attempt.isEmpty || attempt.size == 1)
-      val isStale = attempt.headOption.exists { a =>
-        if (a.logPath != new Path(logPath).getName()) {
-          // If the log file name does not match, then probably the old log file was from an
-          // in progress application. Just return that the app should be left alone.
-          false
-        } else {
-          val maybeUI = synchronized {
-            activeUIs.remove(appId -> attemptId)
-          }
-
-          maybeUI.foreach { ui =>
-            ui.invalidate()
-            ui.ui.store.close()
+      var isStale = false
+      listing.synchronized {
+        val app = load(appId)
+        val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+        assert(attempt.isEmpty || attempt.size == 1)
+        isStale = attempt.headOption.exists { a =>
+          if (a.logPath != new Path(logPath).getName()) {
+            // If the log file name does not match, then probably the old log file was from an
+            // in progress application. Just return that the app should be left alone.
+            false
+          } else {
+            if (others.nonEmpty) {
+              val newAppInfo = new ApplicationInfoWrapper(app.info, others)
+              listing.write(newAppInfo)
+            } else {
+              listing.delete(classOf[ApplicationInfoWrapper], appId)
+            }
+            true
           }
-
-          diskManager.foreach(_.release(appId, attemptId, delete = true))
-          true
         }
       }
 
-      if (isStale) {
-        if (others.nonEmpty) {
-          val newAppInfo = new ApplicationInfoWrapper(app.info, others)
-          listing.write(newAppInfo)
-        } else {
-          listing.delete(classOf[ApplicationInfoWrapper], appId)
+      if(isStale) {

Review Comment:
   Nit: space after if
   This looks OK - finer-grained but not nesting locks now.





[GitHub] [spark] mridulm commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1116450710

   +CC @thejdeep, @shardulm94 




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r863132404


##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
   }
 
   private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
-    try {
-      val app = load(appId)
-      val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
-      assert(attempt.isEmpty || attempt.size == 1)
-      val isStale = attempt.headOption.exists { a =>
-        if (a.logPath != new Path(logPath).getName()) {
-          // If the log file name does not match, then probably the old log file was from an
-          // in progress application. Just return that the app should be left alone.
-          false
-        } else {
-          val maybeUI = synchronized {
-            activeUIs.remove(appId -> attemptId)
-          }
+    // SPARK-39083 prevent race condition between update and clean app data
+    listing.synchronized {
+      try {
+        val app = load(appId)
+        val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+        assert(attempt.isEmpty || attempt.size == 1)
+        val isStale = attempt.headOption.exists { a =>
+          if (a.logPath != new Path(logPath).getName()) {
+            // If the log file name does not match, then probably the old log file was from an
+            // in progress application. Just return that the app should be left alone.
+            false
+          } else {
+            val maybeUI = synchronized {

Review Comment:
   +1 for @srowen 's advice.





[GitHub] [spark] AmplabJenkins commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1114641958

   Can one of the admins verify this patch?




[GitHub] [spark] tanvn commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r863766221


##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
   }
 
   private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
-    try {
-      val app = load(appId)
-      val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
-      assert(attempt.isEmpty || attempt.size == 1)
-      val isStale = attempt.headOption.exists { a =>
-        if (a.logPath != new Path(logPath).getName()) {
-          // If the log file name does not match, then probably the old log file was from an
-          // in progress application. Just return that the app should be left alone.
-          false
-        } else {
-          val maybeUI = synchronized {
-            activeUIs.remove(appId -> attemptId)
-          }
+    // SPARK-39083 prevent race condition between update and clean app data
+    listing.synchronized {
+      try {
+        val app = load(appId)
+        val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+        assert(attempt.isEmpty || attempt.size == 1)
+        val isStale = attempt.headOption.exists { a =>
+          if (a.logPath != new Path(logPath).getName()) {
+            // If the log file name does not match, then probably the old log file was from an
+            // in progress application. Just return that the app should be left alone.
+            false
+          } else {
+            val maybeUI = synchronized {

Review Comment:
   @srowen @dongjoon-hyun 
   Thanks for the advice. Just fixed it. I think it will be a little bit slower but not too slow 🤔 
   





[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119204194

   @srowen @gengliangwang @mridulm 
   JFYI, I deployed this patch on my server 3 days ago and the `java.util.NoSuchElementException` exception occurred 3 times. So I guess just using `synchronized` on the code block of `cleanAppData` is not enough, and we should use `synchronized` on `listing` for better clarity, since it is used in other places. WDYT?




[GitHub] [spark] gengliangwang commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119226593

   cc @jixiongxiao-db @rednaxelafx 




[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119234111

   @gengliangwang cc @srowen 
   I just added some changes to the `cleanAppData` method. I think getting a lock on `listing` when loading and removing the stale app will fix the race condition; I have also moved the code block that removes from `activeUIs` out of the lock to avoid nested `synchronized`. PTAL when you have time.




[GitHub] [spark] srowen commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r862961041


##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
   }
 
   private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
-    try {
-      val app = load(appId)
-      val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
-      assert(attempt.isEmpty || attempt.size == 1)
-      val isStale = attempt.headOption.exists { a =>
-        if (a.logPath != new Path(logPath).getName()) {
-          // If the log file name does not match, then probably the old log file was from an
-          // in progress application. Just return that the app should be left alone.
-          false
-        } else {
-          val maybeUI = synchronized {
-            activeUIs.remove(appId -> attemptId)
-          }
+    // SPARK-39083 prevent race condition between update and clean app data
+    listing.synchronized {
+      try {
+        val app = load(appId)
+        val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+        assert(attempt.isEmpty || attempt.size == 1)
+        val isStale = attempt.headOption.exists { a =>
+          if (a.logPath != new Path(logPath).getName()) {
+            // If the log file name does not match, then probably the old log file was from an
+            // in progress application. Just return that the app should be left alone.
+            false
+          } else {
+            val maybeUI = synchronized {

Review Comment:
   The nested synchronized makes me uneasy - risks a deadlock. Why not just `synchronized` on everything? safe, but is it too slow?
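   The concern can be made concrete with a sketch (the lock names are invented; `tryLock` with a timeout is used so the demo reports the cycle instead of hanging): two threads acquiring the same pair of locks in opposite order is the classic deadlock shape that nesting risks.

   ```java
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicBoolean;
   import java.util.concurrent.locks.ReentrantLock;

   public class NestedLockSketch {
       static final ReentrantLock listingLock  = new ReentrantLock(); // stand-ins for the
       static final ReentrantLock providerLock = new ReentrantLock(); // two nested monitors

       // Each thread holds one lock and then wants the other: a lock-ordering cycle.
       // With plain synchronized both threads would hang; tryLock lets us observe it.
       static boolean bothAcquireSecondLock() throws InterruptedException {
           CountDownLatch bothHoldFirst = new CountDownLatch(2);
           AtomicBoolean t1Got = new AtomicBoolean(), t2Got = new AtomicBoolean();

           Thread t1 = new Thread(() -> {
               listingLock.lock();
               try {
                   bothHoldFirst.countDown();
                   awaitQuietly(bothHoldFirst);             // t2 now holds providerLock
                   t1Got.set(tryBriefly(providerLock));     // blocked by t2
               } finally { listingLock.unlock(); }
           });
           Thread t2 = new Thread(() -> {
               providerLock.lock();
               try {
                   bothHoldFirst.countDown();
                   awaitQuietly(bothHoldFirst);             // t1 now holds listingLock
                   t2Got.set(tryBriefly(listingLock));      // blocked by t1
               } finally { providerLock.unlock(); }
           });

           t1.start(); t2.start(); t1.join(); t2.join();
           return t1Got.get() && t2Got.get();               // never both: that's the cycle
       }

       static void awaitQuietly(CountDownLatch l) {
           try { l.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
       }

       static boolean tryBriefly(ReentrantLock l) {
           try {
               if (l.tryLock(200, TimeUnit.MILLISECONDS)) { l.unlock(); return true; }
               return false;
           } catch (InterruptedException e) { return false; }
       }

       public static void main(String[] args) throws InterruptedException {
           System.out.println(bothAcquireSecondLock());     // prints false
       }
   }
   ```

   Taking a single coarser lock, as suggested, trades a little throughput for eliminating this shape entirely.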





[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119230516

   @gengliangwang 
   Thank you for taking the time.
   Regarding the `isProcessing` method, in the scenario I wrote in the PR summary:
   In the next run of `checkForLogs`, the AAA application has finished, the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted, and a new log `viewfs://iu/log/spark3/AAA_1` has been created.
   In this case, `viewfs://iu/log/spark3/AAA_1` will be marked as `processing`:
   https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1391
   but `stale` takes its data from the `listing` map:
   https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L592
   which means it will contain `viewfs://iu/log/spark3/AAA_1.inprogress` (as it was added to the `listing` in the previous execution of `checkForLogs`), and `isProcessing` for `viewfs://iu/log/spark3/AAA_1.inprogress` will return false, so `cleanAppData` will be executed for it (which I think is not a problem).
    
   
   




[GitHub] [spark] srowen closed pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data
URL: https://github.com/apache/spark/pull/36424

