Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/02 05:54:23 UTC
[GitHub] [spark] tanvn opened a new pull request, #36424: SPARK-39083 : Fix race condition between update and clean app data
tanvn opened a new pull request, #36424:
URL: https://github.com/apache/spark/pull/36424
### What changes were proposed in this pull request?
Make `cleanAppData` atomic to prevent a race condition between updating and cleaning app data.
When the race condition happens, the ApplicationInfoWrapper for an application is deleted right after it has been updated by `mergeApplicationListing`.
As a result, there are cases where the HS Web UI displays `Application not found` for applications whose logs do exist.
#### Background
Currently, the HS runs `checkForLogs` to build the application list based on the current contents of the log directory, every 10 seconds by default.
- https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L296-L299
- https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L472
In each run, this method scans the specified logDir and parses the log files to update its KVStore:
- detect newly added/updated files to process: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L574-L578
- detect stale data to remove: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600
These 2 operations are executed in different threads as `submitLogProcessTask` uses `replayExecutor` to submit tasks.
https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1389-L1401
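As a rough illustration of why the two operations can interleave, here is a minimal, hypothetical Java sketch (the actual code is Scala; `ScanTasksSketch`, `runScan`, and `done` are invented stand-ins for `replayExecutor` receiving tasks via `submitLogProcessTask`):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ScanTasksSketch {
    static final AtomicInteger done = new AtomicInteger();

    // Stand-in for one round of checkForLogs: the "update" work and the
    // "clean stale data" work are submitted as separate tasks, so they can
    // run on different threads at the same time.
    public static void runScan() {
        ExecutorService replayExecutor = Executors.newFixedThreadPool(2);
        replayExecutor.submit(() -> done.incrementAndGet()); // parse new/updated logs
        replayExecutor.submit(() -> done.incrementAndGet()); // remove stale entries
        replayExecutor.shutdown();
        try {
            replayExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        runScan();
        System.out.println(done.get()); // prints 2
    }
}
```

Because nothing orders the two tasks relative to each other, any shared state they both touch needs its own synchronization.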
### When does the bug happen?
`Application not found` error happens in the following scenario:
In the first run of `checkForLogs`, a newly added log `viewfs://iu/log/spark3/AAA_1.inprogress` (the log of an in-progress application named AAA) is detected, so 2 entries are added to the KVStore:
- one entry whose key is the logPath (`viewfs://iu/log/spark3/AAA_1.inprogress`) and whose value is a LogInfo instance representing the log
- https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L495-L505
- https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L545-L552
- one entry whose key is the applicationId (`AAA`) and whose value is an ApplicationInfoWrapper instance holding the information of the application.
- https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L825
- https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172
In the next run of `checkForLogs`, the AAA application has finished: the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted and a new log `viewfs://iu/log/spark3/AAA_1` has been created. So `checkForLogs` does the following 2 things in 2 different threads:
- Thread 1: parses the new log `viewfs://iu/log/spark3/AAA_1` and updates data in its KVStore
- adds a new entry with key `viewfs://iu/log/spark3/AAA_1` and value: a LogInfo instance representing the log
- updates the entry with key=applicationId (`AAA`) with a new ApplicationInfoWrapper value (for example, the isCompleted flag changes from false to true)
- Thread 2: data related to `viewfs://iu/log/spark3/AAA_1.inprogress` is now considered stale and must be deleted.
- cleans app data for `viewfs://iu/log/spark3/AAA_1.inprogress`: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600
- Inside `cleanAppData`, the latest `ApplicationInfoWrapper` is first loaded from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L632
Most of the time, when this line is executed, Thread 1 has already finished updating the entry with key=applicationId (`AAA`) with the new ApplicationInfoWrapper, so the condition at https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L637 evaluates to false and `isStale` is false.
However, in some rare cases Thread 1 has not yet finished the update, so the old ApplicationInfoWrapper is loaded, `isStale` is true, and the ApplicationInfoWrapper entry is deleted from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L656-L662
The worst case is that the entry is deleted right after Thread 1 has finished updating it, so the entry for the ApplicationInfoWrapper of applicationId=`AAA` is removed forever. When users then access the Web UI for this application, `Application not found` is shown even though the log for the app does exist.
So here we make the `cleanAppData` method atomic, just like the `addListing` method https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172, so that:
- If Thread 1 gets the lock on `listing` before Thread 2, `isStale` will be false and the app will not be removed from the KVStore.
- If Thread 2 gets the lock on `listing` before Thread 1, `isStale` will be true and the app will be removed from the KVStore, but it will be added again by Thread 1 afterwards.
In both cases, the entry for the application will not be deleted unexpectedly from the KVStore.
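The two orderings described above can be sketched with a minimal, hypothetical Java example (the actual code is Scala; `listing` here is a plain HashMap standing in for the KVStore, and the log names mirror the scenario in the description):

```java
import java.util.HashMap;
import java.util.Map;

public class ListingRaceSketch {
    // Hypothetical stand-in for the History Server's `listing` KVStore:
    // appId -> log file name of the attempt currently recorded.
    static final Map<String, String> listing = new HashMap<>();

    // Thread 1: record the finished attempt (AAA_1 replaces AAA_1.inprogress).
    static void updateApp() {
        synchronized (listing) {
            listing.put("AAA", "AAA_1");
        }
    }

    // Thread 2: delete the entry only if it still points at the stale
    // in-progress log. Holding the same lock makes load-check-delete atomic.
    static void cleanAppData(String staleLog) {
        synchronized (listing) {
            boolean isStale = staleLog.equals(listing.get("AAA"));
            if (isStale) {
                listing.remove("AAA");
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        listing.put("AAA", "AAA_1.inprogress"); // state after the first scan
        Thread t1 = new Thread(ListingRaceSketch::updateApp);
        Thread t2 = new Thread(() -> cleanAppData("AAA_1.inprogress"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Whichever thread wins the lock, the entry survives: if t1 ran first,
        // isStale is false; if t2 ran first, the entry was deleted and then
        // re-added by t1.
        System.out.println(listing.get("AAA")); // prints AAA_1
    }
}
```

Without the shared lock, Thread 2's check could read the stale value, lose the CPU, and then delete the entry after Thread 1's write, which is exactly the reported bug.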
### Why are the changes needed?
Fix the bug where the HS Web UI displays `Application not found` for applications whose logs do exist.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Deployed on our Spark HS: the `java.util.NoSuchElementException` exception does not happen anymore.
The `Application not found` error does not happen anymore.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1114972192
@gengliangwang @srowen @dongjoon-hyun @turboFei @vanzin @HeartSaVioR
Please take a look when you have time.
[GitHub] [spark] tanvn commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r866807615
##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -631,37 +631,38 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
try {
- val app = load(appId)
- val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
- assert(attempt.isEmpty || attempt.size == 1)
- val isStale = attempt.headOption.exists { a =>
- if (a.logPath != new Path(logPath).getName()) {
- // If the log file name does not match, then probably the old log file was from an
- // in progress application. Just return that the app should be left alone.
- false
- } else {
- val maybeUI = synchronized {
- activeUIs.remove(appId -> attemptId)
- }
-
- maybeUI.foreach { ui =>
- ui.invalidate()
- ui.ui.store.close()
+ var isStale = false
+ listing.synchronized {
+ val app = load(appId)
+ val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+ assert(attempt.isEmpty || attempt.size == 1)
+ isStale = attempt.headOption.exists { a =>
+ if (a.logPath != new Path(logPath).getName()) {
+ // If the log file name does not match, then probably the old log file was from an
+ // in progress application. Just return that the app should be left alone.
+ false
+ } else {
+ if (others.nonEmpty) {
+ val newAppInfo = new ApplicationInfoWrapper(app.info, others)
+ listing.write(newAppInfo)
+ } else {
+ listing.delete(classOf[ApplicationInfoWrapper], appId)
+ }
+ true
}
-
- diskManager.foreach(_.release(appId, attemptId, delete = true))
- true
}
}
- if (isStale) {
- if (others.nonEmpty) {
- val newAppInfo = new ApplicationInfoWrapper(app.info, others)
- listing.write(newAppInfo)
- } else {
- listing.delete(classOf[ApplicationInfoWrapper], appId)
+ if(isStale) {
Review Comment:
Thank you. Just fixed it.
[GitHub] [spark] gengliangwang commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1116261702
```
If Thread 1 gets the lock on listing before Thread 2, isStale will be false, the app will not be removed from KVStore
If Thread 2 gets the lock on listing before Thread 1, then isStale will be true, the app will be removed from KVStore but after that it will be added again by Thread 1.
In both case, the entry for the application will not be deleted unexpectedly from KVStore.
```
@tanvn It seems that the description doesn't match the current implementation. Can you update it?
[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1120217878
@mridulm @gengliangwang @srowen @thejdeep
JFYI, I have deployed the newest version of this patch on my server since yesterday and there is no exception so far.
So I think the bug has been fixed and `synchronized` on the `listing` is the key point (since the race condition happens between the threads adding and removing data on `listing`).
[GitHub] [spark] srowen commented on pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1120416351
Merged to master/3.3/3.2
[GitHub] [spark] gengliangwang commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119226177
@tanvn Thanks for trying the new patch. Yes, the current approach doesn't fix all the race conditions.
I took a look at the description again. The method `isProcessing` should only return false when all the updates are done. I wonder why the filtering doesn't work in https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L593
[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1116268140
> Can you update it?
@gengliangwang just updated the description. PTAL, thank you.
[GitHub] [spark] mridulm commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119277196
Thanks for the test results @tanvn.
@thejdeep, can you check if this is an issue in our deployment? And if yes, what have we done to mitigate it (if any); else, any thoughts on how to address it?
[GitHub] [spark] srowen commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r866785032
##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -631,37 +631,38 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
try {
- val app = load(appId)
- val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
- assert(attempt.isEmpty || attempt.size == 1)
- val isStale = attempt.headOption.exists { a =>
- if (a.logPath != new Path(logPath).getName()) {
- // If the log file name does not match, then probably the old log file was from an
- // in progress application. Just return that the app should be left alone.
- false
- } else {
- val maybeUI = synchronized {
- activeUIs.remove(appId -> attemptId)
- }
-
- maybeUI.foreach { ui =>
- ui.invalidate()
- ui.ui.store.close()
+ var isStale = false
+ listing.synchronized {
+ val app = load(appId)
+ val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+ assert(attempt.isEmpty || attempt.size == 1)
+ isStale = attempt.headOption.exists { a =>
+ if (a.logPath != new Path(logPath).getName()) {
+ // If the log file name does not match, then probably the old log file was from an
+ // in progress application. Just return that the app should be left alone.
+ false
+ } else {
+ if (others.nonEmpty) {
+ val newAppInfo = new ApplicationInfoWrapper(app.info, others)
+ listing.write(newAppInfo)
+ } else {
+ listing.delete(classOf[ApplicationInfoWrapper], appId)
+ }
+ true
}
-
- diskManager.foreach(_.release(appId, attemptId, delete = true))
- true
}
}
- if (isStale) {
- if (others.nonEmpty) {
- val newAppInfo = new ApplicationInfoWrapper(app.info, others)
- listing.write(newAppInfo)
- } else {
- listing.delete(classOf[ApplicationInfoWrapper], appId)
+ if(isStale) {
Review Comment:
Nit: space after if
This looks OK - finer-grained but not nesting locks now.
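The "finer-grained but not nesting" shape the review settled on can be sketched roughly in Java (hypothetical lock objects standing in for `listing` and the mutex guarding `activeUIs`; the string results are illustrative only, not Spark code):

```java
public class LockOrderSketch {
    static final Object listing = new Object();
    static final Object uiLock = new Object(); // guards the active-UI map in this sketch

    // Nested locking (taking uiLock while holding listing) risks deadlock if
    // another code path ever takes the two locks in the opposite order. The
    // shape below acquires them one after the other instead of nesting them.
    static String cleanWithoutNesting() {
        boolean isStale;
        synchronized (listing) {
            isStale = true; // stand-in for the load/partition/staleness check
        }
        if (isStale) {
            synchronized (uiLock) {
                return "ui detached"; // stand-in for invalidating the cached UI
            }
        }
        return "kept";
    }

    public static void main(String[] args) {
        System.out.println(cleanWithoutNesting()); // prints ui detached
    }
}
```

Releasing the first lock before taking the second means no thread ever holds both, so no lock-ordering cycle can form.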
[GitHub] [spark] mridulm commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1116450710
+CC @thejdeep, @shardulm94
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r863132404
##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
}
private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
- try {
- val app = load(appId)
- val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
- assert(attempt.isEmpty || attempt.size == 1)
- val isStale = attempt.headOption.exists { a =>
- if (a.logPath != new Path(logPath).getName()) {
- // If the log file name does not match, then probably the old log file was from an
- // in progress application. Just return that the app should be left alone.
- false
- } else {
- val maybeUI = synchronized {
- activeUIs.remove(appId -> attemptId)
- }
+ // SPARK-39083 prevent race condition between update and clean app data
+ listing.synchronized {
+ try {
+ val app = load(appId)
+ val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+ assert(attempt.isEmpty || attempt.size == 1)
+ val isStale = attempt.headOption.exists { a =>
+ if (a.logPath != new Path(logPath).getName()) {
+ // If the log file name does not match, then probably the old log file was from an
+ // in progress application. Just return that the app should be left alone.
+ false
+ } else {
+ val maybeUI = synchronized {
Review Comment:
+1 for @srowen 's advice.
[GitHub] [spark] AmplabJenkins commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1114641958
Can one of the admins verify this patch?
[GitHub] [spark] tanvn commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r863766221
##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
}
private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
- try {
- val app = load(appId)
- val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
- assert(attempt.isEmpty || attempt.size == 1)
- val isStale = attempt.headOption.exists { a =>
- if (a.logPath != new Path(logPath).getName()) {
- // If the log file name does not match, then probably the old log file was from an
- // in progress application. Just return that the app should be left alone.
- false
- } else {
- val maybeUI = synchronized {
- activeUIs.remove(appId -> attemptId)
- }
+ // SPARK-39083 prevent race condition between update and clean app data
+ listing.synchronized {
+ try {
+ val app = load(appId)
+ val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+ assert(attempt.isEmpty || attempt.size == 1)
+ val isStale = attempt.headOption.exists { a =>
+ if (a.logPath != new Path(logPath).getName()) {
+ // If the log file name does not match, then probably the old log file was from an
+ // in progress application. Just return that the app should be left alone.
+ false
+ } else {
+ val maybeUI = synchronized {
Review Comment:
@srowen @dongjoon-hyun
Thanks for the advice. Just fixed it. I think it will be a little bit slower but not too slow 🤔
[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119204194
@srowen @gengliangwang @mridulm
JFYI, I deployed this patch on my server 3 days ago and the `java.util.NoSuchElementException` exception occurred 3 times. So I guess just using `synchronized` on the code block of `cleanAppData` is not enough, and we should use `synchronized` on `listing`, which also reads better since `listing` is used in other places. WDYT?
[GitHub] [spark] gengliangwang commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119226593
cc @jixiongxiao-db @rednaxelafx
[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119234111
@gengliangwang cc @srowen
I just added some changes to the `cleanAppData` method. I think getting a lock on `listing` when loading and removing the stale app will fix the race condition, and I have also moved the code block for removing from `activeUIs` out to avoid nested `synchronized`. PTAL when you have time.
[GitHub] [spark] srowen commented on a diff in pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36424:
URL: https://github.com/apache/spark/pull/36424#discussion_r862961041
##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -630,41 +630,44 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
}
private def cleanAppData(appId: String, attemptId: Option[String], logPath: String): Unit = {
- try {
- val app = load(appId)
- val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
-
- assert(attempt.isEmpty || attempt.size == 1)
- val isStale = attempt.headOption.exists { a =>
- if (a.logPath != new Path(logPath).getName()) {
- // If the log file name does not match, then probably the old log file was from an
- // in progress application. Just return that the app should be left alone.
- false
- } else {
- val maybeUI = synchronized {
- activeUIs.remove(appId -> attemptId)
- }
+ // SPARK-39083 prevent race condition between update and clean app data
+ listing.synchronized {
+ try {
+ val app = load(appId)
+ val (attempt, others) = app.attempts.partition(_.info.attemptId == attemptId)
+
+ assert(attempt.isEmpty || attempt.size == 1)
+ val isStale = attempt.headOption.exists { a =>
+ if (a.logPath != new Path(logPath).getName()) {
+ // If the log file name does not match, then probably the old log file was from an
+ // in progress application. Just return that the app should be left alone.
+ false
+ } else {
+ val maybeUI = synchronized {
Review Comment:
The nested synchronized makes me uneasy - risks a deadlock. Why not just `synchronized` on everything? safe, but is it too slow?
[GitHub] [spark] tanvn commented on pull request #36424: [SPARK-39083][CORE] : Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
tanvn commented on PR #36424:
URL: https://github.com/apache/spark/pull/36424#issuecomment-1119230516
@gengliangwang
Thank you for taking your time.
Regarding `isProcessing ` method, in the scenario I wrote in the PR summary:
In the next run of checkForLogs, now the AAA application has finished, the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted and a new log `viewfs://iu/log/spark3/AAA_1` is created.
In this case, `viewfs://iu/log/spark3/AAA_1` will be marked as `processing`.
https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1391
but the `stale` is taking data from the `listing` map,
https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L592
which means it will contain `viewfs://iu/log/spark3/AAA_1.inprogress` (as it was added to `listing` in the previous execution of `checkForLogs`), and `isProcessing` for `viewfs://iu/log/spark3/AAA_1.inprogress` will return false, so `cleanAppData` will be executed for it (which I think is not a problem)
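The filtering being discussed can be sketched as a hypothetical Java snippet (the set contents mirror the AAA scenario above; `staleOf` is an invented helper, not Spark code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StaleFilterSketch {
    // Invented helper: entries recorded in the previous scan that are neither
    // among the freshly scanned paths nor currently being parsed are stale.
    static List<String> staleOf(Set<String> listed, Set<String> updated, Set<String> processing) {
        List<String> stale = new ArrayList<>();
        for (String log : listed) {
            if (!updated.contains(log) && !processing.contains(log)) {
                stale.add(log);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Set<String> listed = new HashSet<>(List.of("AAA_1.inprogress")); // previous scan
        Set<String> updated = new HashSet<>(List.of("AAA_1"));           // current scan
        Set<String> processing = new HashSet<>(List.of("AAA_1"));        // being parsed now
        // The old .inprogress path is no longer "processing", so it passes the
        // filter and reaches cleanAppData even while AAA_1 is still being parsed.
        System.out.println(staleOf(listed, updated, processing)); // prints [AAA_1.inprogress]
    }
}
```

This is why `isProcessing` alone cannot prevent the race: it guards the new path, not the stale one.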
[GitHub] [spark] srowen closed pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data
Posted by GitBox <gi...@apache.org>.
srowen closed pull request #36424: [SPARK-39083][CORE] Fix race condition between update and clean app data
URL: https://github.com/apache/spark/pull/36424