Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/12 03:33:48 UTC

[GitHub] [spark] thejdeep commented on a diff in pull request #38983: [SPARK-41447][CORE] Clean up expired event log files that don't exist in listing db

thejdeep commented on code in PR #38983:
URL: https://github.com/apache/spark/pull/38983#discussion_r1045377745


##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -272,17 +272,17 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
 
     // Disable the background thread during tests.
     if (!conf.contains(IS_TESTING)) {
-      // A task that periodically checks for event log updates on disk.
-      logDebug(s"Scheduling update thread every $UPDATE_INTERVAL_S seconds")
-      pool.scheduleWithFixedDelay(
-        getRunner(() => checkForLogs()), 0, UPDATE_INTERVAL_S, TimeUnit.SECONDS)
-
       if (conf.get(CLEANER_ENABLED)) {
         // A task that periodically cleans event logs on disk.
         pool.scheduleWithFixedDelay(
           getRunner(() => cleanLogs()), 0, CLEAN_INTERVAL_S, TimeUnit.SECONDS)
       }
 
+      // A task that periodically checks for event log updates on disk.

Review Comment:
   We can only control this interleaving to a certain extent. For example, when `UPDATE_INTERVAL_S` is configured to be less than `CLEAN_INTERVAL_S` (which is usually the case), we will still end up running `checkForLogs` before `cleanLogs`.
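
   For context, a minimal sketch of why reordering the two `scheduleWithFixedDelay` calls only partially controls the interleaving. The interval values below are hypothetical, not Spark's defaults; the simulation assumes near-zero task duration, so fixed-delay firing times approximate a fixed rate. Whenever the update interval is shorter than the clean interval, many `checkForLogs` runs land between consecutive `cleanLogs` runs regardless of which task is scheduled first.

   ```scala
   object ScheduleSketch {
     // Firing times (in seconds) of a fixed-delay task over a horizon,
     // assuming near-zero task duration so delay ~= period.
     def firings(initialDelay: Long, period: Long, horizon: Long): Seq[Long] =
       initialDelay to horizon by period

     def main(args: Array[String]): Unit = {
       val updateIntervalS = 10L   // hypothetical UPDATE_INTERVAL_S
       val cleanIntervalS  = 3600L // hypothetical CLEAN_INTERVAL_S
       val horizonS        = 7200L

       val updates = firings(0L, updateIntervalS, horizonS)

       // Number of checkForLogs runs strictly between the first two cleanLogs
       // runs (t = 0 and t = cleanIntervalS). Scheduling order cannot make
       // this zero while updateIntervalS < cleanIntervalS.
       val updatesBetween = updates.count(t => t > 0L && t < cleanIntervalS)
       println(s"checkForLogs runs between first two cleanLogs runs: $updatesBetween")
     }
   }
   ```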



##########
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:
##########
@@ -977,6 +977,20 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
         listing.delete(classOf[LogInfo], log.logPath)
       }
     }
+    // Delete log files that don't exist in listing database and exceed the configured max age.
+    Option(fs.listStatus(new Path(logDir))).map(_.toSeq).getOrElse(Nil)

Review Comment:
   I think the problem with this is: if new event logs are added to the event logging directory and the `cleanLogs` thread runs before the `checkForLogs` thread, we would end up deleting the newly added event logs. Is my understanding correct?
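
   For illustration, a hedged sketch of how the deletion condition described in the diff comment ("don't exist in listing database and exceed the configured max age") would interact with this race. All names here (`FileStatus`, `expiredUnlisted`, the sample paths) are hypothetical stand-ins, not the actual `FsHistoryProvider` code. Under this reading, a freshly written event log has a recent modification time, so it fails the max-age check and survives a `cleanLogs` run that happens before `checkForLogs` registers it.

   ```scala
   object CleanSketch {
     // Hypothetical stand-in for a Hadoop FileStatus entry.
     final case class FileStatus(path: String, modTimeMs: Long)

     // Files eligible for deletion: absent from the listing DB AND older
     // than maxAge. This mirrors the condition described in the diff, with
     // hypothetical names.
     def expiredUnlisted(files: Seq[FileStatus],
                         listedPaths: Set[String],
                         nowMs: Long,
                         maxAgeMs: Long): Seq[FileStatus] =
       files.filter(f => !listedPaths.contains(f.path) && nowMs - f.modTimeMs > maxAgeMs)

     def main(args: Array[String]): Unit = {
       val now    = 1000000L
       val maxAge = 100000L
       val files = Seq(
         FileStatus("app-old",  now - 200000L), // stale and unlisted -> deleted
         FileStatus("app-new",  now - 1000L),   // freshly written, unlisted -> kept
         FileStatus("app-seen", now - 200000L)  // stale but in listing DB -> not matched here
       )
       val toDelete = expiredUnlisted(files, Set("app-seen"), now, maxAge)
       println(toDelete.map(_.path))
     }
   }
   ```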



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org