You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/02 23:30:36 UTC

[GitHub] [spark] HeartSaVioR commented on a change in pull request #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink

HeartSaVioR commented on a change in pull request #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#discussion_r352915190
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ##########
 @@ -267,10 +269,22 @@ class FileStreamSource(
     val logOffset = FileStreamSourceOffset(end).logOffset
 
     sourceCleaner.foreach { cleaner =>
-      val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2)
-      val validFileEntities = files.filter(_.batchId == logOffset)
-      logDebug(s"completed file entries: ${validFileEntities.mkString(",")}")
-      validFileEntities.foreach(cleaner.clean)
+      sourceHasMetadata match {
+        case Some(true) if !warnedIgnoringCleanSourceOption =>
+          logWarning("Ignoring 'cleanSource' option since source path refers to the output" +
+            " directory of FileStreamSink.")
+          warnedIgnoringCleanSourceOption = true
+
+        case Some(false) =>
+          val files = metadataLog.get(Some(logOffset), Some(logOffset)).flatMap(_._2)
+          val validFileEntities = files.filter(_.batchId == logOffset)
+          logDebug(s"completed file entries: ${validFileEntities.mkString(",")}")
+          validFileEntities.foreach(cleaner.clean)
+
+        case _ =>
+          logWarning("Ignoring 'cleanSource' option since Spark hasn't figured out whether " +
 
 Review comment:
   Ah OK I guess I got your point now. I'm also in favor of being "fail-fast" and the suggestion fits it. Thanks! Just updated.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org