Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/24 18:27:27 UTC

[GitHub] [spark] xuanyuanking opened a new pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

xuanyuanking opened a new pull request #31638:
URL: https://github.com/apache/spark/pull/31638


   ### What changes were proposed in this pull request?
   - Add a flag `spark.sql.streaming.fileSink.formatCheck.enabled` to skip the format check for the file streaming sink. The check can be turned off when the user wants to read the directory as a batch output.
   - When checking a glob path throws an exception, we assume the user wants to read the directory as a batch output.
   
   ### Why are the changes needed?
   - Some users may read data with a very long glob path, and `isDirectory` may fail when the path is too long. We should ignore the error when the path is a glob path, since the file streaming sink doesn't support glob paths.
   - Checking whether a directory was written by the file streaming sink may fail due to various storage issues. We should add a flag that allows users to disable the check and read the directory as a batch output.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   - A long glob path no longer throws an exception during the file sink format check.
   - Adds a new flag `spark.sql.streaming.fileSink.formatCheck.enabled` to control the metadata checking logic, as sketched below.
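
   For illustration only (not part of the patch): a minimal sketch of how a user could turn the check off and read a file sink output directory as plain batch output, assuming the flag name proposed above.

   ```scala
   // Hypothetical usage sketch: disable the file sink format check so the
   // directory is read as ordinary batch output, ignoring any _spark_metadata
   // log left behind by a streaming query. The output path is a placeholder.
   spark.conf.set("spark.sql.streaming.fileSink.formatCheck.enabled", "false")

   val df = spark.read.parquet("/path/to/streaming/sink/output")
   df.show()
   ```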
   
   
   ### How was this patch tested?
   New UT added.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-807319959


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136525/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833358042


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42720/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-806839568


   **[Test build #136525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136525/testReport)** for PR 31638 at commit [`11f6bf0`](https://github.com/apache/spark/commit/11f6bf0adfa2bfec1b12f0262f68c5b5a8868eee).




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823156542


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42210/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824088615


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137734/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-785296286


   **[Test build #135434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135434/testReport)** for PR 31638 at commit [`743b19b`](https://github.com/apache/spark/commit/743b19b2198c6ea16b64f42c97ac2809efc94967).




[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833538053


   Thanks! Merging to master.




[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824007338


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137726/
   




[GitHub] [spark] HeartSaVioR edited a comment on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-808613061


   OK, I agree with the direction of setting the goal as fixing the "regression" only, as long as we have no clear solution to fix the root cause or the overall problem.
   
   Just one thing I think we'd like to make clear: the flag on the behavior. As we are sure we are fixing a regression here, I wonder whether we need to introduce a flag for that (that was my concern about the initial shape of the PR); would someone want to opt out of the regression fix and leave the behavior as it is?




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823938868


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42253/
   




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823153925


   **[Test build #137682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137682/testReport)** for PR 31638 at commit [`7c40405`](https://github.com/apache/spark/commit/7c404057827b679bbd47f5a01e5994c8ccabf471).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823942366


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42253/
   




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824003876


   **[Test build #137726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137726/testReport)** for PR 31638 at commit [`8bd9297`](https://github.com/apache/spark/commit/8bd9297da4a0873e0e0bfbc420d215918eae6e96).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r596922892



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala
##########
@@ -575,6 +576,43 @@ abstract class FileStreamSinkSuite extends StreamTest {
       }
     }
   }
+
+  test("formatCheck flag") {
+    withSQLConf(
+      "fs.file.impl" -> classOf[FailFormatCheckFileSystem].getName,

Review comment:
       Thanks for the guidance, let me check the usage of FileSystemTestHelper.






[GitHub] [spark] xuanyuanking commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-790270886


   Yep, let's wait for more voices.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-807319959


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136525/
   




[GitHub] [spark] HeartSaVioR edited a comment on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-812224204


   I guess we still haven't decided on the following:
   
   > Just one thing I think we'd like to make clear: the flag on the behavior. As we are sure we are fixing a regression here, I wonder whether we need to introduce a flag for that (that was my concern about the initial shape of the PR); would someone want to opt out of the regression fix and leave the behavior as it is?
   
   cc. @zsxwing 





[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r619916834



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -43,12 +43,20 @@ object FileStreamSink extends Logging {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
+        try {
+          val fs = hdfsPath.getFileSystem(hadoopConf)
+          if (fs.isDirectory(hdfsPath)) {
+            val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+            fs.exists(metadataPath)
+          } else {
+            false
+          }
+        } catch {
+          case e: SparkException => throw e

Review comment:
       This looks to be different from 2.4:
   
   https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
   
   Do you have specific exceptions to catch here? I can't imagine any SparkException coming from the above logic.






[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-802038630


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40794/
   




[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r596927316



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       Thanks for the summary. I addressed it in d7abb94.
   Also, I tried adding the config to the exception branch and the error message.






[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-811959583


   **[Test build #136822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136822/testReport)** for PR 31638 at commit [`207d063`](https://github.com/apache/spark/commit/207d063a99b12aec6bb6019d730577d04fcafbeb).




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833350382


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42720/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824050253


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42261/
   




[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r594073932



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       @sunchao 
   Thanks for the answer! That is the same as my expectation.
   
   @xuanyuanking 
   According to the answer, I don't think there's a risk in treating a glob path as wrong input and just returning false. It has been working like that, plus the possibility of throwing an additional exception. I don't think we need to add a flag to keep the old behavior - we can just fix it.






[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-802036977


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40794/
   




[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r586286229



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       Let's focus on the "glob path" here - from a quick look at the org.apache.hadoop.fs.FileSystem javadoc, both `isDirectory()` and `exists()` seem to require an exact path, not a glob path. The method that accepts a glob path is `globStatus()`, and there the API clearly names the parameter `pathPattern`.
   
   https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html
   
   And intuitively, it sounds very odd to me if someone can say a glob path is a directory while the matching paths can be both files and directories; the same goes for checking existence. I don't think it's feasible to "expect" a meaningful value from calling these methods with a glob path. That sounds to me like "undefined behavior", even if some weird file system could return true for the case.
   
   That said, I'd rather consider a glob path input as a "wrong one" and always return false (some sort of debug log message is fine if we really want to log). If there is code relying on such behavior, I don't think that code is correct. I'd rather say the possible paths should be populated beforehand, and this method should be called once per path (see the sketch below).
   
   I still need to hear from others, but if the consensus is to just disallow glob paths here, we won't need to introduce the new configuration.
   
   @tdas @zsxwing @jose-torres @viirya @gaborgsomogyi I would like to hear your thoughts on this. Thanks!
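
   A minimal sketch of what "populate the paths beforehand" could look like, using `globStatus` (this is an assumption about how a caller might do it, not existing Spark code):

   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path

   // Expand a glob pattern into the concrete matching paths. globStatus accepts
   // a path pattern and returns the matching FileStatus entries (or null).
   def expandGlob(pattern: String, hadoopConf: Configuration): Seq[Path] = {
     val globPath = new Path(pattern)
     val fs = globPath.getFileSystem(hadoopConf)
     Option(fs.globStatus(globPath)).map(_.map(_.getPath).toSeq).getOrElse(Seq.empty)
   }

   // Each concrete path could then be checked for a metadata directory
   // individually, so isDirectory()/exists() never see a glob pattern.
   ```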






[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-811989770


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41402/
   




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823942325


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42253/
   




[GitHub] [spark] zsxwing commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
zsxwing commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-807747189


   I think the issue we are trying to fix is: when a glob path is valid but we cannot call `getFileStatus` on it, how do we allow users to access the batch output?
   
   For example, a glob path can be very long, such as `/foo/bar/{20200101,20200102,...20201231}`, which is not accepted by `getFileStatus` because most cloud storages won't accept such a long path. However, after we expand the glob path into real file paths, those paths are valid.
   
   In 2.4, we blindly ignored errors when checking whether a directory has `_metadata_log` or not, so the above case works in 2.4. However, in 3.0, we no longer ignore errors, and that exposes this issue.
   
   In addition, a glob path can also be a valid normal path, which makes it much more complicated. For example,
   
   ```
     test("foo") {
       withTempDir { tempDir =>
         val path = new java.io.File(tempDir, "[]")
         val chk = new java.io.File(tempDir, "chk")
         val q = spark.readStream.format("rate").load().writeStream.format("parquet")
           .trigger(org.apache.spark.sql.streaming.Trigger.Once())
           .option("checkpointLocation", chk.getCanonicalPath)
           .start(path.getCanonicalPath)
         q.awaitTermination()
         assert(SparkHadoopUtil.get.isGlobPath(new Path(path.getCanonicalPath)))
         val fileIndex = spark.read.format("parquet").load(path.getCanonicalPath)
           .queryExecution.executedPlan.collectFirst {
           case f: FileSourceScanExec => f.relation.location
         }.head
         assert(fileIndex.isInstanceOf[MetadataLogFileIndex])
       }
     }
   ```
   
   Ideally, batch queries should not be impacted by the extra code from streaming queries, but unfortunately that's impossible now. I'm inclined to go with the initial approach, which only skips glob paths when an error is thrown.




[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833318356


   **[Test build #138199 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138199/testReport)** for PR 31638 at commit [`867363f`](https://github.com/apache/spark/commit/867363fb95aa95617aa9ee8a08d55a39c50d961f).




[GitHub] [spark] steveloughran commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
steveloughran commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r594455319



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {
+      path match {
+        case Seq(singlePath) =>
+          val hdfsPath = new Path(singlePath)
+          try {
+            val fs = hdfsPath.getFileSystem(hadoopConf)
+            if (fs.isDirectory(hdfsPath)) {
+              val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+              fs.exists(metadataPath)
+            } else {
+              false
+            }
+          } catch {
+            case NonFatal(e) if SparkHadoopUtil.get.isGlobPath(hdfsPath) =>
+              // If this is a glob path, the failure is likely because the path is too long.
+              // Since file streaming sink doesn't support glob path, we can assume there is no
+              // metadata directory.
+              logWarning(s"Assume no metadata directory. Error while looking for " +

Review comment:
       It would be handy to include the path in question here, as there may be other issues (permissions, etc.).
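
   A hedged sketch of that suggestion (names mirror the diff above; this is not the merged change):

   ```scala
   // Include the offending path in the warning so permission or path-length
   // issues are easier to diagnose, and pass the exception for the stack trace.
   logWarning(s"Assume no metadata directory. Error while looking for " +
     s"metadata directory in the path: $singlePath.", e)
   ```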

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {
+      path match {
+        case Seq(singlePath) =>
+          val hdfsPath = new Path(singlePath)
+          try {
+            val fs = hdfsPath.getFileSystem(hadoopConf)
+            if (fs.isDirectory(hdfsPath)) {
+              val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+              fs.exists(metadataPath)

Review comment:
       does this need to be a file/dir, or is existence all that matters?

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala
##########
@@ -575,6 +576,43 @@ abstract class FileStreamSinkSuite extends StreamTest {
       }
     }
   }
+
+  test("formatCheck flag") {
+    withSQLConf(
+      "fs.file.impl" -> classOf[FailFormatCheckFileSystem].getName,
+      "fs.file.impl.disable.cache" -> "true") {
+      withTempDir { tempDir =>
+        val path = new File(tempDir, "text").getCanonicalPath
+        Seq("foo").toDF.write.format("text").save(path)
+        def assertFail(): Unit = {
+          val e = intercept[IOException] {
+            spark.read.format("text").load(path)
+          }
+          assert(e.getMessage == "cannot access metadata log")

Review comment:
       === ? or a .contains()?

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala
##########
@@ -575,6 +576,43 @@ abstract class FileStreamSinkSuite extends StreamTest {
       }
     }
   }
+
+  test("formatCheck flag") {
+    withSQLConf(
+      "fs.file.impl" -> classOf[FailFormatCheckFileSystem].getName,

Review comment:
       Not needed here, but it's worth knowing that the hadoop-common test jar has a class `org.apache.hadoop.fs.FileSystemTestHelper` that lets you inject an FS instance into the cache. Very useful from time to time. Spark could either use that or add its own (test module) class to o.a.h.fs and call `FileSystem.addFileSystemForTesting()` when needed.






[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r600113639



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       @xuanyuanking 
       Sorry, I didn't realize you'd updated this PR.
   
   As I said earlier, if we want to ignore the metadata for a file sink output directory, that should be added as a "source option". Let's handle that in a separate JIRA issue / PR so that we can continue the discussion afterwards.






[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-785462318


   **[Test build #135434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135434/testReport)** for PR 31638 at commit [`743b19b`](https://github.com/apache/spark/commit/743b19b2198c6ea16b64f42c97ac2809efc94967).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r584216596



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       I think this can be simplified after answering the question "do `isDirectory()` and `exists()` work with a glob path?".
   
   If these methods don't support glob paths (and I expect they don't), we can check for a glob path up front instead of dealing with the exception, as sketched below. That would also remove the need for the context that "these methods can fail when the path is a glob and the path is too long".
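
   A rough sketch of that idea (an assumption, not the merged change; the names mirror the diff above):

   ```scala
   // Check for a glob path up front; the file streaming sink never writes to a
   // glob path, so we can return false without touching the file system at all.
   val hdfsPath = new Path(singlePath)
   if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
     false
   } else {
     val fs = hdfsPath.getFileSystem(hadoopConf)
     fs.isDirectory(hdfsPath) && fs.exists(getMetadataLogPath(fs, hdfsPath, sqlConf))
   }
   ```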






[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r623872360



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -43,12 +43,20 @@ object FileStreamSink extends Logging {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
+        try {
+          val fs = hdfsPath.getFileSystem(hadoopConf)
+          if (fs.isDirectory(hdfsPath)) {
+            val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+            fs.exists(metadataPath)
+          } else {
+            false
+          }
+        } catch {
+          case e: SparkException => throw e

Review comment:
       This was added to fix the UT `org.apache.spark.sql.streaming.StreamingQuerySuite.detect escaped path and report the migration guide`, which expects a SparkException to be thrown from `checkEscapedMetadataPath`.






[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823153961


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137682/
   




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823149624


   **[Test build #137682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137682/testReport)** for PR 31638 at commit [`7c40405`](https://github.com/apache/spark/commit/7c404057827b679bbd47f5a01e5994c8ccabf471).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-802004242


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136212/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833501326


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138199/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833358042


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42720/
   




[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-808072941


   A glob path is valid as the path of a data source, but I'm not sure I can agree it's also valid as the parameter of `FileStreamSink.hasMetadata()`.
   
   I can't come up with the "correct" behavior and return value when a glob path is provided. Let's say there are `/output/a` and `/output/b`, and only `/output/b` was created by a streaming query, so only it has a metadata directory.
   
   When we provide `/output/*` as the path, what would we expect? I see three possible approaches:
   
   1) Leverage the metadata in `/output/b` for reading `/output/b`, and read `/output/a` via listing. Sounds ideal, but I'm not sure Spark does this today.
   2) Leverage the metadata in `/output/b` for reading `/output/b`; `/output/a` is silently ignored.
   3) Don't leverage the metadata in `/output/b` and read both directories via listing.
   
   I'm not sure which one Spark does now, but one thing is clear to me: reasoning about the return value of `FileStreamSink.hasMetadata("/output/*")` for the above case is very hard unless the method always returns false for a glob path. If Spark expanded the glob into the matching paths and handled each of them individually (that is, the method were only ever called with non-glob paths), the result would be pretty clear. Otherwise, I'm confused about what we are expecting.
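   
   For illustration, here is a minimal sketch (not code from this PR) of the expand-first idea in option 1, so that `hasMetadata` would only ever see concrete paths; `expandGlob` is a hypothetical helper built on `FileSystem.globStatus`:
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   
   object GlobExpansionSketch {
     // Hypothetical helper: expand a (possibly glob) pattern into the concrete
     // directories it matches, using the only FS call that understands globs.
     def expandGlob(pattern: String, conf: Configuration): Seq[Path] = {
       val p = new Path(pattern)
       val fs = p.getFileSystem(conf)
       Option(fs.globStatus(p)).toSeq.flatten.filter(_.isDirectory).map(_.getPath)
     }
     // If hasMetadata were then called once per expanded path (e.g. /output/a and
     // /output/b separately), its return value would be easy to reason about.
   }
   ```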




[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r624358951



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -43,12 +43,20 @@ object FileStreamSink extends Logging {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
+        try {
+          val fs = hdfsPath.getFileSystem(hadoopConf)
+          if (fs.isDirectory(hdfsPath)) {
+            val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+            fs.exists(metadataPath)
+          } else {
+            false
+          }
+        } catch {
+          case e: SparkException => throw e

Review comment:
       Ah OK I've missed that `checkEscapedMetadataPath` throws SparkException. My bad.






[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823156676


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42210/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823156676


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42210/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-811998230


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41402/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823151934


   **[Test build #137683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137683/testReport)** for PR 31638 at commit [`1c26b2a`](https://github.com/apache/spark/commit/1c26b2a93c412559897d63229727315ff493fc98).




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-806923189


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41109/
   




[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r585636401



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       It's a really reasonable concern. Let me share more details about the context:
   - It's a regression introduced by SPARK-26824 (after that fix, the `isDirectory` and `exists` checks moved out of the try/catch block). In our user's case, the exception was thrown only for the long glob path; the same code passed in Spark 2.4.
   - The exception was thrown from the `isDirectory` check. However, we can't be sure it's always the same behavior, since `FileSystem` has different implementations on different systems (we hit this on a non-HDFS file system).
   
   So based on the two points above, I chose the safer way to fix this issue. The current fix (a new config plus handling of the glob path exception) carefully restores the behavior before SPARK-26824, without introducing other behavior changes.
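   
   For readers without the 2.4 code at hand, a rough sketch of the pre-SPARK-26824 shape being referred to, reconstructed from this discussion rather than copied from the 2.4 source: the whole check sat inside a try/catch, so any filesystem error simply meant "no metadata":
   
   ```scala
   import scala.util.control.NonFatal
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   
   object OldHasMetadataSketch {
     // Any error while probing the path is swallowed and the directory is
     // treated as a plain batch output, i.e. no _spark_metadata is assumed.
     def hasMetadata(singlePath: String, hadoopConf: Configuration): Boolean = {
       try {
         val hdfsPath = new Path(singlePath)
         val fs = hdfsPath.getFileSystem(hadoopConf)
         fs.isDirectory(hdfsPath) && fs.exists(new Path(hdfsPath, "_spark_metadata"))
       } catch {
         case NonFatal(_) => false
       }
     }
   }
   ```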






[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824072688


   **[Test build #137734 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137734/testReport)** for PR 31638 at commit [`867363f`](https://github.com/apache/spark/commit/867363fb95aa95617aa9ee8a08d55a39c50d961f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-802038630


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40794/
   




[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-817454718


   As we agreed, let's focus on fixing the regression only. The flag brings a side effect that is actually new functionality (ignoring `_spark_metadata`), which isn't related to the regression. If we think the new functionality is worth having, let's provide it as a new source option, not a flag.
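   
   For contrast, a hypothetical example of what a per-source option would look like from the user side; the option name `ignoreFileStreamSinkMetadata` is invented here purely to illustrate the option-vs-flag distinction, and an existing SparkSession `spark` is assumed:
   
   ```scala
   // Scoped to this one read instead of the whole session (hypothetical option name).
   val df = spark.read
     .option("ignoreFileStreamSinkMetadata", "true")
     .parquet("/output/dir")
   ```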




[GitHub] [spark] xuanyuanking commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-788974534


   > The behavioral change is bound to file data source, right? I prefer adding the source option instead of adding config, because 1) Spark has a bunch of configurations already 2) I'd prefer having smaller range of impact, per source instead of session-wide.
   
   Agree. Especially on 1), I fully agree we should only add new configs when they're really needed. It's my fault for not providing more context, as mentioned in https://github.com/apache/spark/pull/31638#discussion_r585636401. Actually, the same user code passes in version 2.4 but fails now. If we add a source option, users would need code changes to control the behavior, which might not be a user-friendly fix IMO. If this were new behavior, I would totally follow your suggestion to add an option instead of a config.
   
   > I'd love to see the proper notice on such case as well since we are here.
   
   Yes, thanks for the reminder. I'll add warning logs for the cases you mentioned in the proper places (such as where `hasMetadata` returns false), as well as in the SS user guide (do you think the user guide is also a reasonable place?). Do you prefer to do this in the current PR or a separate one?




[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r601526083



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,35 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {
+      path match {
+        case Seq(singlePath) =>
+          val hdfsPath = new Path(singlePath)
+          if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
+            // Since file streaming sink doesn't support glob path, we can assume there is no
+            // metadata directory.
+            return false
+          }
+          try {
+            val fs = hdfsPath.getFileSystem(hadoopConf)
+            if (fs.isDirectory(hdfsPath)) {
+              val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+              fs.exists(metadataPath)
+            } else {
+              false
+            }
+          } catch {
+            case NonFatal(e) =>
+              logError(s"Error while looking for metadata directory in the path: $singlePath. " +
+                "If we want to skip the check and assume the path is not generated by the file " +

Review comment:
       Sure, done in 11f6bf0.






[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823149624


   **[Test build #137682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137682/testReport)** for PR 31638 at commit [`7c40405`](https://github.com/apache/spark/commit/7c404057827b679bbd47f5a01e5994c8ccabf471).




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833395840








[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r584216596



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       I think this can be simplified once we answer: do `isDirectory()` and `exists()` work with a glob path?
   
   If these methods don't support glob paths (and I expect they don't), we can check for a glob path up front instead of dealing with the exception. That would also remove the need for readers to know that "these methods can fail when the path is a glob and the path is too long".
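   
   A minimal sketch of the up-front check being suggested (the same shape the PR's later diffs adopt), assuming Spark-internal access to the `SparkHadoopUtil.get.isGlobPath` helper referenced elsewhere in this thread:
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   import org.apache.spark.deploy.SparkHadoopUtil
   
   object GlobCheckFirstSketch {
     def hasMetadata(singlePath: String, hadoopConf: Configuration): Boolean = {
       val hdfsPath = new Path(singlePath)
       if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
         // The file streaming sink never writes to a glob path, so a glob cannot
         // have a _spark_metadata directory; skip the filesystem calls entirely.
         false
       } else {
         val fs = hdfsPath.getFileSystem(hadoopConf)
         fs.isDirectory(hdfsPath) && fs.exists(new Path(hdfsPath, "_spark_metadata"))
       }
     }
   }
   ```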






[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-807033951


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41109/
   




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833478498


   **[Test build #138199 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138199/testReport)** for PR 31638 at commit [`867363f`](https://github.com/apache/spark/commit/867363fb95aa95617aa9ee8a08d55a39c50d961f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] steveloughran commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
steveloughran commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r594453559



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       The only FS API call which does glob expansion is `globStatus`. I don't really think that API should have been in the FileSystem at all (it's more of a layer above), but it's there, and it has the tree-walking scale problems on cloud stores to match. Everything else takes the path as-is.
   Even there, handling of "non-valid" characters in a path is trouble. There's consensus that "/" is forbidden in a filename, but how to handle ":"? A long-standing source of pain.
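   
   A small sketch contrasting the two behaviors described above: `globStatus` expands the pattern, while `isDirectory`/`exists` (built on `getFileStatus`, per the quote later in this thread) treat the same string as a literal path name, so their outcome for a pattern depends on the FileSystem implementation:
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   
   object GlobVsLiteralSketch {
     def main(args: Array[String]): Unit = {
       val conf = new Configuration()
       val pattern = new Path("/output/2021-*")
       val fs = pattern.getFileSystem(conf)
   
       // The only call that expands the glob; may be null/empty when nothing matches.
       Option(fs.globStatus(pattern)).toSeq.flatten.foreach(s => println(s.getPath))
   
       // No expansion: "2021-*" is looked up as a literal name, so this returns
       // false on some filesystems and throws on others.
       println(fs.isDirectory(pattern))
     }
   }
   ```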






[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823155907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137683/
   




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833359841


   **[Test build #138204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138204/testReport)** for PR 31638 at commit [`fa42b0f`](https://github.com/apache/spark/commit/fa42b0fb48270ba75e454f37b7ae654edd296696).




[GitHub] [spark] srowen commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r606227358



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1681,6 +1681,15 @@ object SQLConf {
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefault(TimeUnit.MINUTES.toMillis(10)) // 10 minutes
 
+  val FILE_SINK_FORMAT_CHECK_ENABLED =

Review comment:
       Do we really need another flag for this?






[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r602650907



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -43,12 +44,23 @@ object FileStreamSink extends Logging {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
+        if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {

Review comment:
       I'm just thinking out loud; I'm OK with fixing the regression first and revisiting later whether we still want to make the behavior clearer to reason about, despite the fact that we might break the existing behavior.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823153961


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137682/
   




[GitHub] [spark] xuanyuanking commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-785283603


   cc @zsxwing @HeartSaVioR @bozhang2820 @viirya @Ngone51 




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833512308


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138204/
   




[GitHub] [spark] xuanyuanking commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823151538


   @HeartSaVioR Agree. The update is done. Let's fix the regression in this PR first.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-807033951


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41109/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823158491


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42211/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824050253


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42261/
   




[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824016386


   **[Test build #137734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137734/testReport)** for PR 31638 at commit [`867363f`](https://github.com/apache/spark/commit/867363fb95aa95617aa9ee8a08d55a39c50d961f).




[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833512308


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138204/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824007338


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137726/
   




[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r601525804



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       ```
   Let's deal with another JIRA issue / PR so that we can continue discussion afterwards.
   ```
   Yep, thanks for the advice; done in 11f6bf0.
   (That's actually what I meant by `I'll create another PR for the behavior change caused by SPARK-26824`; I think I didn't express it clearly enough...)






[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-785463113


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135434/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833501326


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138199/
   




[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r607632134



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1681,6 +1681,15 @@ object SQLConf {
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefault(TimeUnit.MINUTES.toMillis(10)) // 10 minutes
 
+  val FILE_SINK_FORMAT_CHECK_ENABLED =

Review comment:
       I think so. As explained in the comment at https://github.com/apache/spark/pull/31638#discussion_r585636401, it's a behavior change introduced by SPARK-26824, which has already shipped in Spark 3.0. So a new flag might be the proper way to keep both behaviors available with some flexibility.
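   
   For reference, a minimal sketch of what such a flag definition could look like inside `SQLConf.scala`, following the pattern of the neighboring entries; the doc text and version below are assumptions, not the PR's final wording:
   
   ```scala
   val FILE_SINK_FORMAT_CHECK_ENABLED =
     buildConf("spark.sql.streaming.fileSink.formatCheck.enabled")  // key name as proposed in this PR
       .doc("When true, check whether the path being read was written by the file " +
         "streaming sink (i.e. has a _spark_metadata directory). When false, always " +
         "read the path as a plain batch output.")
       .version("3.2.0")
       .booleanConf
       .createWithDefault(true)
   ```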






[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-811998230


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41402/
   




[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-812224204


   I guess we still haven't decided on the following:
   
   > Just one thing I think we'd like to make clear is the flag on the behavior. Since we are sure we are fixing a regression here, I wonder whether we need to introduce a flag for that (that was my concern about the initial shape of the PR); would someone really want to opt out of the regression fix and keep the behavior as it is?




[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r593979867



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       cc @steveloughran @sunchao, as they can provide great input from the Hadoop FileSystem point of view.
   
   The summary of the comment thread: is the behavior defined when a "glob path" is provided to `isDirectory()` and `exists()`? If it is defined, what is the expected behavior?






[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823155869


   **[Test build #137683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137683/testReport)** for PR 31638 at commit [`1c26b2a`](https://github.com/apache/spark/commit/1c26b2a93c412559897d63229727315ff493fc98).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] xuanyuanking commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-811948288


   Thanks for the help and discussion. I revived the first commit with more logging, following the recent comments. Please check whether it makes sense to you now.




[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-789613489


   OK, now I understand the rationale of the PR. I agree it shouldn't be a new source option if we just want to fix the exposed problem rather than promote it to a supported option. I had felt we were adding this behavior as a new available option, but now it looks like we are not.
   
   I'm still concerned that we are living with a huge number of flags; I understand that any behavior fix carries the risk of breaking some workload, but we also need to explore all options to see whether we can fix the issue without introducing a flag. Let's hear some more voices to see whether we can fix this safely, or whether we have to go with a flag.




[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r594940759



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       Many thanks @sunchao @steveloughran for the great input! Thanks @HeartSaVioR for the help!
   ```
   Internally they all call getFileStatus which returns information for a single file
   ```
   @sunchao That's right. The issue we are facing is that a non-HDFS file system's implementation of `getFileStatus` throws an exception for a long glob pattern. However, before SPARK-26824, Spark caught all exceptions here and simply returned false; after SPARK-26824, the same code fails, and the user complained that it's a regression.
   
   @steveloughran Yep, that's the point. The problematic one is a cloud store's implementation of FileSystem...
   






[GitHub] [spark] zsxwing commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
zsxwing commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-808413955


   > When we provide `/output/*` as the path, what would we expect?
   
   Currently Spark does (3): it doesn't leverage the metadata in `/output/b` and reads both directories via listing. Ideally we should respect the metadata in each directory, but I cannot find a simple way to do that.
   
   Ideally, we would provide different APIs for a normal path and a glob path, but since we have mixed two different concepts into one parameter, we cannot take that back. I'd suggest setting the goal to fixing the regression from 3.0 ("when a glob path is valid but we cannot call getFileStatus with it, how do we let users access the batch output?") and avoiding changes to any other existing behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r600114628



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,35 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {
+      path match {
+        case Seq(singlePath) =>
+          val hdfsPath = new Path(singlePath)
+          if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
+            // Since file streaming sink doesn't support glob path, we can assume there is no
+            // metadata directory.
+            return false
+          }
+          try {
+            val fs = hdfsPath.getFileSystem(hadoopConf)
+            if (fs.isDirectory(hdfsPath)) {
+              val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+              fs.exists(metadataPath)
+            } else {
+              false
+            }
+          } catch {
+            case NonFatal(e) =>
+              logError(s"Error while looking for metadata directory in the path: $singlePath. " +
+                "If we want to skip the check and assume the path is not generated by the file " +

Review comment:
       Once we exclude the functionality of ignoring metadata from this PR, please remove this sentence as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r619916834



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -43,12 +43,20 @@ object FileStreamSink extends Logging {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
+        try {
+          val fs = hdfsPath.getFileSystem(hadoopConf)
+          if (fs.isDirectory(hdfsPath)) {
+            val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+            fs.exists(metadataPath)
+          } else {
+            false
+          }
+        } catch {
+          case e: SparkException => throw e

Review comment:
       This looks different from 2.4:
   
   https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
   
   Do you have specific exceptions to catch here? I can't imagine any SparkException coming from the logic above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824088615


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137734/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r611302985



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -1681,6 +1681,15 @@ object SQLConf {
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefault(TimeUnit.MINUTES.toMillis(10)) // 10 minutes
 
+  val FILE_SINK_FORMAT_CHECK_ENABLED =

Review comment:
       I'm not in favor of having flags and would like to avoid adding a flag at all. When we add a flag, we are adding a "branch" to maintain. The purpose of this PR is to "fix" a regression introduced in Spark 3.0.0 - do we really want to retain the current behavior even though we define it as a "regression"? If so, can we make it clear that both behaviors have valid use cases?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR closed pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #31638:
URL: https://github.com/apache/spark/pull/31638


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823151934


   **[Test build #137683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137683/testReport)** for PR 31638 at commit [`1c26b2a`](https://github.com/apache/spark/commit/1c26b2a93c412559897d63229727315ff493fc98).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zsxwing commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
zsxwing commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r601648719



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -43,12 +44,23 @@ object FileStreamSink extends Logging {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
+        if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {

Review comment:
       We cannot do this. For example, the following test works today, but if you change this, it will break.
   
   ```
     test("foo") {
       withTempDir { tempDir =>
         val path = new java.io.File(tempDir, "[]")
         val chk = new java.io.File(tempDir, "chk")
         val q = spark.readStream.format("rate").load().writeStream.format("parquet")
           .trigger(org.apache.spark.sql.streaming.Trigger.Once())
           .option("checkpointLocation", chk.getCanonicalPath)
           .start(path.getCanonicalPath)
         q.awaitTermination()
         assert(SparkHadoopUtil.get.isGlobPath(new Path(path.getCanonicalPath)))
         val fileIndex = spark.read.format("parquet").load(path.getCanonicalPath)
           .queryExecution.executedPlan.collectFirst {
           case f: FileSourceScanExec => f.relation.location
         }.head
         assert(fileIndex.isInstanceOf[MetadataLogFileIndex])
       }
     }
   ```
   
   The confusing part is that Hadoop uses the same class `Path` for both normal paths and glob paths, but that's not something we can change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824050215






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823158491


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42211/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823942366


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42253/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-787221374


   The behavioral change is bound to the file data source, right? I prefer adding a source option instead of a config, because 1) Spark already has a bunch of configurations, and 2) I'd prefer a smaller range of impact: per source instead of session-wide.
   
   Also, the option should be used carefully (at the very least we should make this clear to end users), because the metadata in the file stream sink output ensures that only the "correctly written" files are read. What would happen if they ignore it?
   
   1) Files not listed in the metadata could be corrupted, and these files would now be read. Users may also need to turn on the "ignore corrupted" flag.
   
   2) They can no longer consider the output directory "exactly once"; it becomes "at least once", meaning output rows can be written multiple times. Without deduplication or corresponding logic on the read side, this may produce incorrect output on the read side (see the sketch below).
   
   Technically this is an existing issue when a batch query reads from multiple directories or a glob path that includes the output directory of a file stream sink. (That said, users could use a glob path as a workaround without adding the new configuration, though I agree an explicit config is more intuitive.) I'd love to see a proper notice on such cases as well while we are here.
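   
   For illustration, a minimal sketch of the read-side consequence in (2), assuming an active `spark` session; the directory path and the `eventId` key column are placeholders, not from the PR:
   
   ```scala
   // Placeholders: the path and the "eventId" column are assumptions for the example.
   val raw = spark.read.parquet("/output/streaming-sink-dir")

   // Without the sink metadata, duplicate rows from re-attempted batches may be visible,
   // so a read-side deduplication step (or equivalent logic) becomes necessary.
   val deduped = raw.dropDuplicates("eventId")
   ```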


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-802030050


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40794/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-806839568


   **[Test build #136525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136525/testReport)** for PR 31638 at commit [`11f6bf0`](https://github.com/apache/spark/commit/11f6bf0adfa2bfec1b12f0262f68c5b5a8868eee).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823907351


   **[Test build #137726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137726/testReport)** for PR 31638 at commit [`8bd9297`](https://github.com/apache/spark/commit/8bd9297da4a0873e0e0bfbc420d215918eae6e96).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-808072941


   A glob path is valid as the path of a data source, but I'm not sure I can agree that it's also valid as the parameter of `FileStreamSink.hasMetadata()`.
   
   I can't imagine the "correct" behavior and return value when a glob path is provided. Let's say there are `/output/a` and `/output/b`, and only `/output/b` was created by a streaming query, so only it has a metadata directory.
   
   When we provide `/output/*` as a glob path, what would we expect? I see three possible behaviors (see the sketch of this setup below):
   
   1) Leverage the metadata in `/output/b` for reading `/output/b`, and read `/output/a` via listing. Sounds ideal, but I'm not sure Spark does this now.
   2) Leverage the metadata in `/output/b` for reading `/output/b`; `/output/a` is silently ignored.
   3) Don't leverage the metadata in `/output/b` and read both directories via listing.
   
   I'm not sure which one Spark does now, but one thing is clear to me: reasoning about the return value of `FileStreamSink.hasMetadata("/output/*")` for the above case is very hard unless it simply always returns false for glob paths. If Spark expanded the glob path into the matching paths and handled each path individually (that is, the method would only be called with non-glob paths), the result would be pretty clear. Otherwise, I'm confused about what we are expecting.
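   
   For concreteness, a sketch of that setup, assuming an active `spark` session; the paths, the checkpoint location, and the use of the rate source are placeholders: `/output/a` is plain batch output, while `/output/b` is written by the file streaming sink and therefore contains a `_spark_metadata` directory.
   
   ```scala
   import org.apache.spark.sql.streaming.Trigger

   // Batch output: no _spark_metadata directory.
   spark.range(10).write.parquet("/output/a")

   // Streaming output: the file sink writes /output/b/_spark_metadata.
   val q = spark.readStream.format("rate").load()
     .writeStream.format("parquet")
     .option("checkpointLocation", "/tmp/chk")
     .trigger(Trigger.Once())
     .start("/output/b")
   q.awaitTermination()

   // The open question above: which of the three behaviors should this read get?
   val df = spark.read.parquet("/output/*")
   ```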


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823907351


   **[Test build #137726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137726/testReport)** for PR 31638 at commit [`8bd9297`](https://github.com/apache/spark/commit/8bd9297da4a0873e0e0bfbc420d215918eae6e96).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r602649700



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -43,12 +44,23 @@ object FileStreamSink extends Logging {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
+        if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {

Review comment:
       The expectation on the user side is important here; they clearly specify that the input path is a "glob path". It seems easier to reason about the behavior if we simply ignore metadata when the input path is a glob path.
   I'm not clear on what exactly breaks here, but I think there could be arguments that the file index should be a MetadataLogFileIndex. To me, listing files even in this case is consistent with the cases where a glob path matches multiple directories.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r596921444



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {
+      path match {
+        case Seq(singlePath) =>
+          val hdfsPath = new Path(singlePath)
+          try {
+            val fs = hdfsPath.getFileSystem(hadoopConf)
+            if (fs.isDirectory(hdfsPath)) {
+              val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+              fs.exists(metadataPath)
+            } else {
+              false
+            }
+          } catch {
+            case NonFatal(e) if SparkHadoopUtil.get.isGlobPath(hdfsPath) =>
+              // If this is a glob path, the failure is likely because the path is too long.
+              // Since file streaming sink doesn't support glob path, we can assume there is no
+              // metadata directory.
+              logWarning(s"Assume no metadata directory. Error while looking for " +

Review comment:
       Makes sense. I included the path in the log message in d7abb94.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-806971531


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41109/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-807268567


   **[Test build #136525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136525/testReport)** for PR 31638 at commit [`11f6bf0`](https://github.com/apache/spark/commit/11f6bf0adfa2bfec1b12f0262f68c5b5a8868eee).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r596922353



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala
##########
@@ -575,6 +576,43 @@ abstract class FileStreamSinkSuite extends StreamTest {
       }
     }
   }
+
+  test("formatCheck flag") {
+    withSQLConf(
+      "fs.file.impl" -> classOf[FailFormatCheckFileSystem].getName,
+      "fs.file.impl.disable.cache" -> "true") {
+      withTempDir { tempDir =>
+        val path = new File(tempDir, "text").getCanonicalPath
+        Seq("foo").toDF.write.format("text").save(path)
+        def assertFail(): Unit = {
+          val e = intercept[IOException] {
+            spark.read.format("text").load(path)
+          }
+          assert(e.getMessage == "cannot access metadata log")

Review comment:
       Thanks, changed it to `contains` in d7abb94.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833510820


   **[Test build #138204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138204/testReport)** for PR 31638 at commit [`fa42b0f`](https://github.com/apache/spark/commit/fa42b0fb48270ba75e454f37b7ae654edd296696).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-785296286


   **[Test build #135434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135434/testReport)** for PR 31638 at commit [`743b19b`](https://github.com/apache/spark/commit/743b19b2198c6ea16b64f42c97ac2809efc94967).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r594942022



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       @HeartSaVioR I will change this PR to resolve only the glob path issue, as we discussed. (It was my fault to mix the two things together.)
   I'll create another PR for the behavior change caused by SPARK-26824.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r596927316



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       Thanks for the summary. I addressed it in d7abb94.
   Also, I tried adding the config to the exception branch and error message, which gives users a way to fall back to the behavior before SPARK-26824. Please check whether adding a config in this way makes sense to you.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833398593


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42725/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR closed pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #31638:
URL: https://github.com/apache/spark/pull/31638


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r594060528



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       No, `isDirectory` and `exists` don't accept a glob path; they will just say "File does not exist". Internally they both call `getFileStatus`, which returns information for a _single_ file, e.g. file length, permissions, whether it is a directory or a file, etc.
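   
   As a quick illustration of this (a sketch with a placeholder local path; `globStatus` is the Hadoop API that actually expands globs):
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path

   val globPath = new Path("file:///tmp/data/*")          // placeholder path
   val fs = globPath.getFileSystem(new Configuration())

   // exists/isDirectory go through getFileStatus, which treats '*' as a literal
   // path component, so this is false even if /tmp/data contains files.
   println(fs.exists(globPath))

   // globStatus expands the pattern and returns the matching statuses (or null).
   println(Option(fs.globStatus(globPath)).map(_.length))
   ```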




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r596922117



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {
+      path match {
+        case Seq(singlePath) =>
+          val hdfsPath = new Path(singlePath)
+          try {
+            val fs = hdfsPath.getFileSystem(hadoopConf)
+            if (fs.isDirectory(hdfsPath)) {
+              val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
+              fs.exists(metadataPath)

Review comment:
       existence is all that matters




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833342708


   Sure. Many thanks for the reminder! Rebased to the latest master and force pushed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823155907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137683/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-823158464


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42211/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31638:
URL: https://github.com/apache/spark/pull/31638#discussion_r594964717



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##########
@@ -40,17 +41,31 @@ object FileStreamSink extends Logging {
    * be read.
    */
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
-    path match {
-      case Seq(singlePath) =>
-        val hdfsPath = new Path(singlePath)
-        val fs = hdfsPath.getFileSystem(hadoopConf)
-        if (fs.isDirectory(hdfsPath)) {
-          val metadataPath = getMetadataLogPath(fs, hdfsPath, sqlConf)
-          fs.exists(metadataPath)
-        } else {
-          false
-        }
-      case _ => false
+    if (sqlConf.getConf(SQLConf.FILE_SINK_FORMAT_CHECK_ENABLED)) {

Review comment:
       OK. So let's sort it out:
   
   1) filter out the glob path first and simply return false as an "early return"
   2) restore the try-catch so the method tolerates non-fatal exceptions
   
   I'm OK with either doing both here, or one at a time. (A rough sketch of the combined shape is below.)
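   
   For reference, a rough sketch of what combining (1) and (2) could look like inside `FileStreamSink.hasMetadata` (a sketch only, not the exact patch under review; it reuses `SparkHadoopUtil.isGlobPath`, `getMetadataLogPath`, and `logWarning` from the surrounding code):
   
   ```scala
   def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean = {
     path match {
       case Seq(singlePath) =>
         val hdfsPath = new Path(singlePath)
         if (SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
           // (1) early return: the file streaming sink never writes to a glob path
           false
         } else {
           try {
             val fs = hdfsPath.getFileSystem(hadoopConf)
             fs.isDirectory(hdfsPath) && fs.exists(getMetadataLogPath(fs, hdfsPath, sqlConf))
           } catch {
             case NonFatal(e) =>
               // (2) tolerate non-fatal failures from the underlying FileSystem
               logWarning(s"Assuming no metadata directory. Error while checking $singlePath", e)
               false
           }
         }
       case _ => false
     }
   }
   ```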




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833318356


   **[Test build #138199 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138199/testReport)** for PR 31638 at commit [`867363f`](https://github.com/apache/spark/commit/867363fb95aa95617aa9ee8a08d55a39c50d961f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-802004242


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136212/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-785463113


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135434/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833398593


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42725/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833298347


   Sorry I missed this. I'll try to close and reopen the PR myself to see whether a committer can retrigger the GitHub Actions test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833301868


   @xuanyuanking I'm sorry, but I need your help retriggering GitHub Actions. Could you please rebase onto master or push an empty commit to update the PR? Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-824016386


   **[Test build #137734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137734/testReport)** for PR 31638 at commit [`867363f`](https://github.com/apache/spark/commit/867363fb95aa95617aa9ee8a08d55a39c50d961f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Skip checking glob path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-808613061


   OK, I agree with the direction of fixing only the "regression", given that we have no clear solution for the root cause or the overall problem.
   
   Just one thing I think we'd like to make clear: the flag on the behavior. As we are sure we are fixing a regression here, I wonder whether we need to introduce a flag for it at all (that was my concern with the initial shape of the PR); would someone really want to opt out of the regression fix and keep the behavior as it is?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31638: [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-833359841


   **[Test build #138204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138204/testReport)** for PR 31638 at commit [`fa42b0f`](https://github.com/apache/spark/commit/fa42b0fb48270ba75e454f37b7ae654edd296696).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org