Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/17 07:10:49 UTC

[GitHub] [spark] FatalLin opened a new pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

FatalLin opened a new pull request #32202:
URL: https://github.com/apache/spark/pull/32202


   ### What changes were proposed in this pull request?
   I thought that the reader didn't scan the files under sub-directory even the configurtions such like "hive.mapred.supports.subdirectories" or "mapred.input.dir.recursive" is setted, so I developed a function which will list all the sub-directories under the root path when "hive.mapred.supports.subdirectories" is on.
   
   ### Why are the changes needed?
   Hive already has configurations to handle this case, but the built-in reader does not honor them.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, maybe we could add this option in documents to notice users for the enhancement.
   
   ### How was this patch tested?
   New tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822583926


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42163/
   



[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822019164


   > @FatalLin
   > Some more thoughts/question:
   > 
   > 1. Why are there two configs in Hive for this?
   > 
   > * mapred.input.dir.recursive
   > * hive.mapred.supports.subdirectories
   > 
   > 2. What does Hive do when only one is true? If they are both needed, we need to check both too!
   > 3. Please update the title: skip the part "when configuration is enable" and reword the rest.
   >    What about "Supporting non-partitioned Hive tables with subdirectories"?
   > 4. Please update the description, too. In "What changes were proposed in this pull request?" it's enough if you explain the title a bit more. I suggest using a spell checker to avoid errors like: setted => set, configurtions => configurations.
   >    Please note the PR description is extremely important, as after the PR is merged it will be the commit message.
   > 5. At "Does this PR introduce any user-facing change?" elaborate on the impact of this change. Remove the "maybe we could add this option in documents to notice users for the enhancement." which I think is a good idea and should be part of this PR.
   
   Got it. This is a great help, much appreciated!
   I'll address all the points you mentioned.



[GitHub] [spark] attilapiros edited a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros edited a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821806848


   @FatalLin Thanks for your contribution! Welcome here! 
   
   This is not my focus area but I have added some comments. So let's cc. some more competent developers:
   @dongjoon-hyun @viirya 
   
   But I can enable the testing for you.
   
   ok to test



[GitHub] [spark] HyukjinKwon commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-849318208


   Can we use https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#recursive-file-lookup?



[GitHub] [spark] SparkQA removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822298426


   **[Test build #137598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137598/testReport)** for PR 32202 at commit [`1818fc5`](https://github.com/apache/spark/commit/1818fc51fff603f35e3281659685ecdf372199d5).



[GitHub] [spark] attilapiros commented on a change in pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros commented on a change in pull request #32202:
URL: https://github.com/apache/spark/pull/32202#discussion_r615240660



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -277,6 +278,31 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(output = newOutput)
   }
 
+  private def getDirectoryPathSeq(rootPath: Path): Seq[String] = {
+    val enableSupportSubDirectories =
+      sparkSession.conf.getOption("hive.mapred.supports.subdirectories")
+
+    if (enableSupportSubDirectories.isDefined && enableSupportSubDirectories.get.toBoolean) {
+      val fs = rootPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
+      val paths = new scala.collection.mutable.ListBuffer[String]
+
+      val checkingQueue = new mutable.Queue[Path]()
+      checkingQueue.enqueue(rootPath)
+      while (!checkingQueue.isEmpty) {
+        val path = checkingQueue.dequeue()
+        paths.append(path.toString)
+        fs.listStatus(path).foreach(fileStatus => {
+          if (fileStatus.isDirectory) {
+            checkingQueue.enqueue(fileStatus.getPath)
+          }
+        })
+      }
+      paths

Review comment:
       Why not use the existing `SparkHadoopUtil.listLeafDirStatuses(fs, rootPath)`?
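
The difference between the two listings can be illustrated with a minimal sketch. It is not Spark's implementation and does not use the Hadoop `FileSystem` API: it assumes a local directory tree and uses `java.io.File` as a stand-in, and the names `DirListing`, `allDirs`, and `leafDirs` are hypothetical. `allDirs` follows the queue-based loop in the diff above (root plus every directory below it), while `leafDirs` keeps only directories that contain no further subdirectories, which is how a "leaf directory" listing is assumed to behave here.

```scala
import java.io.File
import scala.collection.mutable

object DirListing {
  // Immediate child directories of `dir`, sorted by name for deterministic output.
  // listFiles() returns null when `dir` cannot be listed, hence the Option wrapper.
  private def childDirs(dir: File): List[File] =
    Option(dir.listFiles()).map(_.toList).getOrElse(Nil)
      .filter(_.isDirectory)
      .sortBy(_.getName)

  // Breadth-first walk returning the root and every directory below it,
  // mirroring the queue-based loop in the diff (assumption: local file system).
  def allDirs(root: File): List[File] = {
    val found = mutable.ListBuffer.empty[File]
    val queue = mutable.Queue(root)
    while (queue.nonEmpty) {
      val dir = queue.dequeue()
      found += dir
      queue ++= childDirs(dir)
    }
    found.toList
  }

  // Only the directories that have no subdirectories of their own.
  def leafDirs(root: File): List[File] =
    allDirs(root).filter(dir => childDirs(dir).isEmpty)
}
```

For a tree `root/a/b` and `root/c`, `allDirs` would include `root` and `a` as well, whereas `leafDirs` would return only `c` and `b`. Which of the two the reader needs depends on whether data files may also sit directly in intermediate directories.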





[GitHub] [spark] SparkQA removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822170465


   **[Test build #137574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137574/testReport)** for PR 32202 at commit [`eb56adb`](https://github.com/apache/spark/commit/eb56adbb0348334e4ad10413fd72e73053eb901b).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821815706







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823182821


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42212/
   



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821819323


   **[Test build #137505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137505/testReport)** for PR 32202 at commit [`eb56adb`](https://github.com/apache/spark/commit/eb56adbb0348334e4ad10413fd72e73053eb901b).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822416766


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137598/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822257600


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137574/
   



[GitHub] [spark] SparkQA removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821810762


   **[Test build #137505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137505/testReport)** for PR 32202 at commit [`eb56adb`](https://github.com/apache/spark/commit/eb56adbb0348334e4ad10413fd72e73053eb901b).



[GitHub] [spark] chong0929 commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

chong0929 commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-846708661


   I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't it changed in this action



[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822567233


   > > > But in Spark, the operation only happens in spark-sql, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros
   > > 
   > > 
   > > The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.
   > 
   > got it, I'll check both configs, thanks!
   
   After some consideration, I decided to add a new config to replace the Hive configs we mentioned earlier.
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-820927674


   Can one of the admins verify this patch?



[GitHub] [spark] chong0929 edited a comment on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

chong0929 edited a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-846708661


   I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't it changed in this action?



[GitHub] [spark] FatalLin edited a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin edited a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822210112

Regarding the configurations "mapred.input.dir.recursive" and "hive.mapred.supports.subdirectories", I found a brief description in the Hive documentation:

> hive.mapred.supports.subdirectories
> Default Value: false
> Added In: Hive 0.10.0 with HIVE-3276
> Whether the version of Hadoop which is running supports sub-directories for tables/partitions. Many Hive optimizations can be applied if the Hadoop version supports sub-directories for tables/partitions. This support was added by MAPREDUCE-1501.
> (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties)

It looks like "mapred.input.dir.recursive" lets MapReduce read files from subdirectories, and "hive.mapred.supports.subdirectories" lets Hive apply some subdirectory-related optimizations. My first thought was that, since Hive and MapReduce are separate projects, it makes sense that each has its own configuration for this. But in Spark, the operation only happens in spark-sql, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros


[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821817648


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42079/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821819393


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137505/
   



[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822246144


   > > But in Spark, the operation only happens in spark-sql, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros
   > 
   > The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.
   
   got it, I'll check both configs, thanks!



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822257600


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137574/
   



[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-846711680


   > I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't it changed in this action?
   
   Do you mean it hits the same problem when the action is triggered with the Hive engine instead of the Spark native reader?
   I thought that could be handled with a Hive configuration such as "hive.mapred.supports.subdirectories".



[GitHub] [spark] SparkQA removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822536476


   **[Test build #137632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137632/testReport)** for PR 32202 at commit [`88afaf6`](https://github.com/apache/spark/commit/88afaf6e9d786dbcf78e529f240f066fe3dbb789).



[GitHub] [spark] attilapiros commented on a change in pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros commented on a change in pull request #32202:
URL: https://github.com/apache/spark/pull/32202#discussion_r615241566



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -277,6 +278,31 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(output = newOutput)
   }
 
+  private def getDirectoryPathSeq(rootPath: Path): Seq[String] = {
+    val enableSupportSubDirectories =
+      sparkSession.conf.getOption("hive.mapred.supports.subdirectories")

Review comment:
       I don't think "hive.mapred.supports.subdirectories" should be a new Spark config. 
   But it should be read from the hadoop configuration so use `hadoopConf.get(<>)` here.
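
A minimal sketch of the flag lookup, under stated assumptions: a plain `Map[String, String]` stands in for the Hadoop configuration so the example runs without Hadoop on the classpath, and the names `SubdirFlags`, `flag`, and `readSubdirectories` are hypothetical. With a real Hadoop `Configuration`, `getBoolean(name, defaultValue)` would play the role of `flag`. The sketch also requires both Hive-related keys discussed elsewhere in this thread to be on, which is one possible reading of "check both configs", not a confirmed design.

```scala
import scala.util.Try

object SubdirFlags {
  // Hypothetical stand-in for a Hadoop Configuration: plain key/value strings.
  type Conf = Map[String, String]

  // Read a boolean flag; a missing or malformed value falls back to `default`
  // instead of throwing, unlike calling `.get.toBoolean` on the raw option.
  def flag(conf: Conf, key: String, default: Boolean = false): Boolean =
    conf.get(key)
      .flatMap(value => Try(value.trim.toBoolean).toOption)
      .getOrElse(default)

  // Assumption: recurse into subdirectories only when both flags are enabled.
  def readSubdirectories(conf: Conf): Boolean =
    flag(conf, "hive.mapred.supports.subdirectories") &&
      flag(conf, "mapred.input.dir.recursive")
}
```

Defaulting to `false` keeps the existing behavior for users who never set either key.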
   





[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822787823


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137636/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821860620


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42081/
   



[GitHub] [spark] attilapiros edited a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros edited a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821806848


   @FatalLin Thanks for your contribution! Welcome here! 
   
   This is not my focus area, but I have added some comments. Let's cc some developers who know this area better:
   @dongjoon-hyun @viirya 
   
   But of course I can enable the testing for you.
   
   ok to test



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823346469


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137684/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823346469


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137684/
   



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823158600


   **[Test build #137684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137684/testReport)** for PR 32202 at commit [`7cc9c95`](https://github.com/apache/spark/commit/7cc9c959345ce28c3ae380a46388f049f9971ac7).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822536476


   **[Test build #137632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137632/testReport)** for PR 32202 at commit [`88afaf6`](https://github.com/apache/spark/commit/88afaf6e9d786dbcf78e529f240f066fe3dbb789).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822251710


   **[Test build #137574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137574/testReport)** for PR 32202 at commit [`eb56adb`](https://github.com/apache/spark/commit/eb56adbb0348334e4ad10413fd72e73053eb901b).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821817648


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42079/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822416766


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137598/
   



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822583342


   **[Test build #137636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137636/testReport)** for PR 32202 at commit [`72eae96`](https://github.com/apache/spark/commit/72eae960193077e0cb8c9de9cfd9525f8d68cdea).



[GitHub] [spark] SparkQA removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821852931


   **[Test build #137507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137507/testReport)** for PR 32202 at commit [`1818fc5`](https://github.com/apache/spark/commit/1818fc51fff603f35e3281659685ecdf372199d5).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821852931


   **[Test build #137507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137507/testReport)** for PR 32202 at commit [`1818fc5`](https://github.com/apache/spark/commit/1818fc51fff603f35e3281659685ecdf372199d5).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822415138


   **[Test build #137598 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137598/testReport)** for PR 32202 at commit [`1818fc5`](https://github.com/apache/spark/commit/1818fc51fff603f35e3281659685ecdf372199d5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821869693


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137507/
   



[GitHub] [spark] FatalLin commented on a change in pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on a change in pull request #32202:
URL: https://github.com/apache/spark/pull/32202#discussion_r615271527



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -277,6 +278,31 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(output = newOutput)
   }
 
+  private def getDirectoryPathSeq(rootPath: Path): Seq[String] = {
+    val enableSupportSubDirectories =
+      sparkSession.conf.getOption("hive.mapred.supports.subdirectories")
+
+    if (enableSupportSubDirectories.isDefined && enableSupportSubDirectories.get.toBoolean) {
+      val fs = rootPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
+      val paths = new scala.collection.mutable.ListBuffer[String]
+
+      val checkingQueue = new mutable.Queue[Path]()
+      checkingQueue.enqueue(rootPath)
+      while (!checkingQueue.isEmpty) {
+        val path = checkingQueue.dequeue()
+        paths.append(path.toString)
+        fs.listStatus(path).foreach(fileStatus => {
+          if (fileStatus.isDirectory) {
+            checkingQueue.enqueue(fileStatus.getPath)
+          }
+        })
+      }
+      paths

Review comment:
       Didn't notice the function before, will change it. Thanks.
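
For readers without a Hadoop setup, the breadth-first walk in the quoted hunk can be sketched on the local filesystem. This is only an illustration: `java.io.File` stands in for Hadoop's `FileSystem`, and the names `SubdirectoryListing` and `listDirectories` are hypothetical, not part of the PR or of Spark.

```scala
import java.io.File
import scala.collection.mutable

object SubdirectoryListing {
  /** Returns `root` followed by every directory below it, in breadth-first order. */
  def listDirectories(root: File): List[File] = {
    val found = mutable.ListBuffer.empty[File]
    val queue = mutable.Queue(root)
    while (queue.nonEmpty) {
      val dir = queue.dequeue()
      found += dir
      // listFiles() returns null when `dir` cannot be read, so guard against it.
      val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
      // Sort by name so the result does not depend on filesystem ordering.
      children.filter(_.isDirectory).sortBy(_.getName).foreach(d => queue.enqueue(d))
    }
    found.toList
  }
}
```

Plain files are skipped, so only directory paths are returned, as in the hunk. The sketch follows symbolic links, so a link cycle would make it loop; a real implementation would need to guard against that.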





[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-849649992


   > Can we use https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#recursive-file-lookup?
   
   It looks like this question has already been answered in another PR:
   https://github.com/apache/spark/pull/32679#discussion_r640499581



[GitHub] [spark] attilapiros commented on a change in pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros commented on a change in pull request #32202:
URL: https://github.com/apache/spark/pull/32202#discussion_r615241566



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -277,6 +278,31 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(output = newOutput)
   }
 
+  private def getDirectoryPathSeq(rootPath: Path): Seq[String] = {
+    val enableSupportSubDirectories =
+      sparkSession.conf.getOption("hive.mapred.supports.subdirectories")

Review comment:
       I don't think "hive.mapred.supports.subdirectories" should be a new Spark config.
   Instead, it should be read from the Hadoop configuration, so use `hadoopConf.get("hive.mapred.supports.subdirectories")` here.
   





[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822786704


   **[Test build #137636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137636/testReport)** for PR 32202 at commit [`72eae96`](https://github.com/apache/spark/commit/72eae960193077e0cb8c9de9cfd9525f8d68cdea).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] FatalLin commented on a change in pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on a change in pull request #32202:
URL: https://github.com/apache/spark/pull/32202#discussion_r615271561



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -277,6 +278,31 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(output = newOutput)
   }
 
+  private def getDirectoryPathSeq(rootPath: Path): Seq[String] = {
+    val enableSupportSubDirectories =
+      sparkSession.conf.getOption("hive.mapred.supports.subdirectories")

Review comment:
       got it.





[GitHub] [spark] github-actions[bot] closed pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #32202:
URL: https://github.com/apache/spark/pull/32202


   



[GitHub] [spark] github-actions[bot] commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-915655823


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!



[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821851862


   I don't understand why the Notify test workflow keeps failing with a 404 Not Found exception; I don't think anything I changed would cause it. Does anyone have an idea? Thanks.



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821869478


   **[Test build #137507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137507/testReport)** for PR 32202 at commit [`1818fc5`](https://github.com/apache/spark/commit/1818fc51fff603f35e3281659685ecdf372199d5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823181612


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42212/
   



[GitHub] [spark] SparkQA removed a comment on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823158600


   **[Test build #137684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137684/testReport)** for PR 32202 at commit [`7cc9c95`](https://github.com/apache/spark/commit/7cc9c959345ce28c3ae380a46388f049f9971ac7).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822583888


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42163/
   



[GitHub] [spark] chong0929 commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

chong0929 commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-847597979


   > > I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't it changed in this PR?
   > 
   > Do you mean it hits the same problem if we trigger the action with the Hive engine instead of the Spark native reader?
   > I thought that could be handled with a Hive configuration such as "hive.mapred.supports.subdirectories".
   
   I mean that an exception is thrown, “java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1”, when I use the Spark engine to read a partitioned Hive table with subdirectories.



[GitHub] [spark] attilapiros commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822245098


   > But in Spark, the operation only happens in spark-sql, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros
   
   The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine, I would expect to get the same answers from Hive and Spark when querying a non-partitioned table with subdirectories.
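
One way to read "check both configs" is sketched below. The two key names are the ones quoted in this thread and in the PR description. Everything else is an assumption made for illustration: the `SubdirectoryFlags` object, the `lookup` function (which returns null for an unset key, as Hadoop's `Configuration.get(String)` does), and in particular the choice to require both flags rather than either one.

```scala
object SubdirectoryFlags {
  // Key names as quoted in the thread and the PR description.
  val HiveSupportsSubdirs = "hive.mapred.supports.subdirectories"
  val MapredRecursive = "mapred.input.dir.recursive"

  /** True only when `key` is set and its value is "true" (case-insensitive). */
  private def isTrue(lookup: String => String, key: String): Boolean =
    Option(lookup(key)).exists(_.trim.equalsIgnoreCase("true"))

  /** Assumption: subdirectories are scanned only when BOTH flags are enabled. */
  def shouldScanSubdirectories(lookup: String => String): Boolean =
    isTrue(lookup, HiveSupportsSubdirs) && isTrue(lookup, MapredRecursive)
}
```

With a real Hadoop configuration the lookup could be passed as `key => hadoopConf.get(key)`. Unlike `value.toBoolean`, this parsing treats a missing or malformed value as false instead of throwing.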



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823182821


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42212/
   



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823332230


   **[Test build #137684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137684/testReport)** for PR 32202 at commit [`7cc9c95`](https://github.com/apache/spark/commit/7cc9c959345ce28c3ae380a46388f049f9971ac7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class ImmutableBitSet(val numBits: Int, val bitsToSet: Int*) extends BitSet(numBits) `
     * `  case class CombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends TypeCoercionRule `
     * `case class DomainJoin(domainAttrs: Seq[Attribute], child: LogicalPlan) extends UnaryNode `



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822636981


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42166/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822637018


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42166/
   



[GitHub] [spark] SparkQA removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822583342


   **[Test build #137636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137636/testReport)** for PR 32202 at commit [`72eae96`](https://github.com/apache/spark/commit/72eae960193077e0cb8c9de9cfd9525f8d68cdea).



[GitHub] [spark] attilapiros commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822018123

@FatalLin
Some more thoughts/question:

1. Why are there two configs in Hive for this?

- mapred.input.dir.recursive
- hive.mapred.supports.subdirectories

2. How does Hive behave when only one is true? If both are needed, we need to check both too!

3. Please update the title: skip the part "when configuration is enable" and reword the rest.
What about "Supporting non-partitioned Hive tables with subdirectories"?

4. Please update the description, too. In "What changes were proposed in this pull request?" it's enough if you explain the title a bit more. I suggest using a spell checker to avoid errors like: setted => set, configurtions => configuration.
Please note the PR description is extremely important as after the PR is merged it will be the commit message.

5. At "Does this PR introduce any user-facing change?" elaborate on the impact of this change. Remove the "maybe we could add this option in documents to notice users for the enhancement." sentence: documenting the option is a good idea, and it should be part of this PR.
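
Points 1 and 2 above can be sketched as follows. This is a minimal sketch, not the PR's code: it assumes hadoop-common is on the classpath and that both flags must be true, which is exactly the open question in this thread. The object and method names are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper for illustration only; not part of the PR.
object SubdirectorySupportSketch {
  val HiveKey = "hive.mapred.supports.subdirectories"
  val MapredKey = "mapred.input.dir.recursive"

  // Assumption: sub-directory reading is enabled only when BOTH flags are true.
  def subdirectoriesEnabled(conf: Configuration): Boolean =
    conf.getBoolean(HiveKey, false) && conf.getBoolean(MapredKey, false)

  def main(args: Array[String]): Unit = {
    // Passing `false` skips the default resources, so only values set here are visible.
    val conf = new Configuration(false)
    conf.setBoolean(HiveKey, true)
    println(subdirectoriesEnabled(conf)) // false: only one flag is set
    conf.setBoolean(MapredKey, true)
    println(subdirectoriesEnabled(conf)) // true: both flags are set
  }
}
```

If Hive turns out to honor either flag on its own, the `&&` would become `||`; the sketch only makes the decision explicit.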


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821819393


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137505/
   



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821810762


   **[Test build #137505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137505/testReport)** for PR 32202 at commit [`eb56adb`](https://github.com/apache/spark/commit/eb56adbb0348334e4ad10413fd72e73053eb901b).



[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822210112


   About the configurations "mapred.input.dir.recursive" and "hive.mapred.supports.subdirectories": I found a brief introduction in the Hive documentation:
   ``
   hive.mapred.supports.subdirectories
   Default Value: false
   Added In: Hive 0.10.0 with HIVE-3276
   Whether the version of Hadoop which is running supports sub-directories for tables/partitions. Many Hive optimizations can be applied if the Hadoop version supports sub-directories for tables/partitions. This support was added by MAPREDUCE-1501.
   (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties)
   ``
   It looks like "mapred.input.dir.recursive" allows MapReduce to read files from sub-directories, while "hive.mapred.supports.subdirectories" allows Hive to apply some sub-directory-related optimizations. My first thought was that, because Hive and MapReduce are separate projects, it makes sense for each to have its own configuration. But in Spark the operation happens only in spark-sql, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think, @attilapiros?
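
To illustrate what reading from sub-directories implies for a reader, here is a minimal sketch that collects a root directory and every directory nested below it. It uses `java.io.File` on the local filesystem purely for illustration; the PR itself works with Hadoop `Path`s, and the object and method names here are hypothetical.

```scala
import java.io.File

// Hypothetical illustration only; the PR operates on Hadoop paths, not java.io.File.
object ListDirectoriesSketch {
  // Returns the root path followed by the paths of all nested directories.
  // Assumes `root` is a directory and that the tree contains no symlink cycles.
  def listDirectories(root: File): Seq[String] = {
    // listFiles() returns null when `root` is not a directory or cannot be read.
    val children = Option(root.listFiles()).map(_.toSeq).getOrElse(Seq.empty[File])
    root.getPath +: children.filter(_.isDirectory).flatMap(listDirectories)
  }

  def main(args: Array[String]): Unit = {
    val root = new File(args.headOption.getOrElse("."))
    listDirectories(root).foreach(println)
  }
}
```

A reader that only scans the root would see the files directly under it; passing every returned directory to the reader is what makes files in nested directories visible.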
   



[GitHub] [spark] attilapiros commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821806848


   @FatalLin Thanks for your contribution! Welcome here! 
   
   This is not my focus area, but I have added some comments. Let's cc some more competent developers:
   @dongjoon-hyun @viirya 
   
   But I can enable the testing for you.
   
   ok to test



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822662133


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137632/
   



[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822598265


   > > I don't understand why the Notify test workflow always fails with a 404 Not Found exception; I don't think anything I changed would cause it. Does anyone have an idea? Thanks.
   > 
   > @FatalLin - you can rebase to latest master branch, and this error should go away.
   
   it works, thanks.



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822298426


   **[Test build #137598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137598/testReport)** for PR 32202 at commit [`1818fc5`](https://github.com/apache/spark/commit/1818fc51fff603f35e3281659685ecdf372199d5).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821858393


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42081/
   



[GitHub] [spark] FatalLin edited a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin edited a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822567233


   > > > But in Spark, the operation happens only in spark-sql, so I only checked the Hive-side configuration "hive.mapred.supports.subdirectories" earlier. What do you think? @attilapiros
   > > 
   > > 
   > > The original intention of this PR is to be compatible with Hive, so I would check both configs: on the same machine I would expect to get the same answers when querying a non-partitioned table with subdirectories.
   > 
   > got it, I'll check both configs, thanks!
   
   After some consideration (including studying PRs from other devs and rethinking point 4 that @attilapiros mentioned above), I decided to add a new config to replace the Hive configs we mentioned earlier, but I'm not sure whether the config name is suitable (maybe too long, I guess). As always, any feedback is appreciated!
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822787823


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137636/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822583926


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42163/
   



[GitHub] [spark] attilapiros commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

attilapiros commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-823151771


   cc @peter-toth 



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821860620


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42081/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-821869693


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137507/
   



[GitHub] [spark] attilapiros commented on a change in pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

attilapiros commented on a change in pull request #32202:
URL: https://github.com/apache/spark/pull/32202#discussion_r615417964



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -277,6 +278,19 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(output = newOutput)
   }
 
+  private def getDirectoryPathSeq(rootPath: Path): Seq[String] = {
+    val enableSupportSubDirectories =
+      sparkSession.sparkContext.
+        hadoopConfiguration.get("hive.mapred.supports.subdirectories", "false")

Review comment:
       This would fit even better:
   https://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html#getBoolean-java.lang.String-boolean-
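
The difference between the two accessors can be sketched as below, assuming hadoop-common is on the classpath. `Configuration.get(name, default)` returns a `String` that still has to be compared or parsed, while `Configuration.getBoolean(name, default)` returns a `Boolean` directly. The object and method names are hypothetical, and the string comparison is an assumption about how the raw value might be used, since the diff does not show that step.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical comparison for illustration only; not part of the PR.
object GetBooleanSketch {
  val Key = "hive.mapred.supports.subdirectories"

  // String-based lookup: the caller has to interpret the text itself (assumed comparison).
  def viaGet(conf: Configuration): Boolean =
    conf.get(Key, "false") == "true"

  // Typed lookup suggested in the review comment.
  def viaGetBoolean(conf: Configuration): Boolean =
    conf.getBoolean(Key, false)

  def main(args: Array[String]): Unit = {
    // Passing `false` skips the default resources, so only values set here are visible.
    val conf = new Configuration(false)
    conf.set(Key, "true")
    println(s"get: ${viaGet(conf)}, getBoolean: ${viaGetBoolean(conf)}")
  }
}
```

The typed accessor keeps the default as a `Boolean` and removes the hand-written comparison from the calling code.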





[GitHub] [spark] FatalLin commented on a change in pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin commented on a change in pull request #32202:
URL: https://github.com/apache/spark/pull/32202#discussion_r615273292



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -277,6 +278,31 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(output = newOutput)
   }
 
+  private def getDirectoryPathSeq(rootPath: Path): Seq[String] = {
+    val enableSupportSubDirectories =
+      sparkSession.conf.getOption("hive.mapred.supports.subdirectories")

Review comment:
       should we create a new configuration for it?





[GitHub] [spark] AmplabJenkins commented on pull request #32202: [SPARK-28098]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-820927674


   Can one of the admins verify this patch?



[GitHub] [spark] chong0929 commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories

Posted by GitBox <gi...@apache.org>.

chong0929 commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-851229533


   > > > I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't it changed in this change?
   > > 
   > > 
   > > You mean it will hit the same problem if we trigger the action with the Hive engine instead of the Spark native reader?
   > > I thought it could be handled with a Hive configuration such as "hive.mapred.supports.subdirectories".
   > 
   > I mean there would be an exception: “java.io.IOException: Not a file: hdfs://ns000/{table_name}/month=02/1” if I use the Spark engine to read a partitioned Hive table with subdirectories.
   
   I confirmed that there is no exception, but the data in the partitioned table's subdirectory cannot be read: https://github.com/apache/spark/pull/32679



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822170465


   **[Test build #137574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137574/testReport)** for PR 32202 at commit [`eb56adb`](https://github.com/apache/spark/commit/eb56adbb0348334e4ad10413fd72e73053eb901b).



[GitHub] [spark] SparkQA commented on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822657108


   **[Test build #137632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137632/testReport)** for PR 32202 at commit [`88afaf6`](https://github.com/apache/spark/commit/88afaf6e9d786dbcf78e529f240f066fe3dbb789).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32202:
URL: https://github.com/apache/spark/pull/32202#issuecomment-822637018







[GitHub] [spark] FatalLin closed pull request #32202: [SPARK-28098][SQL]allow reader could read files from subdirectory for non-partitioned table when configuration is enable

Posted by GitBox <gi...@apache.org>.

FatalLin closed pull request #32202:
URL: https://github.com/apache/spark/pull/32202


   

