You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by tdas <gi...@git.apache.org> on 2016/05/03 01:35:30 UTC

[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/12856

    [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is not partitioning scheme in the given paths

    ## What changes were proposed in this pull request?
    Lets says there are json files in the following directories structure
    ```
    xyz/file0.json
    xyz/subdir1/file1.json
    xyz/subdir2/file2.json
    xyz/subdir1/subsubdir1/file3.json
    ```
    `sqlContext.read.json("xyz")` should read only file0.json according to behavior in Spark 1.6.1. However in current master, all the 4 files are read. 
    
    The fix is to make FileCatalog return only the children files of the given path if there is not partitioning detected (instead of all the recursive list of files).
    
    ## How was this patch tested?
    
    unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-14997

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12856.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12856
    
----
commit b1f82ce389721c60bf22694470c52596f3ed9bcc
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-05-03T00:29:10Z

    Fixed SPARK-14997

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61830736
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -291,8 +291,12 @@ class HDFSFileCatalog(
       refresh()
     
       override def listFiles(filters: Seq[Expression]): Seq[Partition] = {
    +
         if (partitionSpec().partitionColumns.isEmpty) {
    -      Partition(InternalRow.empty, allFiles().filterNot(_.getPath.getName startsWith "_")) :: Nil
    +      Partition(
    +        InternalRow.empty,
    +        unpartitionedDataFiles().filterNot(_.getPath.getName startsWith "_")
    --- End diff --
    
    Can we call `allFiles` at here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217570089
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61972552
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    +
    +        case _ =>
    +          fail("Unexpected error trying to infer schema from empty dir", e)
    +      }
    +
    +      /** Test whether data is read with the given path matches the expected answer */
    +      def testWithPath(path: String, expectedAnswer: Seq[Row]): Unit = {
    +        val df = sqlContext.read
    +          .format(dataSourceName)
    +          .schema(dataInDir.schema) // avoid schema inference for any format
    +          .load(path)
    +        checkAnswer(df, expectedAnswer)
    +      }
    +
    +      // Reading by the path 'file/' *not by 'file/subdir') should give empty results
    +      // as there are no files in 'file' and it should not pick up files in 'file/subdir'
    +      testWithPath(dir, Seq.empty)
    +
    +      dataInDir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    --- End diff --
    
    If we use overwrite, we will delete `subsubdir`. Is it what we want?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62143147
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
    @@ -61,7 +61,31 @@ abstract class PartitioningAwareFileCatalog(
         }
       }
     
    -  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  override def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse {
    +            leafFiles.get(path).map(Array(_))
    --- End diff --
    
    How about we add a test to make sure all paths in leafFiles contain the scheme?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216485397
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57618/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216616837
  
    **[Test build #57650 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57650/consoleFull)** for PR 12856 at commit [`f1b793a`](https://github.com/apache/spark/commit/f1b793a0bd3580d83f8ae8ebf80954c887fa6917).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216430637
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216642383
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62296796
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
    @@ -61,7 +61,31 @@ abstract class PartitioningAwareFileCatalog(
         }
       }
     
    -  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  override def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse {
    +            leafFiles.get(path).map(Array(_))
    --- End diff --
    
    seems we still need to update this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217571587
  
    Thanks! Merging to master and 2.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217005064
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62142699
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    --- End diff --
    
    Thanks. How about add the reason at here as a comment?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216642388
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57650/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61976162
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -337,7 +337,34 @@ class HDFSFileCatalog(
         }
       }
     
    -  def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  /**
    +   * All the files to consider for processing. If there is a partitioning scheme, then
    +   * consider all the leaf files in the input paths. Else consider only the input paths
    +   * (if a path is file) or their immediate children (if a path is a directory).
    +   */
    +  def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse { leafFiles.get(path).map(Array(_)) }
    --- End diff --
    
    I am wondering if paths in `leafFiles` only contain qualified paths?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217330896
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57941/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61972362
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    --- End diff --
    
    When will we reach here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216452616
  
    **[Test build #57604 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57604/consoleFull)** for PR 12856 at commit [`2f7c523`](https://github.com/apache/spark/commit/2f7c52393c137db633726f8bacb2dc552bf31c65).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217330800
  
    **[Test build #57941 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57941/consoleFull)** for PR 12856 at commit [`9a64496`](https://github.com/apache/spark/commit/9a64496f9cdea03f839504da1807deed9b123b2a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class FileCatalogSuite extends SharedSQLContext `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216457345
  
    @rxin nope this does not fix that. I tested the code on the JIRA you gave on this branch, and did not solve the issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217573110
  
    @yhuai Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216485270
  
    **[Test build #57618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57618/consoleFull)** for PR 12856 at commit [`4198d56`](https://github.com/apache/spark/commit/4198d560e0debc93df7675fbddc64b35a4e3adc5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217412976
  
    **[Test build #57977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57977/consoleFull)** for PR 12856 at commit [`7c5d7ba`](https://github.com/apache/spark/commit/7c5d7ba0d81a7a617fee80a7a19a2fb08f6d9bbf).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217413067
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57977/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217544431
  
    LGTM! Those tests are awesome!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217549885
  
    **[Test build #58027 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58027/consoleFull)** for PR 12856 at commit [`8abc999`](https://github.com/apache/spark/commit/8abc99901f864d142b76386981e7bcdbd2be0b64).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61975859
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    +
    +        case _ =>
    +          fail("Unexpected error trying to infer schema from empty dir", e)
    +      }
    +
    +      /** Test whether data is read with the given path matches the expected answer */
    +      def testWithPath(path: String, expectedAnswer: Seq[Row]): Unit = {
    +        val df = sqlContext.read
    +          .format(dataSourceName)
    +          .schema(dataInDir.schema) // avoid schema inference for any format
    +          .load(path)
    +        checkAnswer(df, expectedAnswer)
    +      }
    +
    +      // Reading by the path 'file/' *not by 'file/subdir') should give empty results
    +      // as there are no files in 'file' and it should not pick up files in 'file/subdir'
    +      testWithPath(dir, Seq.empty)
    +
    +      dataInDir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    --- End diff --
    
    Aah... i should fix that and test later whether reading both cases dir and subdir works as expected


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61836088
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -443,6 +453,22 @@ class HDFSFileCatalog(
         }
       }
     
    +  /** List of files to consider when there is not inferred partitioning scheme */
    +  private def unpartitionedDataFiles(): Seq[FileStatus] = {
    +    // For each of the input paths, get the list of files inside them
    +    paths.flatMap { path =>
    +      // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +      val fs = path.getFileSystem(hadoopConf)
    +      val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +      // If it is a directory (i.e. exists in leafDirToChildrenFiles), return its children files
    +      // Or if it is a file (i.e. exists in leafFiles), return the path itself
    +      leafDirToChildrenFiles.get(qualifiedPath).orElse {
    --- End diff --
    
    nvm. This logic is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61975374
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    --- End diff --
    
    The` SimpleTextSource`, used for testing purposes only, does not support schema inference, it expects the option `dataSchema` to be present. Without this case, `SimpleTextHadoopFsRelationSuite` fails this test with 
    ```
    org.scalatest.exceptions.TestFailedException: Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.util.NoSuchElementException was thrown.
    
    sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: Expected exception org.apache.spark.sql.AnalysisException to be thrown, but java.util.NoSuchElementException was thrown.
    	at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
    	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
    	at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
    	at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
    	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$23$$anonfun$apply$mcV$sp$36.apply(HadoopFsRelationTest.scala:519)
    	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$23$$anonfun$apply$mcV$sp$36.apply(HadoopFsRelationTest.scala:492)
    	at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
    	at org.apache.spark.sql.sources.HadoopFsRelationTest.withTempPath(HadoopFsRelationTest.scala:39)
    	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$23.apply$mcV$sp(HadoopFsRelationTest.scala:492)
    	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$23.apply(HadoopFsRelationTest.scala:492)
    	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$23.apply(HadoopFsRelationTest.scala:492)
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62296892
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
    @@ -61,7 +61,31 @@ abstract class PartitioningAwareFileCatalog(
         }
       }
     
    -  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  override def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse {
    +            leafFiles.get(path).map(Array(_))
    --- End diff --
    
    how about we add a test that will not pass if we use `path` at here? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217055512
  
    **[Test build #57827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57827/consoleFull)** for PR 12856 at commit [`f15ee32`](https://github.com/apache/spark/commit/f15ee3217b70da6282238faba2c6f97492b4116e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217532164
  
    **[Test build #58017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58017/consoleFull)** for PR 12856 at commit [`33a1345`](https://github.com/apache/spark/commit/33a1345952410713d1cf00e974028bd00267536d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217330895
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217399925
  
    @yhuai the path to qualifiedPath fix got lost in an incorrect merge. you are right that this is prone to error, so I added a unit test in FileCatalogSuite that fails in if qualified paths are not used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61977276
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    +
    +        case _ =>
    +          fail("Unexpected error trying to infer schema from empty dir", e)
    +      }
    +
    +      /** Test whether data is read with the given path matches the expected answer */
    +      def testWithPath(path: String, expectedAnswer: Seq[Row]): Unit = {
    +        val df = sqlContext.read
    +          .format(dataSourceName)
    +          .schema(dataInDir.schema) // avoid schema inference for any format
    +          .load(path)
    +        checkAnswer(df, expectedAnswer)
    +      }
    +
    +      // Reading by the path 'file/' *not by 'file/subdir') should give empty results
    +      // as there are no files in 'file' and it should not pick up files in 'file/subdir'
    +      testWithPath(dir, Seq.empty)
    +
    +      dataInDir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(dir)
    +
    +      // Should give only rows from partitionedTestDF2
    +      testWithPath(dir, dataInDir.collect())
    +    }
    +  }
    +
    +  test("Hadoop style globbing - unpartitioned data") {
    +    withTempPath { file =>
    +
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +      val subsubdir = new File(subdir, "subsubdir").getCanonicalPath
    +      val anotherSubsubdir =
    +        new File(new File(dir, "another-subdir"), "another-subsubdir").getCanonicalPath
    +
    +      val dataInSubdir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubsubdir = Seq(4, 5, 6).toDF("value")
    +      val dataInAnotherSubsubdir = Seq(7, 8, 9).toDF("value")
    +
    +      dataInSubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subdir)
    +
    +      dataInSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subsubdir)
    +
    +      dataInAnotherSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (anotherSubsubdir)
    +
    +      /*
    +
    --- End diff --
    
    remove empty lines


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61977059
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    +
    +        case _ =>
    +          fail("Unexpected error trying to infer schema from empty dir", e)
    +      }
    +
    +      /** Test whether data is read with the given path matches the expected answer */
    +      def testWithPath(path: String, expectedAnswer: Seq[Row]): Unit = {
    +        val df = sqlContext.read
    +          .format(dataSourceName)
    +          .schema(dataInDir.schema) // avoid schema inference for any format
    +          .load(path)
    +        checkAnswer(df, expectedAnswer)
    +      }
    +
    +      // Reading by the path 'file/' *not by 'file/subdir') should give empty results
    +      // as there are no files in 'file' and it should not pick up files in 'file/subdir'
    +      testWithPath(dir, Seq.empty)
    +
    +      dataInDir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(dir)
    +
    +      // Should give only rows from partitionedTestDF2
    +      testWithPath(dir, dataInDir.collect())
    +    }
    +  }
    +
    +  test("Hadoop style globbing - unpartitioned data") {
    +    withTempPath { file =>
    +
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +      val subsubdir = new File(subdir, "subsubdir").getCanonicalPath
    +      val anotherSubsubdir =
    +        new File(new File(dir, "another-subdir"), "another-subsubdir").getCanonicalPath
    +
    +      val dataInSubdir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubsubdir = Seq(4, 5, 6).toDF("value")
    +      val dataInAnotherSubsubdir = Seq(7, 8, 9).toDF("value")
    +
    +      dataInSubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subdir)
    +
    +      dataInSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subsubdir)
    +
    +      dataInAnotherSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (anotherSubsubdir)
    --- End diff --
    
    Yeah. i get the concern. I didnt realize Overwrite might delete subdirs. I should check the dir structures


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62382369
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.test.SharedSQLContext
    +
    +class FileCatalogSuite extends SharedSQLContext {
    +
    +  test("ListingFileCatalog: leaf files are qualified paths") {
    +    withTempDir { dir =>
    +      val file = new File(dir, "text.txt")
    +      stringToFile(file, "text")
    +
    +      val path = new Path(file.getCanonicalPath)
    +      val catalog = new ListingFileCatalog(sqlContext.sparkSession, Seq(path), Map.empty, None) {
    +        def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
    +        def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
    +      }
    +      assert(catalog.leafFilePaths.forall(p => p.toString.startsWith("file:/")))
    +      assert(catalog.leafDirPaths.forall(p => p.toString.startsWith("file:/")))
    +    }
    +  }
    +
    +  test("ListingFileCatalog: input paths are converted to qualified paths") {
    +    withTempDir { dir =>
    +      val file = new File(dir, "text.txt")
    +      stringToFile(file, "text")
    +
    +      val unqualifiedDirPath = new Path(dir.getCanonicalPath)
    +      val unqualifiedFilePath = new Path(file.getCanonicalPath)
    +      require(!unqualifiedDirPath.toString.contains("file:"))
    +      require(!unqualifiedFilePath.toString.contains("file:"))
    +
    +      val fs = unqualifiedDirPath.getFileSystem(sparkContext.hadoopConfiguration)
    +      val qualifiedFilePath = fs.makeQualified(new Path(file.getCanonicalPath))
    +      require(qualifiedFilePath.toString.startsWith("file:"))
    +
    +      val catalog1 = new ListingFileCatalog(
    +        sqlContext.sparkSession, Seq(unqualifiedDirPath), Map.empty, None)
    +      assert(catalog1.allFiles.map(_.getPath) === Seq(qualifiedFilePath))
    +
    +      val catalog2 = new ListingFileCatalog(
    +        sqlContext.sparkSession, Seq(unqualifiedDirPath), Map.empty, None)
    --- End diff --
    
    What is the difference between this `catalog2` and `catalog1`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61977265
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    +
    +        case _ =>
    +          fail("Unexpected error trying to infer schema from empty dir", e)
    +      }
    +
    +      /** Test whether data is read with the given path matches the expected answer */
    +      def testWithPath(path: String, expectedAnswer: Seq[Row]): Unit = {
    +        val df = sqlContext.read
    +          .format(dataSourceName)
    +          .schema(dataInDir.schema) // avoid schema inference for any format
    +          .load(path)
    +        checkAnswer(df, expectedAnswer)
    +      }
    +
    +      // Reading by the path 'file/' *not by 'file/subdir') should give empty results
    +      // as there are no files in 'file' and it should not pick up files in 'file/subdir'
    +      testWithPath(dir, Seq.empty)
    +
    +      dataInDir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(dir)
    +
    +      // Should give only rows from partitionedTestDF2
    +      testWithPath(dir, dataInDir.collect())
    +    }
    +  }
    +
    +  test("Hadoop style globbing - unpartitioned data") {
    +    withTempPath { file =>
    +
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +      val subsubdir = new File(subdir, "subsubdir").getCanonicalPath
    +      val anotherSubsubdir =
    +        new File(new File(dir, "another-subdir"), "another-subsubdir").getCanonicalPath
    +
    +      val dataInSubdir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubsubdir = Seq(4, 5, 6).toDF("value")
    +      val dataInAnotherSubsubdir = Seq(7, 8, 9).toDF("value")
    +
    +      dataInSubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subdir)
    +
    +      dataInSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subsubdir)
    +
    +      dataInAnotherSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (anotherSubsubdir)
    +
    +      /*
    +
    +        Directory structure generated
    +
    +        dir
    +          |
    +          |___ subdir
    +          |     |
    +          |     |___ [ files of dataInSubdir ]
    +          |     |
    +          |     |___ subsubdir
    +          |               |
    +          |               |___ [ files of dataInSubsubdir ]
    +          |
    +          |
    +          |___ anotherSubdir
    +                |
    +                |___ anotherSubsubdir
    +                          |
    +                          |___ [ files of dataInAnotherSubsubdir ]
    +       */
    +
    +      val schema = dataInSubdir.schema
    +
    +      /** Test whether data is read with the given path matches the expected answer */
    +      def testWithPath(path: String, expectedDf: DataFrame): Unit = {
    +        val df = sqlContext.read
    +          .format(dataSourceName)
    +          .schema(schema) // avoid schema inference for any format
    +          .load(path)
    +        checkAnswer(df, expectedDf)
    +      }
    +
    +      testWithPath(s"$dir/*/", dataInSubdir)
    +      testWithPath(s"$dir/sub*/*", dataInSubdir.union(dataInSubsubdir))
    +      testWithPath(s"$dir/another*/*", dataInAnotherSubsubdir)
    +      testWithPath(s"$dir/*/another*", dataInAnotherSubsubdir)
    +      testWithPath(s"$dir/*/*", dataInSubdir.union(dataInSubsubdir).union(dataInAnotherSubsubdir))
    +
    --- End diff --
    
    remove empty lines.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217343068
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61975674
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    --- End diff --
    
    well probably it should run with all data sources to make sure no data source violates this rule. this is not really partition discovery but more like what to do when there is not partition to discover. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216431720
  
    Would this also resolve https://issues.apache.org/jira/browse/SPARK-14463?filter=12335640 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61833537
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -337,7 +341,13 @@ class HDFSFileCatalog(
         }
       }
     
    -  def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      unpartitionedDataFiles()
    --- End diff --
    
    Maybe add some comments at here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61976246
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { file =>
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +               |
    +               |___ subsubdir
    +                         |
    +                         |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      partitionedTestDF1.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    +
    +        case _: java.util.NoSuchElementException if e.getMessage.contains("dataSchema") =>
    +          // ignore, the source format requires schema to be provided by user
    +
    +        case _ =>
    +          fail("Unexpected error trying to infer schema from empty dir", e)
    +      }
    +
    +      /** Test whether data is read with the given path matches the expected answer */
    +      def testWithPath(path: String, expectedAnswer: Seq[Row]): Unit = {
    +        val df = sqlContext.read
    +          .format(dataSourceName)
    +          .schema(dataInDir.schema) // avoid schema inference for any format
    +          .load(path)
    +        checkAnswer(df, expectedAnswer)
    +      }
    +
    +      // Reading by the path 'file/' *not by 'file/subdir') should give empty results
    +      // as there are no files in 'file' and it should not pick up files in 'file/subdir'
    +      testWithPath(dir, Seq.empty)
    +
    +      dataInDir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(dir)
    +
    +      // Should give only rows from partitionedTestDF2
    +      testWithPath(dir, dataInDir.collect())
    +    }
    +  }
    +
    +  test("Hadoop style globbing - unpartitioned data") {
    +    withTempPath { file =>
    +
    +      val dir = file.getCanonicalPath
    +      val subdir = new File(dir, "subdir").getCanonicalPath
    +      val subsubdir = new File(subdir, "subsubdir").getCanonicalPath
    +      val anotherSubsubdir =
    +        new File(new File(dir, "another-subdir"), "another-subsubdir").getCanonicalPath
    +
    +      val dataInSubdir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubsubdir = Seq(4, 5, 6).toDF("value")
    +      val dataInAnotherSubsubdir = Seq(7, 8, 9).toDF("value")
    +
    +      dataInSubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subdir)
    +
    +      dataInSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (subsubdir)
    +
    +      dataInAnotherSubsubdir.write
    +        .format (dataSourceName)
    +        .mode (SaveMode.Overwrite)
    +        .save (anotherSubsubdir)
    --- End diff --
    
    Maybe manually check if we are generating the desired dir structure? (I think we will. But, it is good to double check.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62382277
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.test.SharedSQLContext
    +
    +class FileCatalogSuite extends SharedSQLContext {
    +
    +  test("ListingFileCatalog: leaf files are qualified paths") {
    +    withTempDir { dir =>
    +      val file = new File(dir, "text.txt")
    +      stringToFile(file, "text")
    +
    +      val path = new Path(file.getCanonicalPath)
    +      val catalog = new ListingFileCatalog(sqlContext.sparkSession, Seq(path), Map.empty, None) {
    +        def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
    +        def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
    +      }
    +      assert(catalog.leafFilePaths.forall(p => p.toString.startsWith("file:/")))
    +      assert(catalog.leafDirPaths.forall(p => p.toString.startsWith("file:/")))
    +    }
    +  }
    +
    +  test("ListingFileCatalog: input paths are converted to qualified paths") {
    --- End diff --
    
    Nice!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217323714
  
    **[Test build #57941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57941/consoleFull)** for PR 12856 at commit [`9a64496`](https://github.com/apache/spark/commit/9a64496f9cdea03f839504da1807deed9b123b2a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217551680
  
    **[Test build #58017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58017/consoleFull)** for PR 12856 at commit [`33a1345`](https://github.com/apache/spark/commit/33a1345952410713d1cf00e974028bd00267536d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62142173
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
    @@ -61,7 +61,31 @@ abstract class PartitioningAwareFileCatalog(
         }
       }
     
    -  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  override def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse {
    +            leafFiles.get(path).map(Array(_))
    --- End diff --
    
    Should we use `qualifiedPath` instead of `path`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217065015
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57827/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216419041
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216430638
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57585/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61831035
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -443,6 +453,22 @@ class HDFSFileCatalog(
         }
       }
     
    +  /** List of files to consider when there is not inferred partitioning scheme */
    +  private def unpartitionedDataFiles(): Seq[FileStatus] = {
    +    // For each of the input paths, get the list of files inside them
    +    paths.flatMap { path =>
    +      // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +      val fs = path.getFileSystem(hadoopConf)
    +      val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +      // If it is a directory (i.e. exists in leafDirToChildrenFiles), return its children files
    +      // Or if it is a file (i.e. exists in leafFiles), return the path itself
    +      leafDirToChildrenFiles.get(qualifiedPath).orElse {
    --- End diff --
    
    If we have multiple layers of sub-directories, seems `get(qualifiedPath)` will always be false?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61972415
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,143 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    --- End diff --
    
    Should we put it in partition discovery suite? If we put it at here, we will run it with every data source, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217005067
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57795/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217065010
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217336202
  
    **[Test build #57951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57951/consoleFull)** for PR 12856 at commit [`8c06d4e`](https://github.com/apache/spark/commit/8c06d4ea4516e1760958b08f6e6b77d1811dfd46).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216430585
  
    **[Test build #57585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57585/consoleFull)** for PR 12856 at commit [`3bb42bd`](https://github.com/apache/spark/commit/3bb42bdcfe4ddbe439652884eaa8138e5b40f2c8).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217413061
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216485396
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61971847
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -337,7 +337,34 @@ class HDFSFileCatalog(
         }
       }
     
    -  def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  /**
    +   * All the files to consider for processing. If there is a partitioning scheme, then
    +   * consider all the leaf files in the input paths. Else consider only the input paths
    +   * (if a path is file) or their immediate children (if a path is a directory).
    +   */
    +  def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse { leafFiles.get(path).map(Array(_)) }
    --- End diff --
    
    When we reach this block, the path is a file that explicitly passed in by the user, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216426109
  
    **[Test build #57585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57585/consoleFull)** for PR 12856 at commit [`3bb42bd`](https://github.com/apache/spark/commit/3bb42bdcfe4ddbe439652884eaa8138e5b40f2c8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217569891
  
    **[Test build #58027 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58027/consoleFull)** for PR 12856 at commit [`8abc999`](https://github.com/apache/spark/commit/8abc99901f864d142b76386981e7bcdbd2be0b64).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217343071
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57951/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217004902
  
    **[Test build #57795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57795/consoleFull)** for PR 12856 at commit [`efd261f`](https://github.com/apache/spark/commit/efd261fccd26ebdf6f5e6ee4750387005c7ba79d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216461567
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57604/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217064841
  
    **[Test build #57827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57827/consoleFull)** for PR 12856 at commit [`f15ee32`](https://github.com/apache/spark/commit/f15ee3217b70da6282238faba2c6f97492b4116e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216441995
  
    (Maybe adding "Closes #12774" in the description?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217342999
  
    **[Test build #57951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57951/consoleFull)** for PR 12856 at commit [`8c06d4e`](https://github.com/apache/spark/commit/8c06d4ea4516e1760958b08f6e6b77d1811dfd46).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62142868
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,151 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { dir =>
    +      val subdir = new File(dir, "subdir")
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +          |
    +          |___ subsubdir
    +                    |
    +                    |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      dataInSubdir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir.getCanonicalPath)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir.getCanonicalPath)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    --- End diff --
    
    Where will this error be thrown?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216412598
  
    **[Test build #57580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57580/consoleFull)** for PR 12856 at commit [`b1f82ce`](https://github.com/apache/spark/commit/b1f82ce389721c60bf22694470c52596f3ed9bcc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216461565
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217552263
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12856


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62142196
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
    @@ -61,7 +61,31 @@ abstract class PartitioningAwareFileCatalog(
         }
       }
     
    -  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  override def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse {
    +            leafFiles.get(path).map(Array(_))
    --- End diff --
    
    `leafFiles` contain all qualified paths, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216471289
  
    **[Test build #57618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57618/consoleFull)** for PR 12856 at commit [`4198d56`](https://github.com/apache/spark/commit/4198d560e0debc93df7675fbddc64b35a4e3adc5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216461457
  
    **[Test build #57604 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57604/consoleFull)** for PR 12856 at commit [`2f7c523`](https://github.com/apache/spark/commit/2f7c52393c137db633726f8bacb2dc552bf31c65).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217552265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58017/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61830983
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -291,8 +291,12 @@ class HDFSFileCatalog(
       refresh()
     
       override def listFiles(filters: Seq[Expression]): Seq[Partition] = {
    +
         if (partitionSpec().partitionColumns.isEmpty) {
    -      Partition(InternalRow.empty, allFiles().filterNot(_.getPath.getName startsWith "_")) :: Nil
    +      Partition(
    +        InternalRow.empty,
    +        unpartitionedDataFiles().filterNot(_.getPath.getName startsWith "_")
    --- End diff --
    
    I dont know for sure. if there is a partitioning scheme, there is an additional filter on files that start with "_" in `listFiles`, that does not seem to be present in `allFiles`. So I am not sure where its best to merge.
    
    Also, I think this way is slightly cleaner than listFiles conditionally depending on allFiles.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216419046
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57580/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216992615
  
    **[Test build #57795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57795/consoleFull)** for PR 12856 at commit [`efd261f`](https://github.com/apache/spark/commit/efd261fccd26ebdf6f5e6ee4750387005c7ba79d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62142360
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
    @@ -61,7 +61,31 @@ abstract class PartitioningAwareFileCatalog(
         }
       }
     
    -  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  override def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse {
    +            leafFiles.get(path).map(Array(_))
    --- End diff --
    
    Oh right. I forgot to address that comment. Sorry!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217400432
  
    **[Test build #57977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57977/consoleFull)** for PR 12856 at commit [`7c5d7ba`](https://github.com/apache/spark/commit/7c5d7ba0d81a7a617fee80a7a19a2fb08f6d9bbf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62279016
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
    @@ -61,7 +61,31 @@ abstract class PartitioningAwareFileCatalog(
         }
       }
     
    -  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  override def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse {
    +            leafFiles.get(path).map(Array(_))
    --- End diff --
    
    Added a new FileCatalogSuite


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r61975072
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -337,7 +337,34 @@ class HDFSFileCatalog(
         }
       }
     
    -  def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
    +  /**
    +   * All the files to consider for processing. If there is a partitioning scheme, then
    +   * consider all the leaf files in the input paths. Else consider only the input paths
    +   * (if a path is file) or their immediate children (if a path is a directory).
    +   */
    +  def allFiles(): Seq[FileStatus] = {
    +    if (partitionSpec().partitionColumns.isEmpty) {
    +      // For each of the input paths, get the list of files inside them
    +      paths.flatMap { path =>
    +        // Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel).
    +        val fs = path.getFileSystem(hadoopConf)
    +        val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +
    +        // There are three cases possible with each path
    +        // 1. The path is a directory and has children files in it. Then it must be present in
    +        //    leafDirToChildrenFiles as those children files will have been found as leaf files.
    +        //    Find its children files from leafDirToChildrenFiles and include them.
    +        // 2. The path is a file, then it will be present in leafFiles. Include this path.
    +        // 3. The path is a directory, but has no children files. Do not include this path.
    +
    +        leafDirToChildrenFiles.get(qualifiedPath)
    +          .orElse { leafFiles.get(path).map(Array(_)) }
    --- End diff --
    
    yes. though it should probably be `qualifiedPath` instead of `path`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12856#discussion_r62278940
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala ---
    @@ -486,7 +488,151 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
         }
       }
     
    -  test("Hadoop style globbing") {
    +  test("load() - with directory of unpartitioned data in nested subdirs") {
    +    withTempPath { dir =>
    +      val subdir = new File(dir, "subdir")
    +
    +      val dataInDir = Seq(1, 2, 3).toDF("value")
    +      val dataInSubdir = Seq(4, 5, 6).toDF("value")
    +
    +      /*
    +
    +        Directory structure to be generated
    +
    +        dir
    +          |
    +          |___ [ files of dataInDir ]
    +          |
    +          |___ subsubdir
    +                    |
    +                    |___ [ files of dataInSubdir ]
    +      */
    +
    +      // Generated dataInSubdir, not data in dir
    +      dataInSubdir.write
    +        .format(dataSourceName)
    +        .mode(SaveMode.Overwrite)
    +        .save(subdir.getCanonicalPath)
    +
    +      // Inferring schema should throw error as it should not find any file to infer
    +      val e = intercept[Exception] {
    +        sqlContext.read.format(dataSourceName).load(dir.getCanonicalPath)
    +      }
    +
    +      e match {
    +        case _: AnalysisException =>
    +          assert(e.getMessage.contains("infer"))
    --- End diff --
    
    I think here 
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L269 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-217570091
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58027/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216418933
  
    **[Test build #57580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57580/consoleFull)** for PR 12856 at commit [`b1f82ce`](https://github.com/apache/spark/commit/b1f82ce389721c60bf22694470c52596f3ed9bcc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997][SQL] Fixed FileCatalog to return...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12856#issuecomment-216642066
  
    **[Test build #57650 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57650/consoleFull)** for PR 12856 at commit [`f1b793a`](https://github.com/apache/spark/commit/f1b793a0bd3580d83f8ae8ebf80954c887fa6917).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org