Posted to reviews@spark.apache.org by sbcd90 <gi...@git.apache.org> on 2016/04/29 06:31:43 UTC

[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

GitHub user sbcd90 opened a pull request:

    https://github.com/apache/spark/pull/12774

    [SPARK-14997]Files in subdirectories are incorrectly considered in sqlContext.read.json()

    ## What changes were proposed in this pull request?
    
    This PR fixes the issue of "Files in subdirectories are incorrectly considered in sqlContext.read.json()".
    
    An example,
    
    ```
    xyz/file0.json
    xyz/subdir1/file1.json
    xyz/subdir2/file2.json
    xyz/subdir1/subsubdir1/file3.json
    
    
    sqlContext.read.json("xyz") should read only file0.json, per the behavior in Spark 1.6.1. However, in the current master, all 4 files are read.
    ```
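    The expected (non-recursive) behavior can be sketched outside Spark with plain `java.io` (a minimal illustration; `listTopLevelFiles` is a hypothetical helper, not a Spark API):

    ```scala
    import java.io.File
    import java.nio.file.Files

    object TopLevelListing {
      // Hypothetical helper mirroring the Spark 1.6.1 behavior described above:
      // only files directly under `dir` are returned; subdirectories are skipped.
      def listTopLevelFiles(dir: File): Seq[String] =
        dir.listFiles().filter(_.isFile).map(_.getName).sorted.toSeq

      def main(args: Array[String]): Unit = {
        // Recreate the layout from the example in a temp directory.
        val xyz = Files.createTempDirectory("xyz").toFile
        new File(xyz, "file0.json").createNewFile()
        val subdir1 = new File(xyz, "subdir1")
        subdir1.mkdir()
        new File(subdir1, "file1.json").createNewFile()

        // file1.json in subdir1 must not be listed.
        assert(listTopLevelFiles(xyz) == Seq("file0.json"))
      }
    }
    ```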
    
    
    ## How was this patch tested?
    
    unit tests
    
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sbcd90/spark jsonReadIssue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12774
    
----
commit a69329790648fc53d4cf8cc5be659f6ae1989046
Author: Subhobrata Dey <sb...@gmail.com>
Date:   2016-04-29T04:25:53Z

    [SPARK-14997]Files in subdirectories are incorrectly considered in sqlContext.read.json()

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12774#discussion_r61534248
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -376,14 +376,10 @@ class HDFSFileCatalog(
             HadoopFsRelation.shouldFilterOut(name)
           }
     
    -      val (dirs, files) = statuses.partition(_.isDirectory)
    +      val (_, files) = statuses.partition(_.isDirectory)
     
           // It uses [[LinkedHashSet]] since the order of files can affect the results. (SPARK-11500)
    -      if (dirs.isEmpty) {
    -        mutable.LinkedHashSet(files: _*)
    -      } else {
    -        mutable.LinkedHashSet(files: _*) ++ listLeafFiles(dirs.map(_.getPath))
    -      }
    +      mutable.LinkedHashSet(files: _*)
    --- End diff --
    
    Are you sure about the difference between 1.6.1 and master? I see this logic is unchanged compared to [interfaces.scala#L467-L472](https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L467-L472).
    Also, does this still support reading [partitioned tables](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery)?



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by sbcd90 <gi...@git.apache.org>.
Github user sbcd90 commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-215927369
  
    Hello @HyukjinKwon, I am able to reproduce the same issue even in Spark 1.6.1. I had these two files:
    
    ```
    /test_spark/join1.json
    {"a": 1, "b": 2}
    {"a": 2, "b": 4}
    {"a": 4, "b": 8}
    {"a": 8, "b": 16}
    ```
    ```
    /test_spark/subdir/join2.json
    {"a": 1, "c": 1}
    {"a": 2, "c": 2}
    {"a": 3, "c": 3}
    {"a": 4, "c": 4}
    ```
    I executed the following code snippet in Spark 1.6.1:
    
    ```
    package org.apache.spark
    
    import org.apache.spark.sql.SQLContext
    
    object TestApp9 extends App {
      val conf = new SparkConf().setAppName("TestApp9").setMaster("local")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
    
      sqlContext.read.json("/test_spark").show()
    }
    ```
    and the output is
    ```
    +---+---+----+
    |  a|  b|   c|
    +---+---+----+
    |  1|  2|null|
    |  2|  4|null|
    |  4|  8|null|
    |  8| 16|null|
    +---+---+----+
    ```
    So, both files are considered. The cause of the issue is the piece of code I have changed, but I'm unsure what approach to follow to fix it while still supporting partitioned tables; that needs further discussion.



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-215627856
  
    (I think "(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)" can be removed in the PR description)



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-216096227
  
    IMO, the current behavior is expected. If the document is not clear, we should correct the document. 



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-216098660
  
    @gatorsmile Does that imply we should close this for now and file a JIRA or send an email to the dev mailing list to discuss this further?



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-216099597
  
    cc @yhuai 



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-215624123
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12774



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-216411689
  
    @sbcd90 I don't get your example. It actually shows that only the file `/test_spark/join1.json` is considered in Spark 1.6.1. In Spark master this is broken, as both files will be considered. The reason for this bug is that Spark 1.6.1 had two code paths - [one](https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L67) used when partitioning is detected, and [another](https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L121) used without it. As a result, the non-partitioning case did not consider directories recursively, which is what the behavior should be.
    
    In the current master, after refactoring, there is only one code path, using FileCatalog and HDFSFileCatalog, which always returns all the files recursively, even when there is no partitioning scheme in the directory structure.
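    The two-code-path dispatch described above can be illustrated with a small sketch (my own simplification, not Spark's actual code; `looksPartitioned` is a hypothetical stand-in for Spark's partition discovery):

    ```scala
    import java.io.File
    import java.nio.file.Files

    object TwoCodePaths {
      // Hypothetical stand-in for partition discovery: treat `name=value`
      // subdirectories as a partitioning scheme.
      def looksPartitioned(dir: File): Boolean =
        dir.listFiles().exists(f => f.isDirectory && f.getName.contains("="))

      // Recursive leaf-file listing, used only for the partitioned case.
      def listLeafFiles(dir: File): Seq[String] =
        dir.listFiles().toSeq.flatMap { f =>
          if (f.isDirectory) listLeafFiles(f) else Seq(f.getName)
        }

      // 1.6.1-style dispatch: recurse only when partitioning is detected.
      def listFiles(dir: File): Seq[String] =
        if (looksPartitioned(dir)) listLeafFiles(dir).sorted
        else dir.listFiles().filter(_.isFile).map(_.getName).sorted.toSeq

      def main(args: Array[String]): Unit = {
        val flat = Files.createTempDirectory("flat").toFile
        new File(flat, "join1.json").createNewFile()
        val sub = new File(flat, "subdir"); sub.mkdir()
        new File(sub, "join2.json").createNewFile()
        // Non-partitioned layout: subdirectories are ignored.
        assert(listFiles(flat) == Seq("join1.json"))

        val part = Files.createTempDirectory("part").toFile
        val p = new File(part, "year=2016"); p.mkdir()
        new File(p, "data.json").createNewFile()
        // Partitioned layout: files are collected recursively.
        assert(listFiles(part) == Seq("data.json"))
      }
    }
    ```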
    
    




[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-216412492
  
    Here is my version of the fix - https://github.com/apache/spark/pull/12856/files



[GitHub] spark pull request: [SPARK-14997]Files in subdirectories are incor...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12774#discussion_r61534612
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -376,14 +376,10 @@ class HDFSFileCatalog(
             HadoopFsRelation.shouldFilterOut(name)
           }
     
    -      val (dirs, files) = statuses.partition(_.isDirectory)
    +      val (_, files) = statuses.partition(_.isDirectory)
     
           // It uses [[LinkedHashSet]] since the order of files can affect the results. (SPARK-11500)
    -      if (dirs.isEmpty) {
    -        mutable.LinkedHashSet(files: _*)
    -      } else {
    -        mutable.LinkedHashSet(files: _*) ++ listLeafFiles(dirs.map(_.getPath))
    -      }
    +      mutable.LinkedHashSet(files: _*)
    --- End diff --
    
    Also, I believe there is another method in the `HadoopFsRelation` companion object that lists files in parallel, and it is chosen over this one based on a threshold. I think that should also be corrected if this is really problematic.
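    A threshold-based dispatch like the one mentioned could look roughly like this (a sketch using plain `Future`s; `ParallelThreshold`, `listAll`, and `listLeafFiles` are hypothetical names, not Spark's actual API):

    ```scala
    import java.io.File
    import java.nio.file.Files
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    object ThresholdListing {
      // Hypothetical threshold; Spark's actual value comes from a config.
      val ParallelThreshold = 4

      // List leaf files under one directory, sequentially and recursively.
      def listLeafFiles(dir: File): Seq[File] = {
        val (dirs, files) = dir.listFiles().toSeq.partition(_.isDirectory)
        files ++ dirs.flatMap(listLeafFiles)
      }

      // Dispatch: list the directories in parallel once their number
      // crosses the threshold, otherwise sequentially.
      def listAll(dirs: Seq[File]): Seq[File] =
        if (dirs.length >= ParallelThreshold) {
          val futures = dirs.map(d => Future(listLeafFiles(d)))
          Await.result(Future.sequence(futures), Duration.Inf).flatten
        } else {
          dirs.flatMap(listLeafFiles)
        }

      def main(args: Array[String]): Unit = {
        val roots = (1 to 5).map { i =>
          val d = Files.createTempDirectory(s"root$i").toFile
          new File(d, s"f$i.json").createNewFile()
          d
        }
        // 5 directories >= threshold, so the parallel path is taken.
        assert(listAll(roots).map(_.getName).sorted ==
          (1 to 5).map(i => s"f$i.json"))
      }
    }
    ```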

