Posted to reviews@spark.apache.org by rxin <gi...@git.apache.org> on 2016/05/20 19:41:19 UTC

[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/13227

    [SPARK-15454][SQL] Filter out files starting with _

    ## What changes were proposed in this pull request?
    Many other systems (e.g. Impala) use _xxx names for staging files, and Spark should not be reading those files.
    
    ## How was this patch tested?
    Added a unit test case.
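
To illustrate the motivation, here is a minimal sketch of dropping staging entries from a directory listing. It is not the code in this pull request: `StagingFilterSketch`, `isStaging`, `readable`, and the example paths are hypothetical, and the rule is simplified to the one in the PR title (the actual change also handles names starting with "." and keeps Parquet metadata files).

```scala
import java.nio.file.Paths

object StagingFilterSketch {
  // Simplified rule from the PR title: a path is treated as staging
  // output when its last component starts with "_".
  def isStaging(path: String): Boolean =
    Paths.get(path).getFileName.toString.startsWith("_")

  // Keep only the paths a reader should see.
  def readable(paths: Seq[String]): Seq[String] =
    paths.filterNot(isStaging)

  def main(args: Array[String]): Unit = {
    // Hypothetical listing of a table directory.
    val listing = Seq(
      "/data/t/part-00000.parquet",
      "/data/t/_impala_insert_staging",
      "/data/t/_SUCCESS"
    )
    readable(listing).foreach(println)
  }
}
```

Only the last path component is checked, so a data file under a directory such as `/data/_archive/` would not be skipped by this sketch.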

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-15454

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13227.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13227
    
----
commit ecb4f02983bdda177bdee36c3dc32f2f57929813
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-05-20T19:37:22Z

    [SPARK-15454][SQL] HadoopFsRelation should filter out files starting with _

commit a3bcddd86184c5178e5a0367d540a51b1089b610
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-05-20T19:40:26Z

    Add test case

commit 447fe4edca1962277124d00f8cc40dcfc5096abe
Author: Reynold Xin <rx...@databricks.com>
Date:   2016-05-20T19:41:03Z

    Improve test case

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220703887
  
    **[Test build #59020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59020/consoleFull)** for PR 13227 at commit [`705a76f`](https://github.com/apache/spark/commit/705a76f84605f1d54e98caae7f81631bf80c8feb).




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220700270
  
    cc @liancheng 




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220737817
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13227#discussion_r64100013
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -341,11 +341,11 @@ private[sql] object HadoopFsRelation extends Logging {
     
       /** Checks if we should filter out this path name. */
       def shouldFilterOut(pathName: String): Boolean = {
    -    // TODO: We should try to filter out all files/dirs starting with "." or "_".
    -    // The only reason that we are not doing it now is that Parquet needs to find those
    -    // metadata files from leaf files returned by this methods. We should refactor
    -    // this logic to not mix metadata files with data files.
    -    pathName == "_SUCCESS" || pathName == "_temporary" || pathName.startsWith(".")
    +    // We filter everything that starts with _ and ., except _common_metadata and _metadata
    +    // because Parquet needs to find those metadata files from leaf files returned by this method.
    +    // We should refactor this logic to not mix metadata files with data files.
    +    (pathName.startsWith("_") || pathName.startsWith(".")) &&
    +      !pathName.startsWith("_common_metadata") && !pathName.startsWith("_metadata")
    --- End diff --
    
    Why `startsWith` instead of `==` here?




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220706484
  
    LGTM except for one minor comment.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220701184
  
    LGTM, pending tests.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220700866
  
    **[Test build #59017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59017/consoleFull)** for PR 13227 at commit [`447fe4e`](https://github.com/apache/spark/commit/447fe4edca1962277124d00f8cc40dcfc5096abe).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220722282
  
    **[Test build #59020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59020/consoleFull)** for PR 13227 at commit [`705a76f`](https://github.com/apache/spark/commit/705a76f84605f1d54e98caae7f81631bf80c8feb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220737818
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59018/
    Test FAILed.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13227#discussion_r64101579
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
    @@ -341,11 +341,11 @@ private[sql] object HadoopFsRelation extends Logging {
     
       /** Checks if we should filter out this path name. */
       def shouldFilterOut(pathName: String): Boolean = {
    -    // TODO: We should try to filter out all files/dirs starting with "." or "_".
    -    // The only reason that we are not doing it now is that Parquet needs to find those
    -    // metadata files from leaf files returned by this methods. We should refactor
    -    // this logic to not mix metadata files with data files.
    -    pathName == "_SUCCESS" || pathName == "_temporary" || pathName.startsWith(".")
    +    // We filter everything that starts with _ and ., except _common_metadata and _metadata
    +    // because Parquet needs to find those metadata files from leaf files returned by this method.
    +    // We should refactor this logic to not mix metadata files with data files.
    +    (pathName.startsWith("_") || pathName.startsWith(".")) &&
    +      !pathName.startsWith("_common_metadata") && !pathName.startsWith("_metadata")
    --- End diff --
    
    just in case we do other variants here ..
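
The trade-off in this exchange can be sketched as follows. This is a minimal, self-contained illustration rather than Spark's actual source file: `shouldFilterOut` copies the predicate from the diff above, while `shouldFilterOutExact` and the file name `_metadata_v2` are hypothetical, introduced only to show what an `==` comparison would change.

```scala
object PathFilterSketch {
  // Predicate as written in the diff: prefix match on the two Parquet
  // metadata names, so any name beginning with them is kept.
  def shouldFilterOut(pathName: String): Boolean =
    (pathName.startsWith("_") || pathName.startsWith(".")) &&
      !pathName.startsWith("_common_metadata") && !pathName.startsWith("_metadata")

  // Hypothetical alternative using exact comparison, as the review
  // question suggests; only the two literal names are kept.
  def shouldFilterOutExact(pathName: String): Boolean =
    (pathName.startsWith("_") || pathName.startsWith(".")) &&
      pathName != "_common_metadata" && pathName != "_metadata"

  def main(args: Array[String]): Unit = {
    // "_metadata_v2" is an invented variant name used for illustration.
    val names = Seq("part-00000", "_SUCCESS", ".hidden", "_metadata", "_metadata_v2")
    names.foreach { n =>
      println(s"$n prefix=${shouldFilterOut(n)} exact=${shouldFilterOutExact(n)}")
    }
  }
}
```

The two predicates agree on ordinary data files, `_SUCCESS`, hidden files, and the two literal metadata names. They differ only on names that extend a metadata name: the prefix match keeps a file such as `_metadata_v2` visible, while the exact match would filter it out.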





[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220700872
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220702769
  
    **[Test build #59018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59018/consoleFull)** for PR 13227 at commit [`0d3bc7d`](https://github.com/apache/spark/commit/0d3bc7dd18a92ed73b6adfa4aabf7c56587e35df).




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220737708
  
    **[Test build #59018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59018/consoleFull)** for PR 13227 at commit [`0d3bc7d`](https://github.com/apache/spark/commit/0d3bc7dd18a92ed73b6adfa4aabf7c56587e35df).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220700876
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59017/
    Test FAILed.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220722524
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59020/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220722522
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220700577
  
    **[Test build #59017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59017/consoleFull)** for PR 13227 at commit [`447fe4e`](https://github.com/apache/spark/commit/447fe4edca1962277124d00f8cc40dcfc5096abe).




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13227




[GitHub] spark pull request: [SPARK-15454][SQL] Filter out files starting w...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13227#issuecomment-220726644
  
    Merging in master/2.0. Thanks.


