You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2016/03/28 18:04:18 UTC

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/11999

    [SPARK-14201][SQL] handle empty file for new data source strategy

    ## What changes were proposed in this pull request?
    
    It's possible that users will write empty files to data dir. For example, write data to JSON data source with one line string, and default parallelism is 2, we will write an empty file and a file with all data. The new data source should be able to handle it.
    
    
    ## How was this patch tested?
    
    new tests in `FileSourceStrategySuite`


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark empty-file

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11999.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11999
    
----
commit 3ab1a187a2878ca0bfcb2b91812ba064aeeda494
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-03-28T15:40:41Z

    handle empty file for new data source strategy

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57605486
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    Yeah, that assert is my fault (from when I though my bin packing algorithm had a bug).  We should support reading empty files (if thats valid for the format) and we should avoid writing them when possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202670288
  
    **[Test build #54389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54389/consoleFull)** for PR 11999 at commit [`630e3e9`](https://github.com/apache/spark/commit/630e3e908d6a2096088b821b1633eff6054c1311).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202509442
  
    **[Test build #54324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54324/consoleFull)** for PR 11999 at commit [`3ab1a18`](https://github.com/apache/spark/commit/3ab1a187a2878ca0bfcb2b91812ba064aeeda494).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202740432
  
    The fix is so simple, and we already have tests for empty files, so I'm going to close this PR and the issue is fixed in https://github.com/apache/spark/pull/11960


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57597120
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    I would say that we should skip writing out empty files.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202670967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202670971
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54389/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57600937
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    Why not filter out the empty files there?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57601423
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    Actually there is an existing [CSV test case][1] that cover empty CSV input files. I have to [remove the empty file assertion][2] in `FileSourceStrategy` for this test case when implementing `buildReader()` for CSV.
    
    [1]: https://github.com/apache/spark/blob/e5a1b301fbe191f1a9627a1083d960c98f543d13/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L248-L256
    [2]: https://github.com/apache/spark/pull/12002#discussion_r57594679


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57603199
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    It's correct to be here, just confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57599100
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    Even we can guarantee that the writer in Spark will not create empty file, but other database (Hive) may do that. We should still handle that in Spark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57600051
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    The new FileSourceStratyge expects non-empty files [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L141)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202509972
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202461010
  
    cc @marmbrus @liancheng @yhuai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202462960
  
    **[Test build #54324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54324/consoleFull)** for PR 11999 at commit [`3ab1a18`](https://github.com/apache/spark/commit/3ab1a187a2878ca0bfcb2b91812ba064aeeda494).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202509973
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54324/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57601790
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    I wanna filter out them earlier, so I choose here. Is empty file valid in some cases?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57595999
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    Where is the place that does not handle empty file properly? Does that also handle file with no records (a csv file has only header line) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan closed the pull request at:

    https://github.com/apache/spark/pull/11999


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11999#discussion_r57597049
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
    @@ -549,15 +549,18 @@ class HDFSFileCatalog(
     
       refresh()
     
    +  private def isInvalidFile(f: FileStatus): Boolean =
    +    f.getPath.getName.startsWith("_") || f.getLen == 0
    --- End diff --
    
    Yeah, its not clear to me that we want to skip 0 sized files.  They really shouldn't be there for valid datasets.  What is the motivation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14201][SQL] handle empty file for new d...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11999#issuecomment-202646954
  
    **[Test build #54389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54389/consoleFull)** for PR 11999 at commit [`630e3e9`](https://github.com/apache/spark/commit/630e3e908d6a2096088b821b1633eff6054c1311).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org