Posted to reviews@spark.apache.org by yanakad <gi...@git.apache.org> on 2015/12/18 14:26:53 UTC

[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

GitHub user yanakad opened a pull request:

    https://github.com/apache/spark/pull/10379

    [SPARK-12369][SQL]DataFrameReader fails on globbing parquet paths

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanakad/spark SPARK-12369

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10379
    
----
commit a79c0f277e301a125ac347b8354247757e5710ad
Author: Yana Kadiyska <yk...@akamai.com>
Date:   2015-12-17T17:05:05Z

    Bugfix and test

commit 25c0f41686eb3e9c646321bff6874efa49a46ed3
Author: Yana <ya...@yahoo.com>
Date:   2015-12-18T13:21:39Z

    Scalastyle fixes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-165778408
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by yanakad <gi...@git.apache.org>.

Github user yanakad commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-166020526
  
    @liancheng Would logging the failed paths at WARN or ERROR level be an acceptable compromise? I am not sure whether you're saying that the fix is not good enough or disagreeing that there is an issue.
    I think the original behavior *is* a problem. If you have paths like /root/account=number/date='yyyy-mo'/..., create a DF at the root level, and execute 'select * where account=nonexistent', you get an empty data frame. If you execute a query with 'where date in (mo1, mo2, mo3)' and there is no mo3 partition, you still get data for mo1 and mo2. On the other hand, if you try to create a DF at /root/account=nonexistent, you get an exception. I have a very heavily partitioned space, which is why I create DataFrames at as low a level as possible, and I run into this problem whenever a partition path is missing.



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-166088383
  
    @yanakad Thanks for your explanation, now I understand your use case. I agree that the current behavior is somewhat inconvenient for this use case. But I still tend to say this shouldn't be treated as an issue, because:
    
    1. At the application level, this can be worked around by first globbing the lowest-level directories and then passing the resulting path(s) to the `DataFrameReader.parquet()` method.
    2. The changes made in this PR have a negative impact on the public API:
    
       - As mentioned above, the behavior becomes more error-prone and dangerous.
       - The behavior becomes inconsistent with other data sources. For example, ORC, JSON, and JDBC all throw an exception when the input path/JDBC URL is invalid or doesn't exist.
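
The application-level workaround in item 1 might look roughly like the sketch below. It is only an illustration, not code from the PR: it assumes the data lives on a local filesystem (so Python's standard `glob` module can stand in for whatever globbing the storage layer offers), and the helper name `resolve_existing_paths`, the `demo` function, and the directory layout are all hypothetical. Patterns that match nothing are logged at WARN level instead of being dropped silently, which keeps misspelled paths visible.

```python
import glob
import logging
import os
import tempfile

log = logging.getLogger("partition_paths")


def resolve_existing_paths(patterns):
    """Expand each glob pattern and return the matching paths.

    Hypothetical helper: patterns that match nothing are logged at WARN
    level rather than silently ignored, so a typo in a path is noticed.
    """
    resolved = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        if not matches:
            log.warning("No paths match %s; skipping", pattern)
        resolved.extend(matches)
    return resolved


def demo():
    """Create a small partitioned layout and resolve good and missing paths."""
    root = tempfile.mkdtemp()
    for month in ("2015-10", "2015-11"):
        os.makedirs(os.path.join(root, "account=42", "date=" + month))
    patterns = [
        os.path.join(root, "account=42", "date=2015-1*"),
        os.path.join(root, "account=42", "date=2015-12"),  # missing partition
    ]
    paths = resolve_existing_paths(patterns)
    # Return paths relative to the temporary root for readability.
    return [os.path.relpath(p, root) for p in paths]
```

The resolved list can then be handed to the reader as varargs, for example `sqlContext.read.parquet(*paths)` in PySpark. The caller should still decide what to do when the list comes back empty, since in that case there is nothing to load.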



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-165959677
  
    @yanakad Thanks for your contribution! However, I'd argue that building a partial DataFrame can be error-prone and dangerous since nonexistent paths are silently ignored without any error/warning. For example, there might be trivial spelling errors in one of the paths, but the user may still think that all the data are loaded correctly without any problem.



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-165959742
  
    Also, the PR description is ambiguous. "DataFrameReader fails on globbing parquet paths that contain nonexistent path(s)" might be more accurate.



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-165799624
  
    **[Test build #2233 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2233/consoleFull)** for PR 10379 at commit [`25c0f41`](https://github.com/apache/spark/commit/25c0f41686eb3e9c646321bff6874efa49a46ed3).



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10379



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by yanakad <gi...@git.apache.org>.

Github user yanakad commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-165789809
  
    @liancheng I think you added this code originally



[GitHub] spark pull request: [SPARK-12369][SQL]DataFrameReader fails on glo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10379#issuecomment-165844746
  
    **[Test build #2233 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2233/consoleFull)** for PR 10379 at commit [`25c0f41`](https://github.com/apache/spark/commit/25c0f41686eb3e9c646321bff6874efa49a46ed3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

