Posted to reviews@spark.apache.org by navis <gi...@git.apache.org> on 2016/01/04 13:42:54 UTC

[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

GitHub user navis opened a pull request:

    https://github.com/apache/spark/pull/10572

    SPARK-12619 Combine small files in a hadoop directory into single split

    When a directory contains many small files, the whole Spark cluster can be exhausted scheduling the tasks created for each file. A custom input format can handle that, but it is hardly an option if you're using the Hive metastore.
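
As a rough illustration of the idea in the description (not the `SimpleCombiner` code from this PR, which is not shown in this thread), the sketch below groups small files into combined splits so that one task handles several files instead of one. The function name `combine_small_files`, the `target_split_bytes` parameter, and the example paths are hypothetical.

```python
from typing import List, Tuple

FileInfo = Tuple[str, int]  # (path, size in bytes)


def combine_small_files(files: List[FileInfo],
                        target_split_bytes: int) -> List[List[FileInfo]]:
    """Pack files into splits of at most target_split_bytes each.

    A file larger than the target ends up in a split of its own.
    """
    if target_split_bytes <= 0:
        raise ValueError("target_split_bytes must be positive")

    splits: List[List[FileInfo]] = []
    current: List[FileInfo] = []
    current_bytes = 0

    # Visit the largest files first; start a new split whenever the
    # next file would overflow the current one.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_bytes + size > target_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append((path, size))
        current_bytes += size

    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    # 1,000 files of 1 MiB each: one task per file would mean 1,000 tasks.
    mib = 1024 * 1024
    files = [(f"/warehouse/t/part-{i:05d}", mib) for i in range(1000)]
    splits = combine_small_files(files, target_split_bytes=128 * mib)
    print(len(files), "files ->", len(splits), "splits")
```

The number of scheduled tasks then tracks the total data volume rather than the file count, which is the effect the description asks for.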

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/navis/spark SPARK-12619

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10572.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10572
    
----
commit 055f6135aaa73ab1ff12fa33a77d9a776063336f
Author: navis.ryu <na...@apache.org>
Date:   2016-01-04T12:42:10Z

    SPARK-12619 Combine small files in a hadoop directory into single split

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-168687353
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48661/



[GitHub] spark issue #10572: SPARK-12619 Combine small files in a hadoop directory in...

Posted by jinxing64 <gi...@git.apache.org>.

Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/10572
  
    @HyukjinKwon 
    To merge small files, should I tune `spark.sql.files.maxPartitionBytes`? IIUC it only applies to `FileSourceScanExec`, so it has no effect when I select from a Hive table.
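
For context on the setting mentioned above: as far as I understand Spark's file-based sources, `spark.sql.files.maxPartitionBytes` acts as an upper bound on a target split size, and each file is charged an extra `spark.sql.files.openCostInBytes` when files are packed together. The sketch below is a simplified model of that calculation written from memory, not Spark's code. The function name and the `parallelism` argument are hypothetical, and the defaults (128 MiB and 4 MiB) are assumed to match Spark's documented defaults. It models only the file-source path, so it says nothing about Hive table scans, which is the limitation this comment raises.

```python
from typing import Iterable

MIB = 1024 * 1024


def max_split_bytes(file_sizes: Iterable[int],
                    parallelism: int,
                    max_partition_bytes: int = 128 * MIB,
                    open_cost_in_bytes: int = 4 * MIB) -> int:
    """Simplified model of the target split size for a file-source scan."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    # Every file is charged a fixed open cost on top of its length.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    # Never exceed the configured maximum, never go below the open cost.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


if __name__ == "__main__":
    sizes = [1 * MIB] * 1000  # 1,000 small files
    for limit in (32 * MIB, 128 * MIB):
        target = max_split_bytes(sizes, parallelism=8,
                                 max_partition_bytes=limit)
        print(f"maxPartitionBytes={limit // MIB} MiB "
              f"-> target split {target // MIB} MiB")
```

Under this model, raising the limit lets more small files share one partition, while the per-file open cost keeps a partition from collecting an unbounded number of tiny files.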



[GitHub] spark issue #10572: SPARK-12619 Combine small files in a hadoop directory in...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/10572
  
    Is that https://github.com/apache/spark/pull/12095?



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-168670604
  
    **[Test build #48661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48661/consoleFull)** for PR 10572 at commit [`055f613`](https://github.com/apache/spark/commit/055f6135aaa73ab1ff12fa33a77d9a776063336f).



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-169216149
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #10572: SPARK-12619 Combine small files in a hadoop direc...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10572



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-169216150
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48804/



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-168687208
  
    **[Test build #48661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48661/consoleFull)** for PR 10572 at commit [`055f613`](https://github.com/apache/spark/commit/055f6135aaa73ab1ff12fa33a77d9a776063336f).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class SimpleCombiner<K, V> implements InputFormat<K, V> `
      * `  public static class InputSplits implements InputSplit, Configurable `



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-212184486
  
    We may need to correct the title to match the others, `[SPARK-XXXX][SQL]` (as described in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-169216072
  
    **[Test build #48804 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48804/consoleFull)** for PR 10572 at commit [`e056332`](https://github.com/apache/spark/commit/e056332a0101a0ede92b28a76c7379d79e88a9e5).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class SimpleCombiner<K, V> implements InputFormat<K, V> `
      * `  public static class InputSplits implements InputSplit, Configurable `



[GitHub] spark issue #10572: SPARK-12619 Combine small files in a hadoop directory in...

Posted by cerisier <gi...@git.apache.org>.

Github user cerisier commented on the issue:

    https://github.com/apache/spark/pull/10572
  
    @davies do you have the commit that fixes this in 2.0?



[GitHub] spark issue #10572: SPARK-12619 Combine small files in a hadoop directory in...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/10572
  
    @jinxing64 Yup, but what I meant was that most cases within Spark data sources are covered by it. Empty files can be skipped with `spark.hadoopRDD.ignoreEmptySplits`, and most other cases can probably be handled by controlling the `InputFormat`.



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-169183501
  
    **[Test build #48804 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48804/consoleFull)** for PR 10572 at commit [`e056332`](https://github.com/apache/spark/commit/e056332a0101a0ede92b28a76c7379d79e88a9e5).



[GitHub] spark pull request: SPARK-12619 Combine small files in a hadoop di...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10572#issuecomment-168687351
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #10572: SPARK-12619 Combine small files in a hadoop directory in...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the issue:

    https://github.com/apache/spark/pull/10572
  
    This is fixed in 2.0, could you close this PR?

