
Posted to reviews@spark.apache.org by wuzhilon <gi...@git.apache.org> on 2017/07/12 07:37:13 UTC

[GitHub] spark pull request #18609: Spark SQL merge small files to big files Update I...

GitHub user wuzhilon opened a pull request:

    https://github.com/apache/spark/pull/18609

    Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

    Merge small Hive files into large files; supports the ORC and text table storage formats
    
    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wuzhilon/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18609.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18609
    
----
commit ba63a06c1b54df903046146a69ca6d1a1acb5bef
Author: wuzhilon <35...@qq.com>
Date:   2017-06-30T06:30:07Z

    Update InsertIntoHiveTable.scala
    
    Merge small Hive files into large files; supports the ORC and text table storage formats

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18609: Spark SQL merge small files to big files Update InsertIn...

Posted by AmplabJenkins <gi...@git.apache.org>.

GitHub user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18609
  
    Can one of the admins verify this patch?



[GitHub] spark pull request #18609: Spark SQL merge small files to big files Update I...

Posted by asfgit <gi...@git.apache.org>.

GitHub user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18609



[GitHub] spark issue #18609: Spark SQL merge small files to big files Update InsertIn...

Posted by gatorsmile <gi...@git.apache.org>.

GitHub user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18609
  
    Can you just repartition your data before writing to the file?
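
A minimal sketch of the repartition-before-write suggestion, assuming Spark SQL is on the classpath and a Hive-enabled SparkSession is available. The table names `source_db.events` and `target_db.events_compacted` and the partition count of 8 are hypothetical placeholders, and the target table is assumed to exist already. This is not the code from the pull request.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object RepartitionBeforeWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-before-write")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical value: the number of output partitions to write.
    val numOutputFiles = 8

    // Hypothetical source table.
    val source = spark.table("source_db.events")

    // Shuffle the rows into numOutputFiles partitions, then write them into
    // the existing (hypothetical) target table. Each write task handles one
    // partition, so fewer partitions means fewer, larger files.
    source
      .repartition(numOutputFiles)
      .write
      .mode(SaveMode.Overwrite)
      .insertInto("target_db.events_compacted")

    spark.stop()
  }
}
```

`repartition` triggers a full shuffle. `coalesce` avoids the shuffle when only reducing the partition count, but it can also reduce the parallelism of the upstream stage.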



[GitHub] spark issue #18609: Spark SQL merge small files to big files Update InsertIn...

Posted by wuzhilon <gi...@git.apache.org>.

GitHub user wuzhilon commented on the issue:

    https://github.com/apache/spark/pull/18609
  
    I am trying to. The problem is that I cannot determine the size of the data in MB; I can only get the number of records and then repartition coarsely based on that count.
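
One way to turn a record count into a partition count is to multiply it by an assumed average record size and divide by a target file size. The sketch below is only an illustration of that arithmetic: the names `partitionsForSize` and `partitionsForRows` are hypothetical, the average record size has to be measured or guessed by the caller, and the size on disk (especially for compressed ORC) can differ considerably from such an estimate.

```scala
object PartitionEstimate {
  /** Number of partitions so that each output file holds roughly targetFileBytes. */
  def partitionsForSize(totalBytes: Long, targetFileBytes: Long): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetFileBytes > 0, "targetFileBytes must be positive")
    // Ceiling division without risking overflow on the addition.
    val full = totalBytes / targetFileBytes
    val n = if (totalBytes % targetFileBytes == 0) full else full + 1
    // Always write at least one partition; clamp to the Int range.
    math.max(1L, math.min(n, Int.MaxValue.toLong)).toInt
  }

  /** Fallback when only a record count is known: scale by an assumed average record size. */
  def partitionsForRows(rowCount: Long, avgRowBytes: Long, targetFileBytes: Long): Int = {
    require(rowCount >= 0, "rowCount must be non-negative")
    require(avgRowBytes > 0, "avgRowBytes must be positive")
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    partitionsForSize(java.lang.Math.multiplyExact(rowCount, avgRowBytes), targetFileBytes)
  }
}
```

The result could then be passed to `repartition` before the write. The estimate is only as good as the assumed average record size.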



[GitHub] spark issue #18609: Spark SQL merge small files to big files Update InsertIn...

Posted by HyukjinKwon <gi...@git.apache.org>.

GitHub user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18609
  
    @wuzhilon, could you explain why it is problematic if we just repartition? I didn't understand 
    
    > The problem is that I cannot determine the size of the data in MB; I can only get the number of records and then repartition coarsely based on that count.
    
    I think the approach here is poorly implemented and we should close this PR. At the very least, it appears to list files, which is quite costly on S3.

