Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/12/26 04:42:52 UTC

[GitHub] spark pull request #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scal...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20081

    [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examples

    ## What changes were proposed in this pull request?
    Some improvements:
    1. Point out that we are using both Spark SQL native syntax and HQL syntax in the example.
    2. Avoid using the same name for a table and a temp view, so as not to confuse users.
    3. Create the external Hive table with a directory that already has data, which is a more common use case.
    4. Remove the usage of `spark.sql.parquet.writeLegacyFormat`. This config was introduced by https://github.com/apache/spark/pull/8566 and has nothing to do with Hive.
    5. Remove the `repartition` and `coalesce` examples. These two are not Hive-specific, so we should put them in a different example file. BTW they can't accurately control the number of output files, because `spark.sql.files.maxRecordsPerFile` also affects it.
    
    ## How was this patch tested?
    
    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20081.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20081
    
----
commit 10a80b272e898043e250c2b24a792c9474cf0d10
Author: Wenchen Fan <we...@...>
Date:   2017-12-26T04:30:10Z

    clean up

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    > spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, hive external table won't be able to access parquet data.
    
    Well, that's really an undocumented feature... Can you submit a PR to update the description of `SQLConf.PARQUET_WRITE_LEGACY_FORMAT` and add a test?
    
    > repartition and coalesce is most common use case in Industry to control N Number of files under directory while doing partitioning data.
    
    Yeah, I know, but that's not accurate. It assumes each task outputs exactly one file, which is not true if `spark.sql.files.maxRecordsPerFile` is set to a small number. Anyway, this is not a Hive feature, so we should probably put it in the `SQL Programming Guide`.
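
To make the point about file counts concrete, here is a rough back-of-the-envelope model in Python (not Spark code). It assumes a non-partitioned output, that every write task holding rows starts one file, and that a task rolls over to a new file every `max_records_per_file` rows when that limit is positive. The function name and the row counts are hypothetical, chosen only for illustration.

```python
import math


def estimate_output_files(rows_per_task, max_records_per_file=0):
    """Estimate the files written by one job under a simplified model.

    rows_per_task: rows handled by each write task (one entry per task).
    max_records_per_file: stand-in for spark.sql.files.maxRecordsPerFile;
        zero or a negative value means no per-file limit.
    """
    total = 0
    for rows in rows_per_task:
        if rows == 0:
            # Assumption: a task with no rows contributes no file.
            continue
        if max_records_per_file > 0:
            # The task starts a new file every max_records_per_file rows.
            total += math.ceil(rows / max_records_per_file)
        else:
            total += 1
    return total


# Hypothetical job: coalesce(4) left 4 tasks with 1,000,000 rows each.
tasks = [1_000_000] * 4
print(estimate_output_files(tasks))           # 4: one file per task
print(estimate_output_files(tasks, 300_000))  # 16: four files per task
```

Under this model, `coalesce(4)` yields 4 files only while the record limit is unset or large enough. Once the limit is smaller than a task's row count, the file count grows with the data volume instead.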


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    **[Test build #85392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85392/testReport)** for PR 20081 at commit [`10a80b2`](https://github.com/apache/spark/commit/10a80b272e898043e250c2b24a792c9474cf0d10).


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    **[Test build #85392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85392/testReport)** for PR 20081 at commit [`10a80b2`](https://github.com/apache/spark/commit/10a80b272e898043e250c2b24a792c9474cf0d10).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by chetkhatri <gi...@git.apache.org>.

Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    @cloud-fan `spark.sql.files.maxRecordsPerFile` didn't work out when I was working with my 30 TB Spark Hive workload, whereas `repartition` and `coalesce` made sense.


---


[GitHub] spark pull request #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scal...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20081


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85392/


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    @chetkhatri @srowen @gatorsmile


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    FYI, there is a JIRA about documenting `spark.sql.parquet.writeLegacyFormat`: https://issues-test.apache.org/jira/plugins/servlet/mobile#issue/SPARK-20937


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by chetkhatri <gi...@git.apache.org>.

Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    @cloud-fan @srowen I am good with the proposed changes. Please do merge.


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    Thanks! Merged to master.


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...

Posted by chetkhatri <gi...@git.apache.org>.

Github user chetkhatri commented on the issue:

    https://github.com/apache/spark/pull/20081
  
    @cloud-fan Thanks for the PR.
    4. `spark.sql.parquet.writeLegacyFormat`: if you don't use this configuration, a Hive external table won't be able to access the Parquet data.
    5. `repartition` and `coalesce` are the most common way in industry to control the number of files under a directory when partitioning data.
    That is, if the data volume is very large, every partition would have many small files, which may harm
        downstream query performance due to file, bandwidth, network, and disk I/O.
    Otherwise I am good with your approach.
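
The small-files concern in item 5 can be illustrated with a toy model in Python (not Spark code). It assumes a partitioned write in which each task emits one file for every distinct partition value it holds, so a partition directory can receive up to one file per task. The function name, the dates, and the task layout are hypothetical.

```python
from collections import defaultdict


def files_per_directory(task_rows):
    """Count files per partition directory under a simplified model.

    task_rows: list of tasks, each a list of partition-column values,
        one value per row. Assumption: a task writes one file for every
        distinct partition value among its rows.
    """
    counts = defaultdict(int)
    for rows in task_rows:
        for value in set(rows):
            counts[value] += 1
    return dict(counts)


# Hypothetical data: 3 tasks, each holding rows for both dates.
scattered = [["2017-12-25", "2017-12-26"]] * 3
# The same rows after grouping by the partition column: one task per date.
grouped = [["2017-12-25"] * 3, ["2017-12-26"] * 3]

print(files_per_directory(scattered))  # 3 files in each directory
print(files_per_directory(grouped))    # 1 file in each directory
```

In this model, grouping rows by the partition column before the write is what shrinks the per-directory file count. The model ignores any per-file record limit, which can raise the count again.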


---
